PATCH: logical_work_mem and logical streaming of large in-progress transactions

Started by Tomas Vondra, about 8 years ago. 563 messages.
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com
6 attachment(s)

Hi all,

Attached is a patch series that adds two features to logical
replication: the ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and the ability to
stream large in-progress transactions (exceeding the memory limit).

I'm submitting those two changes together, because one builds on the
other, and it's beneficial to discuss them together.

PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

* The value is hard-coded, so it's not possible to customize it.

* The amount of decoded changes to keep in memory is restricted by the
number of changes. It's not very clear how this relates to memory
consumption, as the change size depends on table structure, etc.

* The number is "per (sub)transaction", so a transaction with many
subtransactions may easily consume a significant amount of memory without
actually hitting the limit.

So the patch does two things. Firstly, it introduces logical_work_mem, a
GUC restricting memory consumed by all transactions currently kept in
the reorder buffer.

Secondly, it adds simple memory accounting, tracking the amount of
memory used in total (for the whole reorder buffer, to compare against
logical_work_mem) and per transaction (so that we can quickly pick a
transaction to spill to disk).
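
To make the accounting concrete, here is a minimal stand-alone sketch of
the scheme described above - all structure and function names are made up
for illustration, not taken from the patch:

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the reorder buffer structures. */
typedef struct Txn
{
    size_t      size;           /* memory used by this transaction */
} Txn;

typedef struct Buffer
{
    size_t      size;           /* total memory used by all transactions */
    size_t      limit;          /* logical_work_mem, in bytes */
} Buffer;

/* Update both counters whenever a change is added to or removed from a txn. */
static void
update_memory_accounting(Buffer *rb, Txn *txn, size_t change_size, bool add)
{
    if (add)
    {
        txn->size += change_size;
        rb->size += change_size;
    }
    else
    {
        txn->size -= change_size;
        rb->size -= change_size;
    }
}

Enforcement then reduces to checking (rb->size > rb->limit) after each
addition, and spilling the largest transaction until the total drops
below the limit again.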

The one wrinkle in the patch is that the memory limit can't be enforced
when reading changes spilled to disk - with multiple subtransactions, we
can't easily predict how many changes to pre-read for each of them. At
that point we still use the existing max_changes_in_memory limit.

Luckily, changes introduced in the other parts of the patch should allow
addressing this deficiency.

PART 2: streaming of large in-progress transactions (0002-0006)
---------------------------------------------------------------

Note: This part is split into multiple smaller chunks, addressing
different parts of the logical decoding infrastructure. That's mostly to
allow easier reviews, though. Ultimately, it's just one patch.

Processing large transactions often results in significant apply lag,
for a couple of reasons. One reason is network bandwidth - while we do
decode the changes incrementally (as we read the WAL), we keep them
locally, either in memory, or spilled to files. Then at commit time, all
the changes get sent to the downstream (and applied) at the same time.
For large transactions the time to do the network transfer may be
significant, causing apply lag.

This patch extends the logical replication infrastructure (output plugin
API, reorder buffer, pgoutput, replication protocol etc.) to allow
streaming of in-progress transactions instead of spilling them to local
files.

The extensions to the API are pretty straightforward. Aside from adding
methods to stream changes/messages and commit a streamed transaction,
the API needs a function to abort a streamed (sub)transaction, and
functions to demarcate a block of streamed changes.
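
Spelled out as a callback struct, the API additions might look roughly
like this (a sketch based purely on the description above, with
placeholder types; the actual signatures in the patch may differ):

#include <stddef.h>

/* Placeholder types standing in for the real decoding structures. */
typedef struct LogicalDecodingContext LogicalDecodingContext;
typedef struct ReorderBufferTXN ReorderBufferTXN;
typedef struct ReorderBufferChange ReorderBufferChange;
typedef struct RelationData *Relation;
typedef unsigned long XLogRecPtr;

typedef struct OutputPluginStreamCallbacks
{
    /* demarcate a block of streamed changes */
    void        (*stream_start) (LogicalDecodingContext *ctx,
                                 ReorderBufferTXN *txn);
    void        (*stream_stop) (LogicalDecodingContext *ctx,
                                ReorderBufferTXN *txn);

    /* stream a change / a logical message */
    void        (*stream_change) (LogicalDecodingContext *ctx,
                                  ReorderBufferTXN *txn, Relation rel,
                                  ReorderBufferChange *change);
    void        (*stream_message) (LogicalDecodingContext *ctx,
                                   ReorderBufferTXN *txn, XLogRecPtr lsn,
                                   const char *prefix, size_t sz,
                                   const char *message);

    /* abort a streamed (sub)transaction, commit a streamed transaction */
    void        (*stream_abort) (LogicalDecodingContext *ctx,
                                 ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
    void        (*stream_commit) (LogicalDecodingContext *ctx,
                                  ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
} OutputPluginStreamCallbacks;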

To decode a transaction, we need to know all its subtransactions, and
its invalidations. Currently, those are only known at commit time
(although some assignments may be known earlier), and invalidations are
only ever written in the commit record.

So far that was fine, because we only decode/replay transactions at
commit time, when all of this is known (because it's either in commit
record, or written before it).

But for in-progress transactions (i.e. the subject of interest here),
that is not the case. So the patch modifies WAL-logging to ensure those
two bits of information are written immediately (for wal_level=logical).

For assignments that was fairly simple, thanks to existing caching. For
invalidations, it requires a new WAL record type and a couple of changes
in inval.c.
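
To sketch the shape of that new record type, using the standard
XLogInsert machinery (the record id XLOG_INVALIDATIONS and the function
name are assumptions for illustration, not necessarily what the patch
does):

#include "postgres.h"
#include "access/xloginsert.h"
#include "storage/sinval.h"

/* Assumed info value for the new record type (illustrative only). */
#define XLOG_INVALIDATIONS  0x70

/*
 * Write a group of invalidation messages to WAL immediately, instead of
 * waiting for the commit record, so that logical decoding can see them
 * while the transaction is still in progress. The message count can be
 * recovered from the record length.
 */
static void
LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs)
{
    XLogBeginInsert();
    XLogRegisterData((char *) msgs,
                     nmsgs * sizeof(SharedInvalidationMessage));
    XLogInsert(RM_XACT_ID, XLOG_INVALIDATIONS);
}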

On the apply side, we simply receive the streamed changes and write them
into a file (one file per toplevel transaction, which is possible thanks
to the assignments being known immediately). Then at commit time the
changes are replayed locally, without having to copy a large chunk of
data over the network.
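
In spirit, the apply side is plain spool-and-replay. A stand-alone
illustration in generic C (not the actual apply worker code; the record
framing is invented):

#include <stdio.h>
#include <stdlib.h>

/* Append one streamed change (opaque bytes) to the transaction's file. */
static void
spool_change(FILE *spool, const void *change, size_t len)
{
    fwrite(&len, sizeof(len), 1, spool);
    fwrite(change, 1, len, spool);
}

/* At commit time, replay all spooled changes in the order received. */
static void
replay_spool(FILE *spool, void (*apply) (const void *change, size_t len))
{
    size_t      len;

    rewind(spool);
    while (fread(&len, sizeof(len), 1, spool) == 1)
    {
        char       *buf = malloc(len);

        if (buf == NULL || fread(buf, 1, len, spool) != len)
            abort();            /* out of memory or truncated spool file */
        apply(buf, len);
        free(buf);
    }
}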

WAL overhead
------------

Of course, these changes to WAL logging are not for free - logging
assignments individually (instead of multiple subtransactions at once)
means higher xlog record overhead. Similarly, (sub)transactions doing a
lot of DDL may result in a lot of invalidations written to WAL (again,
with full xlog record overhead per invalidation).

I've done a number of tests to measure the impact, and for extreme
corner cases the additional amount of WAL is about 40% in both cases.

By an "extreme corner case" I mean a workloads intentionally triggering
many assignments/invalidations, without doing a lot of meaningful work.

For assignments, imagine a single-row table (no indexes), and a
transaction like this one:

BEGIN;
UPDATE t SET v = v + 1;
SAVEPOINT s1;
UPDATE t SET v = v + 1;
SAVEPOINT s2;
UPDATE t SET v = v + 1;
SAVEPOINT s3;
...
UPDATE t SET v = v + 1;
SAVEPOINT s10;
UPDATE t SET v = v + 1;
COMMIT;

For invalidations, add a CREATE TEMPORARY TABLE to each subtransaction.

For more realistic workloads (large table with indexes, runs long enough
to generate FPIs, etc.) the overhead drops below 5%. Which is much more
acceptable, of course, although not perfect.

In both cases, there was pretty much no measurable impact on performance
(as measured by tps).

I do not think there's a way around this requirement (having assignments
and invalidations), if we want to decode in-progress transactions. But
perhaps it would be possible to do some sort of caching (say, at command
level), to reduce the xlog record overhead? Not sure.

All ideas are welcome, of course. In the worst case, I think we can add
a GUC enabling this additional logging - when disabled, streaming of
in-progress transactions would not be possible.

Simplifying ReorderBuffer
-------------------------

One interesting consequence of having assignments is that we could get
rid of the ReorderBuffer iterator, used to merge changes from subxacts.
The assignments allow us to keep changes for each toplevel transaction
in a single list, in LSN order, and just walk it. Abort can be performed
by remembering the position of the first change in each subxact, and just
discarding the tail.

This is what the apply worker does with the streamed changes and aborts.
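
Here is a stand-alone sketch of that "remember the first change, discard
the tail" idea (illustrative names only, with a singly-linked list for
brevity):

#include <stdlib.h>

typedef struct Change
{
    unsigned long lsn;          /* list is kept sorted by this */
    struct Change *next;
} Change;

typedef struct TxnChanges
{
    Change     *head;
    Change    **tail;           /* where the next change is appended */
} TxnChanges;

/* At subxact start, remember where its changes will begin. */
static Change **
subxact_start(TxnChanges *txn)
{
    return txn->tail;
}

/* On subxact abort, free everything from that point onwards. */
static void
subxact_abort(TxnChanges *txn, Change **first)
{
    Change     *c = *first;

    while (c != NULL)
    {
        Change     *next = c->next;

        free(c);
        c = next;
    }
    *first = NULL;
    txn->tail = first;
}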

It would also allow us to enforce the memory limit while restoring
transactions spilled to disk, because we would no longer have the
problem of restoring changes for many subtransactions at once.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-me.patch.gz
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical.patch.gz
0003-Issue-individual-invalidations-with-wal_level-logica.patch.gz
0004-Extend-the-output-plugin-API-with-stream-methods.patch.gz
0005-Implement-streaming-mode-in-ReorderBuffer.patch.gz
0006-Add-support-for-streaming-to-built-in-replication.patch.gz
#2 Erikjan Rijkers
er@xs4all.nl
In reply to: Tomas Vondra (#1)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2017-12-23 05:57, Tomas Vondra wrote:

Hi all,

Attached is a patch series that adds two features to logical
replication: the ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and the ability to
stream large in-progress transactions (exceeding the memory limit).

logical replication of 2 instances is OK but 3 and up fail with:

TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
"reorderbuffer.c", Line: 1773)

I can cobble up a script but I hope you have enough from the assertion
to see what's going wrong...

#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Erikjan Rijkers (#2)
6 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:

On 2017-12-23 05:57, Tomas Vondra wrote:

Hi all,

Attached is a patch series that adds two features to logical
replication: the ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and the ability to
stream large in-progress transactions (exceeding the memory limit).

logical replication of 2 instances is OK but 3 and up fail with:

TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
"reorderbuffer.c", Line: 1773)

I can cobble up a script but I hope you have enough from the assertion
to see what's going wrong...

The assertion says that the iterator produces changes in an order that does
not correlate with LSN. But I have a hard time understanding how that
could happen, particularly because according to the line number this
happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.

So instructions to reproduce the issue would be very helpful.

Attached is v2 of the patch series, fixing two bugs I discovered today.
I don't think either of them is related to your issue, though.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch.gz
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch.gz
0003-Issue-individual-invalidations-with-wal_level-log-v2.patch.gz
0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch.gz
0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch.gz
0006-Add-support-for-streaming-to-built-in-replication-v2.patch.gz
#4 Erik Rijkers
er@xs4all.nl
In reply to: Tomas Vondra (#3)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2017-12-23 21:06, Tomas Vondra wrote:

On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:

On 2017-12-23 05:57, Tomas Vondra wrote:

Hi all,

Attached is a patch series that adds two features to logical
replication: the ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and the ability to
stream large in-progress transactions (exceeding the memory limit).

logical replication of 2 instances is OK but 3 and up fail with:

TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
"reorderbuffer.c", Line: 1773)

I can cobble up a script but I hope you have enough from the assertion
to see what's going wrong...

The assertion says that the iterator produces changes in an order that
does not correlate with LSN. But I have a hard time understanding how
that could happen, particularly because according to the line number
this happens in ReorderBufferCommit(), i.e. the current (non-streaming)
case.

So instructions to reproduce the issue would be very helpful.

Using:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
0006-Add-support-for-streaming-to-built-in-replication-v2.patch

As you expected the problem is the same with these new patches.

I have now tested more, and seen that it does not always fail. I guess
that it fails here 3 times out of 4. But the laptop I'm using at the
moment is old and slow -- it may well be a factor, as we've seen
before [1].

Attached is the bash script that I put together. I tested with
NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
often. This same program run with HEAD never seems to fail (I tried a
few dozen times).

thanks,

Erik Rijkers

[1]: /messages/by-id/3897361c7010c4ac03f358173adbcd60@xs4all.nl

Attachments:

test.sh
#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Erik Rijkers (#4)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 12/23/2017 11:23 PM, Erik Rijkers wrote:

On 2017-12-23 21:06, Tomas Vondra wrote:

On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:

On 2017-12-23 05:57, Tomas Vondra wrote:

Hi all,

Attached is a patch series that adds two features to logical
replication: the ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and the ability to
stream large in-progress transactions (exceeding the memory limit).

logical replication of 2 instances is OK but 3 and up fail with:

TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
"reorderbuffer.c", Line: 1773)

I can cobble up a script but I hope you have enough from the assertion
to see what's going wrong...

The assertion says that the iterator produces changes in an order that does
not correlate with LSN. But I have a hard time understanding how that
could happen, particularly because according to the line number this
happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.

So instructions to reproduce the issue would be very helpful.

Using:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
0006-Add-support-for-streaming-to-built-in-replication-v2.patch

As you expected the problem is the same with these new patches.

I have now tested more, and seen that it does not always fail. I guess
that it fails here 3 times out of 4. But the laptop I'm using at the
moment is old and slow -- it may well be a factor, as we've seen
before [1].

Attached is the bash script that I put together. I tested with
NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
often. This same program run with HEAD never seems to fail (I tried a
few dozen times).

Thanks. Unfortunately I still can't reproduce the issue. I even tried
running it in valgrind, to see if there are some memory access issues
(which should also slow it down significantly).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6 Craig Ringer
craig@2ndquadrant.com
In reply to: Tomas Vondra (#1)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 23 December 2017 at 12:57, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Hi all,

Attached is a patch series that adds two features to logical
replication: the ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and the ability to
stream large in-progress transactions (exceeding the memory limit).

I'm submitting those two changes together, because one builds on the
other, and it's beneficial to discuss them together.

PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

* The value is hard-coded, so it's not possible to customize it.

* The amount of decoded changes to keep in memory is restricted by the
number of changes. It's not very clear how this relates to memory
consumption, as the change size depends on table structure, etc.

* The number is "per (sub)transaction", so a transaction with many
subtransactions may easily consume a significant amount of memory without
actually hitting the limit.

Also, even without subtransactions, we assemble a ReorderBufferTXN per
transaction. Since transactions usually occur concurrently, systems with
many concurrent txns can face lots of memory use.

We can't exclude tables that won't actually be replicated at the reorder
buffering phase either. So txns use memory whether or not they do anything
interesting as far as a given logical decoding session is concerned. Even
if we'll throw all the data away we must buffer and assemble it first so we
can make that decision.

Because logical decoding considers snapshots and cid increments even from
other DBs (at least when the txn makes catalog changes) the memory use can
get BIG too. I was recently working with a system that had accumulated 2GB
of snapshots ... on each slot. With 7 slots, one for each DB.

So there's lots of room for difficulty with unpredictable memory use.

So the patch does two things. Firstly, it introduces logical_work_mem, a
GUC restricting memory consumed by all transactions currently kept in
the reorder buffer.

Does this consider the (currently high, IIRC) overhead of tracking
serialized changes?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7 Erik Rijkers
er@xs4all.nl
In reply to: Tomas Vondra (#5)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

logical replication of 2 instances is OK but 3 and up fail with:

TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
"reorderbuffer.c", Line: 1773)

I can cobble up a script but I hope you have enough from the assertion
to see what's going wrong...

The assertion says that the iterator produces changes in an order that
does not correlate with LSN. But I have a hard time understanding how
that could happen, particularly because according to the line number
this happens in ReorderBufferCommit(), i.e. the current (non-streaming)
case.

So instructions to reproduce the issue would be very helpful.

Using:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
0006-Add-support-for-streaming-to-built-in-replication-v2.patch

As you expected the problem is the same with these new patches.

I have now tested more, and seen that it does not always fail. I guess
that it fails here 3 times out of 4. But the laptop I'm using at the
moment is old and slow -- it may well be a factor, as we've seen
before [1].

Attached is the bash script that I put together. I tested with
NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
often. This same program run with HEAD never seems to fail (I tried a
few dozen times).

Thanks. Unfortunately I still can't reproduce the issue. I even tried
running it in valgrind, to see if there are some memory access issues
(which should also slow it down significantly).

One wonders again if 2ndquadrant shouldn't invest in some old hardware
;)

Another Good Thing would be if there was a provision in the buildfarm to
test patches like these.

But I'm probably not the first one to suggest that; no doubt it'll be
possible someday. In the meantime I'll try to repeat this crash on
other machines (but that will be after the holidays).

Erik Rijkers

#8 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Craig Ringer (#6)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 12/24/2017 05:51 AM, Craig Ringer wrote:

On 23 December 2017 at 12:57, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi all,

Attached is a patch series that adds two features to logical
replication: the ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and the ability to
stream large in-progress transactions (exceeding the memory limit).

I'm submitting those two changes together, because one builds on the
other, and it's beneficial to discuss them together.

PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

* The value is hard-coded, so it's not possible to customize it.

* The amount of decoded changes to keep in memory is restricted by the
number of changes. It's not very clear how this relates to memory
consumption, as the change size depends on table structure, etc.

* The number is "per (sub)transaction", so a transaction with many
subtransactions may easily consume a significant amount of memory without
actually hitting the limit.

Also, even without subtransactions, we assemble a ReorderBufferTXN
per transaction. Since transactions usually occur concurrently,
systems with many concurrent txns can face lots of memory use.

I don't see how that could be a problem, considering the number of
toplevel transactions is rather limited (to max_connections or so).

We can't exclude tables that won't actually be replicated at the reorder
buffering phase either. So txns use memory whether or not they do
anything interesting as far as a given logical decoding session is
concerned. Even if we'll throw all the data away we must buffer and
assemble it first so we can make that decision.

Yep.

Because logical decoding considers snapshots and cid increments even
from other DBs (at least when the txn makes catalog changes) the memory
use can get BIG too. I was recently working with a system that had
accumulated 2GB of snapshots ... on each slot. With 7 slots, one for
each DB.

So there's lots of room for difficulty with unpredictable memory use.

Yep.

So the patch does two things. Firstly, it introduces logical_work_mem, a
GUC restricting memory consumed by all transactions currently kept in
the reorder buffer

Does this consider the (currently high, IIRC) overhead of tracking
serialized changes?
 

Consider in what sense?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Erik Rijkers (#7)
6 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 12/24/2017 10:00 AM, Erik Rijkers wrote:

logical replication of 2 instances is OK but 3 and up fail with:

TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
"reorderbuffer.c", Line: 1773)

I can cobble up a script but I hope you have enough from the assertion
to see what's going wrong...

The assertion says that the iterator produces changes in an order that
does not correlate with LSN. But I have a hard time understanding how
that could happen, particularly because according to the line number
this happens in ReorderBufferCommit(), i.e. the current (non-streaming)
case.

So instructions to reproduce the issue would be very helpful.

Using:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
0006-Add-support-for-streaming-to-built-in-replication-v2.patch

As you expected the problem is the same with these new patches.

I have now tested more, and seen that it does not always fail. I guess
that it fails here 3 times out of 4. But the laptop I'm using at the
moment is old and slow -- it may well be a factor, as we've seen
before [1].

Attached is the bash script that I put together. I tested with
NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
often. This same program run with HEAD never seems to fail (I tried a
few dozen times).

Thanks. Unfortunately I still can't reproduce the issue. I even tried
running it in valgrind, to see if there are some memory access issues
(which should also slow it down significantly).

One wonders again if 2ndquadrant shouldn't invest in some old hardware ;)

Well, I've done tests on various machines, including some really slow
ones, and I still haven't managed to reproduce the failures using your
script. So I don't think that would really help. But I have reproduced
it by using a custom stress test script.

Turns out the asserts are overly strict - instead of

Assert(prev_lsn < current_lsn);

it should have been

Assert(prev_lsn <= current_lsn);

because some XLOG records may contain multiple rows (e.g. MULTI_INSERT).

The attached v3 fixes this issue, and also a couple of other thinkos:

1) The AssertChangeLsnOrder assert check was somewhat broken.

2) We've been sending aborts for all subtransactions, even those not yet
streamed. So downstream got confused and fell over because of an assert.

3) The streamed transactions were written to /tmp, using filenames based
on the subscription OID and the XID of the toplevel transaction. That's
fine as long as there's just a single replica running - if there are
more, the filenames will clash, causing really strange failures. So the
files now go to base/pgsql_tmp, where regular temporary files are
written. I'm not claiming this is perfect; perhaps we need to invent
another location.

FWIW I believe the relation sync cache is somewhat broken by the
streaming. I thought resetting it would be good enough, but it's more
complicated (and trickier) than that. I'm aware of it, and I'll look
into that next - but probably not before 2018.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v3.patch.gz
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v3.patch.gz
0003-Issue-individual-invalidations-with-wal_level-log-v3.patch.gz
0004-Extend-the-output-plugin-API-with-stream-methods-v3.patch.gz
0005-Implement-streaming-mode-in-ReorderBuffer-v3.patch.gz
0006-Add-support-for-streaming-to-built-in-replication-v3.patch.gz
#10 Erik Rijkers
er@xs4all.nl
In reply to: Tomas Vondra (#9)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

That indeed fixed the problem: running that same pgbench test, I see no
crashes anymore (on any of 3 different machines, and with several
pgbench parameters).

Thank you,

Erik Rijkers

#11 Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Erik Rijkers (#10)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 25 December 2017 at 18:40, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The attached v3 fixes this issue, and also a couple of other thinkos

Thank you for the patch, it looks quite interesting. After a quick look
at it (mostly the first one so far, but I'm going to continue) I have a
few questions:

+ * XXX With many subtransactions this might be quite slow, because we'll have
+ * to walk through all of them. There are some options how we could improve
+ * that: (a) maintain some secondary structure with transactions sorted by
+ * amount of changes, (b) not looking for the entirely largest transaction,
+ * but e.g. for transaction using at least some fraction of the memory limit,
+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
+ * of the memory limit (e.g. 50%).

Do you want to address these possible alternatives somehow in this patch,
or do you want to leave them outside of it? Maybe it makes sense to apply
some combination of them, e.g. maintain a secondary structure with
relatively large transactions, and then start evicting them. If that's
somehow not enough, then start to evict multiple transactions at once
(option "c").
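
For context, the naive strategy the comment describes - walking all
transactions and picking the largest - is just a linear scan, as in this
illustrative sketch (not the patch code):

#include <stddef.h>

typedef struct Txn
{
    size_t      size;           /* memory used by this transaction */
    struct Txn *next;
} Txn;

/*
 * O(n) scan over all toplevel transactions; this is exactly the cost a
 * secondary structure sorted by size (option "a") would avoid.
 */
static Txn *
find_largest_txn(Txn *txns)
{
    Txn        *largest = NULL;

    for (Txn *t = txns; t != NULL; t = t->next)
        if (largest == NULL || t->size > largest->size)
            largest = t;
    return largest;
}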

+ /*
+  * We clamp manually-set values to at least 64kB. The maintenance_work_mem
+  * uses a higher minimum value (1MB), so this is OK.
+  */
+ if (*newval < 64)
+     *newval = 64;

I'm not sure what the recommended practice is here, but maybe it makes
sense to emit a warning when the value gets clamped to 64kB? Otherwise
the change can be unexpected.

#12 Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Tomas Vondra (#1)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 12/22/17 23:57, Tomas Vondra wrote:

PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

The documentation in this patch contains some references to later
features (streaming). Perhaps that could be separated so that the
patches can be applied independently.

I don't see the need to tie this setting to maintenance_work_mem.
maintenance_work_mem is often set to very large values, which could then
have undesirable side effects on this use.

Moreover, the name logical_work_mem makes it sound like it's a logical
version of work_mem. Maybe we could think of another name.

I think we need a way to report on how much memory is actually used, so
the setting can be tuned. Something analogous to log_temp_files perhaps.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Eisentraut (#12)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 01/02/2018 04:07 PM, Peter Eisentraut wrote:

On 12/22/17 23:57, Tomas Vondra wrote:

PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

The documentation in this patch contains some references to later
features (streaming). Perhaps that could be separated so that the
patches can be applied independently.

Yeah, that's probably a good idea. But now that you mention it, I wonder
if "streaming" is really a good term. We already use it for "streaming
replication" and it may be quite confusing to use it for another feature
(particularly when it's streaming within logical streaming replication).

But I can't really think of a better name ...

I don't see the need to tie this setting to maintenance_work_mem.
maintenance_work_mem is often set to very large values, which could
then have undesirable side effects on this use.

Well, we need to pick some default value, and we can either use a fixed
value (not sure what would be a good default) or tie it to an existing
GUC. We only really have work_mem and maintenance_work_mem, and the
walsender process will never use more than one such buffer. Which seems
to be closer to maintenance_work_mem.

Pretty much any default value can have undesirable side effects.

Moreover, the name logical_work_mem makes it sound like it's a logical
version of work_mem. Maybe we could think of another name.

I won't object to a better name, of course. Any proposals?

I think we need a way to report on how much memory is actually used,
so the setting can be tuned. Something analogous to log_temp_files
perhaps.

Yes, I agree. I'm just about to submit an updated version of the patch
series that also introduces a few columns into pg_stat_replication,
tracking this type of stats (amount of data spilled to disk or streamed,
etc.).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#1)
9 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi,

attached is v4 of the patch series, with a couple of changes:

1) Fixes a bunch of bugs I discovered during stress testing.

I'm not going to go into details, but the main fixes are related to
properly updating progress from the worker, and not streaming when
creating the logical replication slot.

2) Introduces columns into pg_stat_replication.

The new columns track various kinds of statistics (number of xacts,
bytes, ...) about spill-to-disk/streaming. This will be useful when
tuning the GUC memory limit.

3) Two temporary bugfixes that make the patch series work.

The first one (0008) makes sure is_known_subxact is set properly for all
subtransactions, and there's a separate fix in the CF. So this will
eventually go away.

The second one (0009) fixes an issue that is specific to streaming. It
does fix the issue, but I need a bit more time to think about it before
merging it into 0005.

This does pass extensive stress testing with a workload mixing DML, DDL,
subtransactions, aborts, etc. under valgrind. I'm working on extending
the test coverage, and introducing various error conditions (e.g.
walsender/walreceiver timeouts, failures on both ends, etc.).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0006-Add-support-for-streaming-to-built-in-replication-v4.patch.gz
��B>��O*��
��B��T��T��*�3�FU��B��������&���[m�	�w����!��z���hlj���qfAJ���u��([��y4�4���5��5���'���E������g��
��I?Z_��^�������k�Hk������O=�'�Z�Z>���Q�qJ�}���4�'��I#z���4�'��I#��hD��f��}�������S?��U�g}<��9�2=$r��
,��y�vr�w��jV5�u�������P�@}Yh��k�����Q
0007-Track-statistics-for-streaming-spilling-v4.patch.gzapplication/gzip; name=0007-Track-statistics-for-streaming-spilling-v4.patch.gzDownload
0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v4.patch.gzapplication/gzip; name=0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v4.patch.gzDownload
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v4.patch.gzapplication/gzip; name=0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v4.patch.gzDownload
0003-Issue-individual-invalidations-with-wal_level-log-v4.patch.gzapplication/gzip; name=0003-Issue-individual-invalidations-with-wal_level-log-v4.patch.gzDownload
0004-Extend-the-output-plugin-API-with-stream-methods-v4.patch.gzapplication/gzip; name=0004-Extend-the-output-plugin-API-with-stream-methods-v4.patch.gzDownload
0005-Implement-streaming-mode-in-ReorderBuffer-v4.patch.gzapplication/gzip; name=0005-Implement-streaming-mode-in-ReorderBuffer-v4.patch.gzDownload
0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v4.patch.gzapplication/gzip; name=0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v4.patch.gzDownload
0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v4.patch.gzapplication/gzip; name=0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v4.patch.gzDownload
#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#14)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 01/03/2018 09:06 PM, Tomas Vondra wrote:

Hi,

attached is v4 of the patch series, with a couple of changes:

1) Fixes a bunch of bugs I discovered during stress testing.

I'm not going to go into details, but the main fixes are related to
properly updating progress from the worker, and not streaming when
creating the logical replication slot.

2) Introduces columns into pg_stat_replication.

The new columns track various kinds of statistics (number of xacts,
bytes, ...) about spill-to-disk/streaming. This will be useful when
tuning the GUC memory limit.

3) Two temporary bugfixes that make the patch series work.

Forgot to mention that v4 also extends the CREATE SUBSCRIPTION command
to allow customizing the streaming and memory limit. So you can do

CREATE SUBSCRIPTION ... WITH (streaming=on, work_mem=1024)

and this subscription will allow streaming, and the logical_work_mem (on
the provider) will be set to 1MB.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Tomas Vondra (#13)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 1/3/18 14:53, Tomas Vondra wrote:

I don't see the need to tie this setting to maintenance_work_mem.
maintenance_work_mem is often set to very large values, which could
then have undesirable side effects on this use.

Well, we need to pick some default value, and we can either use a fixed
value (not sure what would be a good default) or tie it to an existing
GUC. We only really have work_mem and maintenance_work_mem, and the
walsender process will never use more than one such buffer. Which seems
to be closer to maintenance_work_mem.

Pretty much any default value can have undesirable side effects.

Let's just make it an independent setting unless we know any better. We
don't have a lot of settings that depend on other settings, and the ones
we do have involve a very specific relationship.

Moreover, the name logical_work_mem makes it sound like it's a logical
version of work_mem. Maybe we could think of another name.

I won't object to a better name, of course. Any proposals?

logical_decoding_[work_]mem?

I think we need a way to report on how much memory is actually used,
so the setting can be tuned. Something analogous to log_temp_files
perhaps.

Yes, I agree. I'm just about to submit an updated version of the patch
series, which also introduces a few columns into pg_stat_replication,
tracking this type of stats (amount of data spilled to disk or streamed,
etc.).

That seems OK. Perhaps we could bring forward the part of that patch
that applies to this feature.

That would also help testing *this* feature and determine what
appropriate settings are.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#17Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Tomas Vondra (#15)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 1/3/18 15:13, Tomas Vondra wrote:

Forgot to mention that v4 also extends the CREATE SUBSCRIPTION command
to allow customizing the streaming and memory limit. So you can do

CREATE SUBSCRIPTION ... WITH (streaming=on, work_mem=1024)

and this subscription will allow streaming, and the logical_work_mem (on
the provider) will be set to 1MB.

I was wondering already during PG10 development whether we should give
subscriptions a generic configuration array, like databases and roles
have, so we don't have to hardcode a bunch of similar stuff every time
we add an option like this. At the time we only had synchronous_commit,
but now we're adding more.

Also, instead of sticking this into the START_REPLICATION command, could
we just run a SET command? That should work over replication
connections as well.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Tomas Vondra (#1)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 12/22/17 23:57, Tomas Vondra wrote:

PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

I would like to see some more discussion on this, but I think not a lot
of people understand the details, so I'll try to write up an explanation
here. This code is also somewhat new to me, so please correct me if
there are inaccuracies, while keeping in mind that I'm trying to simplify.

The data in the WAL is written as it happens, so the changes belonging
to different transactions are all mixed together. One of the jobs of
logical decoding is to reassemble the changes belonging to each
transaction. The top-level data structure for that is the infamous
ReorderBuffer. So as it reads the WAL and sees something about a
transaction, it keeps a copy of that change in memory, indexed by
transaction ID (ReorderBufferChange). When the transaction commits, the
accumulated changes are passed to the output plugin and then freed. If
the transaction aborts, then changes are just thrown away.

So when logical decoding is active, a copy of the changes for each
active transaction is kept in memory (once per walsender).

More precisely, the above happens for each subtransaction. When the
top-level transaction commits, it finds all its subtransactions in the
ReorderBuffer, reassembles everything in the right order, then invokes
the output plugin.

All this could end up using an unbounded amount of memory, so there is a
mechanism to spill changes to disk. The way this currently works is
hardcoded, and this patch proposes to change that.

Currently, when a transaction or subtransaction has accumulated 4096
changes, it is spilled to disk. When the top-level transaction commits,
things are read back from disk to do the final processing mentioned above.

This all works mostly fine, but you can construct some more extreme
cases where this can blow up.

Here is a mundane example. Let's say a change entry takes 100 bytes (it
might contain a new row, or an update key and some new column values,
for example). If you have 100 concurrent active sessions and no
subtransactions, then logical decoding memory is bounded by 4096 * 100 *
100 = 40 MB (per walsender) before things spill to disk.

Now let's say you are using a lot of subtransactions, because you are
using PL functions, exception handling, triggers, doing batch updates.
If you have 200 subtransactions on average per concurrent session, the
memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
(per walsender). And so on. If you have more concurrent sessions or
larger changes or more subtransactions, you'll use much more than those
8 GB. And if you don't have those 8 GB, then you're stuck at this point.

That is the consideration when we record changes, but we also need
memory when we do the final processing at commit time. That is slightly
less problematic because we only process one top-level transaction at a
time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
(without the concurrent sessions factor).

So, this patch proposes to improve this as follows:

- We compute the actual size of each ReorderBufferChange and keep a
running tally for each transaction, instead of just counting the number
of changes.

- We have a configuration setting that allows us to change the limit
instead of the hardcoded 4096. The configuration setting is also in
terms of memory, not in number of changes.

- The configuration setting is for the total memory usage per decoding
session, not per subtransaction. (So we also keep a running tally for
the entire ReorderBuffer.)
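
To make the accounting part concrete, here is a minimal sketch of the
bookkeeping this implies (field and function names are illustrative,
not necessarily the patch's actual identifiers):

typedef struct ReorderBufferTXN
{
    TransactionId xid;
    Size          size;     /* accumulated size of this xact's changes */
    /* ... other fields ... */
} ReorderBufferTXN;

typedef struct ReorderBuffer
{
    Size          size;     /* total size across all transactions */
    /* ... other fields ... */
} ReorderBuffer;

/*
 * Adjust both tallies whenever a change is queued into or removed
 * from a transaction, using the actual size of that change.
 */
static void
update_memory_tally(ReorderBuffer *rb, ReorderBufferTXN *txn,
                    Size sz, bool addition)
{
    if (addition)
    {
        txn->size += sz;
        rb->size += sz;
    }
    else
    {
        Assert(txn->size >= sz && rb->size >= sz);
        txn->size -= sz;
        rb->size -= sz;
    }
}

Once rb->size exceeds the configured limit, the reorder buffer has to
pick a victim transaction and spill it, which leads directly to the
second open issue below.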

There are two open issues with this patch:

One, this mechanism only applies when recording changes. The processing
at commit time still uses the previous hardcoded mechanism. The reason
for this is, AFAIU, that as things currently work, you have to have all
subtransactions in memory to do the final processing. There are some
proposals to change this as well, but they are more involved. Arguably,
per my explanation above, memory use at commit time is less likely to be
a problem.

Two, what to do when the memory limit is reached. With the old
accounting, this was easy, because we'd decide for each subtransaction
independently whether to spill it to disk, when it has reached its 4096
limit. Now, we are looking at a global limit, so we have to find a
transaction to spill in some other way. The proposed patch searches
through the entire list of transactions to find the largest one. But as
the patch says:

"XXX With many subtransactions this might be quite slow, because we'll
have to walk through all of them. There are some options how we could
improve that: (a) maintain some secondary structure with transactions
sorted by amount of changes, (b) not looking for the entirely largest
transaction, but e.g. for transaction using at least some fraction of
the memory limit, and (c) evicting multiple transactions at once, e.g.
to free a given portion of the memory limit (e.g. 50%)."

(a) would create more overhead for the case where everything fits into
memory, so it seems unattractive. Some combination of (b) and (c) seems
useful, but we'd have to come up with something concrete.
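
For concreteness, the naive search described in that comment amounts to
a single pass over the transactions, roughly like this (simplified
sketch, not the patch's actual code):

/*
 * Simplified sketch: walk all toplevel transactions and return the
 * one with the most accumulated changes, to be spilled to disk.
 * (A real version would also have to consider subtransactions.)
 */
static ReorderBufferTXN *
find_largest_txn(ReorderBuffer *rb)
{
    ReorderBufferTXN *largest = NULL;
    dlist_iter  iter;

    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn =
            dlist_container(ReorderBufferTXN, node, iter.cur);

        if (largest == NULL || txn->size > largest->size)
            largest = txn;
    }

    return largest;
}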

Thoughts?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#19Greg Stark
stark@mit.edu
In reply to: Peter Eisentraut (#18)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 11 January 2018 at 19:41, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Two, what to do when the memory limit is reached. With the old
accounting, this was easy, because we'd decide for each subtransaction
independently whether to spill it to disk, when it has reached its 4096
limit. Now, we are looking at a global limit, so we have to find a
transaction to spill in some other way. The proposed patch searches
through the entire list of transactions to find the largest one. But as
the patch says:

"XXX With many subtransactions this might be quite slow, because we'll
have to walk through all of them. There are some options how we could
improve that: (a) maintain some secondary structure with transactions
sorted by amount of changes, (b) not looking for the entirely largest
transaction, but e.g. for transaction using at least some fraction of
the memory limit, and (c) evicting multiple transactions at once, e.g.
to free a given portion of the memory limit (e.g. 50%)."

AIUI spilling to disk doesn't affect absorbing future updates; we
would just keep accumulating them in memory, right? We won't need to
unspill until it comes time to commit.

Is there any actual advantage to picking the largest transaction? It
means fewer spills and fewer unspills at commit time, but that's just a
bigger spike of I/O and more of a chance of spilling more than
necessary to get by. In the end it'll be more or less the same amount
of data read back, just all in one big spike when spilling and one big
spike when committing. If you spilled smaller transactions you would
have a small amount of I/O more frequently and have to read back small
amounts for many commits. But it would add up to the same amount of
I/O (or less if you avoid spilling more than necessary).

The real aim should be to try to pick the transaction that will be
committed furthest in the future. That gives you the most memory to
use for live transactions for the longest time and could let you
process the maximum amount of transactions without spilling them. So
either the oldest transaction (in the expectation that it's been open
a while and appears to be a long-lived batch job that will stay open
for a long time) or the youngest transaction (in the expectation that
all transactions are more or less equally long-lived) might make
sense.

--
greg

#20Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Greg Stark (#19)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 1/11/18 18:23, Greg Stark wrote:

AIUI spilling to disk doesn't affect absorbing future updates; we
would just keep accumulating them in memory, right? We won't need to
unspill until it comes time to commit.

Once a transaction has been serialized, future updates keep accumulating
in memory, until perhaps it gets serialized again. But then at commit
time, if a transaction has been partially serialized at all, all the
remaining changes are also serialized before the whole thing is read
back in (see reorderbuffer.c line 855).

So one optimization would be to specially keep track of all transactions
that have been serialized already and pick those first for further
serialization, because it will be done eventually anyway.

But this is only a secondary optimization, because it doesn't help in
the extreme cases that either no (or few) transactions have been
serialized or all (or most) transactions have been serialized.

The real aim should be to try to pick the transaction that will be
committed furthest in the future. That gives you the most memory to
use for live transactions for the longest time and could let you
process the maximum amount of transactions without spilling them. So
either the oldest transaction (in the expectation that it's been open
a while and appears to be a long-lived batch job that will stay open
for a long time) or the youngest transaction (in the expectation that
all transactions are more or less equally long-lived) might make
sense.

Yes, that makes sense. We'd still need to keep a separate ordered list
of transactions somewhere, but that might be easier if we just order
them in the order we see them.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Eisentraut (#18)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 01/11/2018 08:41 PM, Peter Eisentraut wrote:

On 12/22/17 23:57, Tomas Vondra wrote:

PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

I would like to see some more discussion on this, but I think not a lot
of people understand the details, so I'll try to write up an explanation
here. This code is also somewhat new to me, so please correct me if
there are inaccuracies, while keeping in mind that I'm trying to simplify.

... snip ...

Thanks for a comprehensive summary of the patch!

"XXX With many subtransactions this might be quite slow, because we'll
have to walk through all of them. There are some options how we could
improve that: (a) maintain some secondary structure with transactions
sorted by amount of changes, (b) not looking for the entirely largest
transaction, but e.g. for transaction using at least some fraction of
the memory limit, and (c) evicting multiple transactions at once, e.g.
to free a given portion of the memory limit (e.g. 50%)."

(a) would create more overhead for the case where everything fits into
memory, so it seems unattractive. Some combination of (b) and (c) seems
useful, but we'd have to come up with something concrete.

Yeah, when writing that comment I was worried that (a) might get rather
expensive. I was thinking about maintaining a dlist of transactions
sorted by size (ReorderBuffer now only has a hash table), so that we
could evict transactions from the beginning of the list.

But while that speeds up the choice of transactions to evict, the added
cost is rather high, particularly when most transactions are roughly of
the same size. Because in that case we probably have to move the nodes
around in the list quite often. So it seems wiser to just walk the list
once when looking for a victim.

What I'm thinking about instead is tracking just some approximated
version of this - it does not really matter whether we evict the really
largest transaction or one that is a couple of kilobytes smaller. What
we care about is an answer to this question:

Is there some very large transaction that we could evict to free
a lot of memory, or are all transactions fairly small?

So perhaps we can define some "size classes" and track to which of them
each transaction belongs. For example, we could split the memory limit
into 100 buckets, each representing a 1% size increment.

A transaction would not switch the class very often, and it would be
trivial to pick the largest transaction. When all the transactions are
squashed in the smallest classes, we may switch to some alternative
strategy. Not sure.
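
As a sketch of the idea (made-up names, and assuming the per-transaction
size tally from the patch), the class of a transaction could be computed
like this:

#define NUM_SIZE_CLASSES 100

/*
 * Map a transaction size to one of 100 classes, each covering 1% of
 * the memory limit; everything at or above the limit lands in the top
 * class. A transaction only moves between per-class lists when this
 * value changes, which should be infrequent for stable sizes.
 */
static int
txn_size_class(Size txn_size, Size memory_limit)
{
    Size    step = memory_limit / NUM_SIZE_CLASSES;
    int     cls;

    if (step == 0)
        step = 1;

    cls = (int) (txn_size / step);
    return Min(cls, NUM_SIZE_CLASSES - 1);
}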

In any case, I don't really know how expensive the selection actually
is, and if it's an issue. I'll do some measurements.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#22Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Eisentraut (#20)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 01/12/2018 05:35 PM, Peter Eisentraut wrote:

On 1/11/18 18:23, Greg Stark wrote:

AIUI spilling to disk doesn't affect absorbing future updates; we
would just keep accumulating them in memory, right? We won't need to
unspill until it comes time to commit.

Once a transaction has been serialized, future updates keep accumulating
in memory, until perhaps it gets serialized again. But then at commit
time, if a transaction has been partially serialized at all, all the
remaining changes are also serialized before the whole thing is read
back in (see reorderbuffer.c line 855).

So one optimization would be to specially keep track of all transactions
that have been serialized already and pick those first for further
serialization, because it will be done eventually anyway.

But this is only a secondary optimization, because it doesn't help in
the extreme cases that either no (or few) transactions have been
serialized or all (or most) transactions have been serialized.

The real aim should be to try to pick the transaction that will be
committed furthest in the future. That gives you the most memory to
use for live transactions for the longest time and could let you
process the maximum amount of transactions without spilling them. So
either the oldest transaction (in the expectation that it's been open
a while and appears to be a long-lived batch job that will stay open
for a long time) or the youngest transaction (in the expectation that
all transactions are more or less equally long-lived) might make
sense.

Yes, that makes sense. We'd still need to keep a separate ordered list
of transactions somewhere, but that might be easier if we just order
them in the order we see them.

Wouldn't the 'toplevel_by_lsn' be suitable for this? Subtransactions
don't really commit independently, but as part of the toplevel xact. And
that list is ordered by LSN, which is pretty much exactly the order in
which we see the transactions.

I feel somewhat uncomfortable about evicting oldest (or youngest)
transactions based on some assumed correlation with the commit
order. I'm pretty sure that will bite us badly for some workloads.

Another somewhat non-intuitive detail is that because ReorderBuffer
switched to the Generation allocator for changes (which usually represent
99% of the memory used during decoding), it does not reuse memory the
way AllocSet does. Actually, it does not reuse memory at all, aiming to
eventually give the memory back to libc (which AllocSet can't do).

Because of this, evicting the youngest transactions seems like quite a
bad idea, because those chunks will not be reused and there may be other
chunks on the blocks, preventing their release.

Yeah, complicated stuff.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#23Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Tomas Vondra (#22)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 1/12/18 23:19, Tomas Vondra wrote:

Wouldn't the 'toplevel_by_lsn' be suitable for this? Subtransactions
don't really commit independently, but as part of the toplevel xact. And
that list is ordered by LSN, which is pretty much exactly the order in
which we see the transactions.

Yes indeed. There is even ReorderBufferGetOldestTXN().
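
And since ReorderBufferGetOldestTXN() just returns the head of that
LSN-ordered list, an oldest-first eviction pass could be as simple as
this (rough sketch; assumes the rb->size / txn->size accounting from the
patch, and that ReorderBufferSerializeTXN() subtracts what it spills
from rb->size):

/* Rough sketch: spill oldest transactions until back under the limit. */
static void
evict_oldest_first(ReorderBuffer *rb, Size limit)
{
    dlist_iter  iter;

    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn =
            dlist_container(ReorderBufferTXN, node, iter.cur);

        if (rb->size < limit)
            break;              /* back under the limit, stop evicting */

        ReorderBufferSerializeTXN(rb, txn);
    }
}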

Another somewhat non-intuitive detail is that because ReorderBuffer
switched to the Generation allocator for changes (which usually represent
99% of the memory used during decoding), it does not reuse memory the
way AllocSet does. Actually, it does not reuse memory at all, aiming to
eventually give the memory back to libc (which AllocSet can't do).

Because of this, evicting the youngest transactions seems like quite a
bad idea, because those chunks will not be reused and there may be other
chunks on the blocks, preventing their release.

Right. But this raises the question of whether we are doing the memory
accounting at the right level. If we are doing all this tracking based
on ReorderBufferChanges, but then serializing changes possibly doesn't
actually free any memory in the operating system, that's no good. Can
we get some usage statistics out of the memory context? It seems like
we need to keep serializing transactions until we actually see the
memory context size drop.
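
For example, with a (currently hypothetical) accessor reporting how much
memory a context has allocated from the OS, the eviction loop could keep
going until the footprint actually shrinks - a rough sketch, where
MemoryContextGetAllocated() and pick_victim_txn() are both made up for
illustration:

/* Hypothetical: evict until the change context really gives memory back. */
Size    before = MemoryContextGetAllocated(rb->change_context);

while (MemoryContextGetAllocated(rb->change_context) >= before)
{
    ReorderBufferTXN *txn = pick_victim_txn(rb);    /* any policy */

    if (txn == NULL)
        break;                  /* nothing left to serialize */

    ReorderBufferSerializeTXN(rb, txn);
}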

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#24Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#14)
9 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Attached is v5, fixing a silly bug in part 0006, causing a segfault when
creating a subscription.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v5.patch.gzapplication/gzip; name=0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v5.patch.gzDownload
0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v5.patch.gzapplication/gzip; name=0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v5.patch.gzDownload
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v5.patch.gzapplication/gzip; name=0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v5.patch.gzDownload
0003-Issue-individual-invalidations-with-wal_level-log-v5.patch.gzapplication/gzip; name=0003-Issue-individual-invalidations-with-wal_level-log-v5.patch.gzDownload
0004-Extend-the-output-plugin-API-with-stream-methods-v5.patch.gzapplication/gzip; name=0004-Extend-the-output-plugin-API-with-stream-methods-v5.patch.gzDownload
0005-Implement-streaming-mode-in-ReorderBuffer-v5.patch.gzapplication/gzip; name=0005-Implement-streaming-mode-in-ReorderBuffer-v5.patch.gzDownload
0006-Add-support-for-streaming-to-built-in-replication-v5.patch.gzapplication/gzip; name=0006-Add-support-for-streaming-to-built-in-replication-v5.patch.gzDownload
0007-Track-statistics-for-streaming-spilling-v5.patch.gzapplication/gzip; name=0007-Track-statistics-for-streaming-spilling-v5.patch.gzDownload
0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v5.patch.gzapplication/gzip; name=0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v5.patch.gzDownload
#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#24)
9 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 01/19/2018 03:34 PM, Tomas Vondra wrote:

Attached is v5, fixing a silly bug in part 0006, causing a segfault when
creating a subscription.

Meh, there was a bug in the sgml docs (<variable> vs. <varname>),
causing another failure. Hopefully v6 will pass the CI build; it does
pass a build with the same parameters on my system.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v6.patch.gzapplication/gzip; name=0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v6.patch.gzDownload
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v6.patch.gzapplication/gzip; name=0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v6.patch.gzDownload
0003-Issue-individual-invalidations-with-wal_level-log-v6.patch.gzapplication/gzip; name=0003-Issue-individual-invalidations-with-wal_level-log-v6.patch.gzDownload
0004-Extend-the-output-plugin-API-with-stream-methods-v6.patch.gzapplication/gzip; name=0004-Extend-the-output-plugin-API-with-stream-methods-v6.patch.gzDownload
0005-Implement-streaming-mode-in-ReorderBuffer-v6.patch.gzapplication/gzip; name=0005-Implement-streaming-mode-in-ReorderBuffer-v6.patch.gzDownload
0006-Add-support-for-streaming-to-built-in-replication-v6.patch.gzapplication/gzip; name=0006-Add-support-for-streaming-to-built-in-replication-v6.patch.gzDownload
0007-Track-statistics-for-streaming-spilling-v6.patch.gzapplication/gzip; name=0007-Track-statistics-for-streaming-spilling-v6.patch.gzDownload
0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v6.patch.gzapplication/gzip; name=0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v6.patch.gzDownload
0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v6.patch.gzapplication/gzip; name=0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v6.patch.gzDownload
#26Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#25)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Jan 20, 2018 at 7:08 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 01/19/2018 03:34 PM, Tomas Vondra wrote:

Attached is v5, fixing a silly bug in part 0006, causing a segfault when
creating a subscription.

Meh, there was a bug in the sgml docs (<variable> vs. <varname>),
causing another failure. Hopefully v6 will pass the CI build; it does
pass a build with the same parameters on my system.

Thank you for working on this. This patch would be helpful for
synchronous replication.

I haven't looked at the code deeply yet, but I've reviewed the v6
patch set, especially on the subscriber side. All of the patches apply
cleanly to current HEAD. Here are my review comments.

----
CREATE SUBSCRIPTION commands accept work_mem < 64, but it leads to an
ERROR on the publisher side when starting replication. Probably we should
check the value on the subscriber side as well.

----
When streaming = on, if we drop the subscription in the middle of
receiving streamed changes, DROP SUBSCRIPTION could leak tmp files
(the .changes and .subxacts files). It also happens when a transaction
on the upstream is aborted without an abort record.

----
Since we can change both the streaming option and the work_mem option
by ALTER SUBSCRIPTION, the documentation of ALTER SUBSCRIPTION needs to
be updated.

----
If we create a subscription without any options, both
pg_subscription.substream and pg_subscription.subworkmem are set to
null. However, since GetSubscription isn't aware of NULL, we start the
replication with invalid options like the following.
LOG: received replication command: START_REPLICATION SLOT "hoge_sub"
LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on',
publication_names '"hoge_pub"')

I think we can set substream to false and subworkmem to -1 instead of
null, and then make libpqrcv_startstreaming not set the streaming option
if stream is -1.

----
Some WARNING messages appeared. Maybe these are for debug purposes?

WARNING: updating stream stats 0x1c12ef8 4 3 65604
WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#27Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Peter Eisentraut (#23)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

To close out this commit fest, I'm setting both of these patches as
returned with feedback, as there are apparently significant issues to be
addressed. Feel free to move them to the next commit fest when you
think they are ready to be continued.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Masahiko Sawada (#26)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 01/31/2018 07:53 AM, Masahiko Sawada wrote:

On Sat, Jan 20, 2018 at 7:08 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 01/19/2018 03:34 PM, Tomas Vondra wrote:

Attached is v5, fixing a silly bug in part 0006, causing a segfault when
creating a subscription.

Meh, there was a bug in the sgml docs (<variable> vs. <varname>),
causing another failure. Hopefully v6 will pass the CI build; it does
pass a build with the same parameters on my system.

Thank you for working on this. This patch would be helpful for
synchronous replication.

I haven't looked at the code deeply yet, but I've reviewed the v6
patch set, especially on the subscriber side. All of the patches apply
cleanly to current HEAD. Here are my review comments.

----
CREATE SUBSCRIPTION commands accept work_mem < 64, but it leads to an
ERROR on the publisher side when starting replication. Probably we should
check the value on the subscriber side as well.

----
When streaming = on, if we drop the subscription in the middle of
receiving streamed changes, DROP SUBSCRIPTION could leak tmp files
(the .changes and .subxacts files). It also happens when a transaction
on the upstream is aborted without an abort record.

----
Since we can change both the streaming option and the work_mem option
by ALTER SUBSCRIPTION, the documentation of ALTER SUBSCRIPTION needs to
be updated.

----
If we create a subscription without any options, both
pg_subscription.substream and pg_subscription.subworkmem are set to
null. However, since GetSubscription isn't aware of NULL, we start the
replication with invalid options like the following.
LOG: received replication command: START_REPLICATION SLOT "hoge_sub"
LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on',
publication_names '"hoge_pub"')

I think we can set substream to false and subworkmem to -1 instead of
null, and then make libpqrcv_startstreaming not set the streaming option
if stream is -1.

----
Some WARNING messages appeared. Maybe these are for debug purposes?

WARNING: updating stream stats 0x1c12ef8 4 3 65604
WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080

Regards,

Thanks for the review! I'll address the issues in the next version of
the patch.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#29Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Eisentraut (#27)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 02/01/2018 03:51 PM, Peter Eisentraut wrote:

To close out this commit fest, I'm setting both of these patches as
returned with feedback, as there are apparently significant issues to be
addressed. Feel free to move them to the next commit fest when you
think they are ready to be continued.

Will do. Thanks for the feedback.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#30Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#29)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote:

On 02/01/2018 03:51 PM, Peter Eisentraut wrote:

To close out this commit fest, I'm setting both of these patches as
returned with feedback, as there are apparently significant issues to be
addressed. Feel free to move them to the next commit fest when you
think they are ready to be continued.

Will do. Thanks for the feedback.

Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
but I don't see a newer version posted?

Greetings,

Andres Freund

#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#30)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 03/02/2018 02:12 AM, Andres Freund wrote:

On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote:

On 02/01/2018 03:51 PM, Peter Eisentraut wrote:

To close out this commit fest, I'm setting both of these patches as
returned with feedback, as there are apparently significant issues to be
addressed. Feel free to move them to the next commit fest when you
think they are ready to be continued.

Will do. Thanks for the feedback.

Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
but I don't see a newer version posted?

Ah, apologies - that's due to moving the patch from the last CF (it was
marked as RWF so I had to reopen it before moving it). I'll submit a new
version of the patch shortly; please mark it as WOA until then.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#32David Steele
david@pgmasters.net
In reply to: Tomas Vondra (#31)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi Tomas.

On 3/1/18 9:33 PM, Tomas Vondra wrote:

On 03/02/2018 02:12 AM, Andres Freund wrote:

On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote:

On 02/01/2018 03:51 PM, Peter Eisentraut wrote:

To close out this commit fest, I'm setting both of these patches as
returned with feedback, as there are apparently significant issues to be
addressed. Feel free to move them to the next commit fest when you
think they are ready to be continued.

Will do. Thanks for the feedback.

Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
but I don't see a newer version posted?

Ah, apologies - that's due to moving the patch from the last CF (it was
marked as RWF so I had to reopen it before moving it). I'll submit a new
version of the patch shortly; please mark it as WOA until then.

Marked as Waiting on Author.

--
-David
david@pgmasters.net

#33Andres Freund
andres@anarazel.de
In reply to: David Steele (#32)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi,

On 2018-03-01 21:39:36 -0500, David Steele wrote:

On 3/1/18 9:33 PM, Tomas Vondra wrote:

On 03/02/2018 02:12 AM, Andres Freund wrote:

Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
but I don't see a newer version posted?

Ah, apologies - that's due to moving the patch from the last CF (it was
marked as RWF so I had to reopen it before moving it). I'll submit a new
version of the patch shortly; please mark it as WOA until then.

Marked as Waiting on Author.

Sorry to be the hard-ass, but given this patch hasn't been moved forward
since 2018-01-19, I'm not sure why it's eligible to be in this CF in the
first place?

Greetings,

Andres Freund

#34Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#31)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Ah, apologies - that's due to moving the patch from the last CF (it was
marked as RWF so I had to reopen it before moving it). I'll submit a new
version of the patch shortly; please mark it as WOA until then.

So, the way it's supposed to work is you resubmit the patch first and
then re-activate the CF entry. If you get to re-activate the CF entry
without actually updating the patch, and then submit the patch
afterwards, then the CF deadline becomes largely meaningless. I think
a new patch should be rejected as untimely.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#35David Steele
david@pgmasters.net
In reply to: Robert Haas (#34)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 3/2/18 3:06 PM, Robert Haas wrote:

On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Ah, apologies - that's due to moving the patch from the last CF (it was
marked as RWF so I had to reopen it before moving it). I'll submit a new
version of the patch shortly; please mark it as WOA until then.

So, the way it's supposed to work is you resubmit the patch first and
then re-activate the CF entry. If you get to re-activate the CF entry
without actually updating the patch, and then submit the patch
afterwards, then the CF deadline becomes largely meaningless. I think
a new patch should be rejected as untimely.

Hmmm, I missed that implication last night. I'll mark this Returned
with Feedback.

Tomas, please move to the next CF once you have an updated patch.

Thanks,
--
-David
david@pgmasters.net

#36Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Steele (#35)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 03/02/2018 09:21 PM, David Steele wrote:

On 3/2/18 3:06 PM, Robert Haas wrote:

On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Ah, apologies - that's due to moving the patch from the last CF (it was
marked as RWF so I had to reopen it before moving it). I'll submit a new
version of the patch shortly; please mark it as WOA until then.

So, the way it's supposed to work is you resubmit the patch first and
then re-activate the CF entry. If you get to re-activate the CF entry
without actually updating the patch, and then submit the patch
afterwards, then the CF deadline becomes largely meaningless. I think
a new patch should be rejected as untimely.

Hmmm, I missed that implication last night. I'll mark this Returned
with Feedback.

Tomas, please move to the next CF once you have an updated patch.

Can you guys please point me to the CF rules that say this? Because my
understanding (and not just mine, AFAICS) was obviously different.
Clearly there's a disconnect somewhere.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#37Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#28)
9 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi there,

attached is an updated patch fixing all the reported issues (a bit more
about those below).

The main change in this patch version is reworked logging of subxact
assignments, which needs to be done immediately for incremental decoding
to work properly.

The previous patch versions did that by logging a separate xlog record,
which however had rather noticeable space overhead (~40% on a worst-case
test - tiny table, no FPWs, ...). While in practice the overhead would
be much closer to 0%, it still seemed unacceptable.

Andres proposed doing something like we do with replication origins in
XLogRecordAssemble, i.e. inventing a special block, and embedding the
assignment info into that (in the next xlog record). This turned out to
work quite well, and the worst-case space overhead dropped to ~5%.
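
Schematically (and only schematically - the flag and block id below are
made up for illustration, not the patch's actual definitions), the
embedding mirrors what XLogRecordAssemble already does for replication
origins:

/*
 * Illustration only: piggy-back "this xid belongs to toplevel xact X"
 * on the next WAL record written by the subtransaction, the same way
 * the replication origin is embedded into assembled records.
 */
if (include_xid_assignment)
{
    *(scratch++) = (char) XLR_BLOCK_ID_XID_ASSIGNMENT;  /* made-up id */
    memcpy(scratch, &toplevel_xid, sizeof(TransactionId));
    scratch += sizeof(TransactionId);
}

On the decoding side, the subxact can then be assigned to its toplevel
transaction (e.g. via ReorderBufferAssignChild) as soon as such a block
is seen, instead of waiting for the commit record.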

I have attempted to do something like that with the invalidations, which
is the other thing that needs to be logged immediately for incremental
decoding to work correctly. The plan was to use the same approach as for
assignments, i.e. embed the invalidations into the next xlog record and
stop sending them in the commit message. That however turned out to be
much more complicated - the embedding is fairly trivial, of course, but
unlike assignments the invalidations are needed for hot standbys. If we
only send them incrementally, I think the standby would have to collect
them from the WAL records, and store them in a way that survives restarts.

So for invalidations the patch uses the original approach with a new
xlog record type (ignored by the standby), and still logs the
invalidations in the commit record (which is what the standby relies on).

On 02/01/2018 11:50 PM, Tomas Vondra wrote:

On 01/31/2018 07:53 AM, Masahiko Sawada wrote:
...

----
CREATE SUBSCRIPTION commands accept work_mem < 64, but it leads to an
ERROR on the publisher side when starting replication. Probably we should
check the value on the subscriber side as well.

Added.

----
When streaming = on, if we drop the subscription in the middle of
receiving streamed changes, DROP SUBSCRIPTION could leak tmp files
(the .changes and .subxacts files). It also happens when a transaction
on the upstream is aborted without an abort record.

Right. The files would get cleaned up eventually during restart (just
like other temporary files), but leaking them after DROP SUBSCRIPTION is
not cool. So I've added simple tracking of the files (or rather the
streamed XIDs) in the worker, and clean them up explicitly on exit.
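
The tracking itself is trivial - roughly along these lines (illustrative
sketch; the list, the callback name and the path layout are made up, and
"tempdir" stands in for wherever the stream files actually live):

/* XIDs for which this worker has created stream files on disk;
 * entries get added with lappend_int() when a xact is first streamed. */
static List *streamed_xids = NIL;

/* on-exit callback, registered once at worker startup */
static void
stream_files_cleanup(int code, Datum arg)
{
    ListCell   *lc;

    foreach(lc, streamed_xids)
    {
        TransactionId xid = (TransactionId) lfirst_int(lc);
        char        path[MAXPGPATH];

        /* remove the .changes and .subxacts files for this xact */
        snprintf(path, sizeof(path), "%s/%u.changes", tempdir, xid);
        unlink(path);
        snprintf(path, sizeof(path), "%s/%u.subxacts", tempdir, xid);
        unlink(path);
    }
}

...
before_shmem_exit(stream_files_cleanup, (Datum) 0);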

----
Since we can change both the streaming option and the work_mem option
by ALTER SUBSCRIPTION, the documentation of ALTER SUBSCRIPTION needs to
be updated.

Yep, I've added a note that work_mem and streaming can also be changed.
Those changes won't be applied to the already running worker, though.

----
If we create a subscription without any options, both
pg_subscription.substream and pg_subscription.subworkmem are set to
null. However, since GetSubscription isn't aware of NULL, we start the
replication with invalid options like the following.
LOG: received replication command: START_REPLICATION SLOT "hoge_sub"
LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on',
publication_names '"hoge_pub"')

I think we can set substream to false and subworkmem to -1 instead of
null, and then make libpqrcv_startstreaming not set the streaming option
if stream is -1.

Good catch! I've done pretty much what you suggested here, i.e. store
-1/false instead and then handle that in libpqrcv_startstreaming.
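
So the START_REPLICATION option list is built conditionally, roughly
like this (sketch only; work_mem and streaming are the fields this patch
adds to the logical stream options, and the helper name is made up):

static void
append_subscription_options(StringInfo cmd, WalRcvStreamOptions *options)
{
    /* -1 / false mean "not set by the subscription", so just omit them */
    if (options->proto.logical.work_mem != -1)
        appendStringInfo(cmd, ", work_mem '%d'",
                         options->proto.logical.work_mem);

    if (options->proto.logical.streaming)
        appendStringInfoString(cmd, ", streaming 'on'");
}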

----
Some WARNING messages appeared. Maybe these are for debug purposes?

WARNING: updating stream stats 0x1c12ef8 4 3 65604
WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080

Yeah, those should be removed.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer.patch.gzapplication/gzip; name=0001-Introduce-logical_work_mem-to-limit-ReorderBuffer.patch.gzDownload
0002-Immediatel-WAL-log-assignments.patch.gzapplication/gzip; name=0002-Immediatel-WAL-log-assignments.patch.gzDownload
0003-Issue-individual-invalidations-with-wal_level-logica.patch.gzapplication/gzip; name=0003-Issue-individual-invalidations-with-wal_level-logica.patch.gzDownload
0004-Extend-the-output-plugin-API-with-stream-methods.patch.gzapplication/gzip; name=0004-Extend-the-output-plugin-API-with-stream-methods.patch.gzDownload
0005-Implement-streaming-mode-in-ReorderBuffer.patch.gzapplication/gzip; name=0005-Implement-streaming-mode-in-ReorderBuffer.patch.gzDownload
0006-Add-support-for-streaming-to-built-in-replication.patch.gzapplication/gzip; name=0006-Add-support-for-streaming-to-built-in-replication.patch.gzDownload
��d��.	.���aJ�l0J=l^x��5Xj�>��VkY�� ��pnpI�PK�i&�*�Z)��������,�������K{�FaU%�,�y5��1����\���A��+mJ���Qi5.��f�}��8�t$��t��CY��% ���Qh�i�^�o�Y�A�r��`�^l*W�'��x��h��v,�-���
��H�����">Vp�7�A0R0c�V����������3'�1*������g����3I�aP������G�LT��o��F�T�R8�
%�_`�95Y�3�P�������f��a�T��/�����d���m%	�<����v�t�h?�)D�����c�OF��30D����A�B����k����z�Xs��&zz��;�&G�ol����`~���!_:Yb���`f�e��rEAuu��8���3������y�]������7�����+�O��eZ����nI����v	�)�L������B(;=;�8�?9j�xx�WT�$��'��]g�/�s��;>�2(B����h.���dQ��1���O�K���4���X��j�
�:u�$�3�L�qY�0�	ZT1�g�{����e�,�TE9Q�"�
���!���G�����}4��j�<4/l�=D����/i���
��#\��H���S��k���x.��yuqg6P�}Ue�o1G�Y��6{GG'���*�>?���yM��h����<����^p������S����&wB,��N�A�k��:��{U����(<�P�?�����������'8L3��
����;�&'3���v�S���5J)�	�W�L0�������'���Y~\��s��ALn�RI����'a�Zp�
9@]����4�p��x��%L�)jV&��2�������Zu�9�V\�*Sk�l��O]�8����]�]�5T���|)�^���
Uo���w����ln����8�U���u�;,���L�4-QG$���H~p����hN!�z�5�Y��&����S���z��l�y�S!���2 ��\z�Q��xh�F���vq��x�~�p���6\q&��������X���2`}��_)�X��f��@�����m>.U�
������19���z�6����i���zWy}x�7���_n�u
�(K�!S4��hd��u5�4o>\p� ���Kn�!��I�����RjuJ7�Z����� b��%ce��w_�r����*+x�E�hq�M���b@9}����KwwF6U���CL�|���]'h�^g������x�3\ '�-��hc���Etr�k1AF���5>�9[�"R��-�o����#��h$!� <�;,XE�,�-�����Jt�wiyrq%��\�Kr�#r2�n"e���49�R��bF�\��W��v�]����O�8"PY�kJ����������b_�����tO�iP�Cm4&��4����a��6=���2[��v��e����2��Y��W}h��V�2�0M��^�G��!=�0����rm/�����x����(��{����7����c����w��������|�pT�G���9�%���-oNS��R�X���{op��������M(,����>�;2������;Yb[�!-������N����t&�de^C����z/��?���Wg���_>X�Nu2���F%`����y�����@�o���&`��Xu�y!M
�AO��(�eB���U]�mj����X>0��T��B&�����Ia����\���-hgK��T*�q�Y��9�I*���PI�Rj�@�u�D�$cg��mPHf�P�C* m-��I
����c���A�HL0����(*E���qZ���Duz�m
�@�6�'�O�C�d|Z�\d�
��";�\�����=4g7�����0f'�?8]l�l1����D����k��,�~��M�lA���Q���.$�>�Q+�
����q�,�*���\�C
�X�����t��GLx��0UP���^f��;��So~�5�C������u�_S�?�c��*�&���x�D�}4(-���{���?��c%���(zRn�/
�I��d�;�����W���i|C�Z����s�/���y
M�Vs�����������~�}rhT�e�c+�:�	� Q.�!��i-JH���F	����.��p�G6����.*N�$��6'�7�=���c)
��u����	�%P���^�i���9�>����������ba���9d�X�@v���y������r
��������y����O���!���_�l$�q�B��X^���_��P���k���NsB3$zx�'��D�B'�S(��9�+"��g�d'�{6�L����/B�����d���"�%afH�dAH,*�$��F��L���6�Fe��������g�&H�gMt���I���V����g��?Q����E*�H�+)MNK�,���0
��e�Q[����1�0��r9��b�<���uO�����4~�?�J���6>3�A	kA�]J���,�A��qR
@et�;�t7�V(VS>���dd ���7���(aAfi�/_ah����15 �U�\�����\�$�qx���������FqA��Td�����*�c����u��"c>2J/D8�E���g}fE�����P���)cX�����n����dj�L��RM`(V���h&o��2�@���L��tu3�@&cC�����*�G���Q�4"�q����#e�� �e5�����!����_����r��"<��_�QD��
U��J�RI7�g�D?{D�m�\7F�}�)����7�3�\baom��g��(����P�R��������o"&����H�������BI�%FN���b���
J���n>�5u�V.'wQ���zT#j�*3�0)��f'/I2?��	�d�����>K��F��2��_(�$�J*�����S��.�U���h��O�G'������NO1� ������_���kmY_������\@Aj@���+Z#��A}���BWh�|V�yP�K��r�)��K\�U��r6���U�����M�N+\�r����G�?��A�B���L7?Qe�g�V�y%����j�Q�����wJ��m���5���d��_���mEwE%��w�Li�<]=�}q*N]����)�(q)G�<)������p���zC2
��3�B�$j���V�6�l���$�d9�T��������������w�	�
�/_���kL"�I�s��x�#T�������9'�q�[1q/���x5aG/S9�3���K�=�����;���9�XW��
S\[	������}l����(
��9��Q��S�m�q8[����1�,���\�FU�~��������d�H��U)�����Ip1��������a� b�etu#���?�	6��(�X�3���l�d�8���
�:��w�����X[\�ak�+Oq�'�����,,��6e�y�����T��.�����i�r��Z��*���uvJL������^n:*M������v����$�=������t�_�7�"��5P���;���#l����N�4�[i�uN�����O��t;�i����y@@3��]��;���_J�j������������������N0[e�����k�=�t���7�����������KU�1B8��v�����|��A ��~R C�x�����Fj�N����@����[zE���Z��]�oZ�7�V�vJiS�n�l*�y�qLHAQ��C�4����,/����S���,~�&%a�����5x�]���P>����Z]T�c��\m�-gds�(6H%%�:�8���Y%4(������u�X:��N2��Z�L��(=���l��6������`Tj�����m���X�gMJi"��q, M�x:���CL�c2��76^g�
oc�i���������]I&
�����+J�pz������8#,B����l������M*"��q%�r��w����E�i_�(b���	���'�m����0%Uy��V����f��r���.Q���D=�_��j����d����-����������'b�^T�!��2�oO��gMA@��4�Q�>��A��z�s4x^ebx*:��c
G���b/^��{�p�S���(�1�T�z��7�&����}�^���������U�T���h#Q���tD�n,XOxo�k��NiE���:)r�D�\j����[�
�a!`�]���	������v����;
��b��>ns���7��&�-�m;���q5gMh��"�_=n�V�9�W~������7W�����0��s���Fc�on"��O�������HyU��S�d_u�{5�nA�P�tmh	�?���J3	��;%#�8��z&����U�j�>6����J����I�g���������K��R�sW��.�	j����L�����p$��b
�M����g����D�����u�}�V��|N��rF[�c�,zX[��5���|N���t����z
5
:��}�"��]����osaEK���GS�P
+���t������Y`7�~����	��������Js��7��x�F	G��������@�#���80��~n����-dJ�h'�R��u�"�uX����J%��T�!��hZt��?Q�B��H�~�T�������q��V\�,�6� u�����Omj��im�S|	tA>=,��n��9��������r15�������lxy)������G�g�H�F�m./k�-����{G������w���Y�~��YQ��m��3�l���2�����&�Sm�n<�{V:�y���M���UL!x�'�v}�}��a����+_��*�N5�����M�_�
��J-�}5�0����S��)�G��)���2�ee(2�V#���-���9X5�6L
�h}��%�(R%NhQv6a5V���>�|�u(bp���KW�P��0����W_r=�"s���dn�]�+d����2�/5����Am�f��P�OF����c�?�e���m���������G������r��Yo�7�(j��*:���(I��>L��R�'���������s��9�
�`{���������P��bR�,�h"6�������D��������4I��c���J9Hyt.r8�}����YZb�F��V�.���
M)�W#z�m�����#�D��g���h��*h&��Rp�g����z��q�wY�!��.qk�L��,:�������Uh�V�'�Dxq������{eE��!\E���A��cR��Rp9<"9.k�5��������������)�
�s�����|�=������<xwD`}pf�a�����L;������x�|�AYI\�\���i �h�je�H5�g�2���l�m��������[V��\H_�0��c
���S��$��m�ji����N3���lM_��R�
TM���o������<�����������M|�J4fp~�L�	���jv����_)�[�ra��[�������,���Qq9�Ll���|���Z}	���R�rr���
,�g�9�e9
T:�P7��QP���	��k��4����FH�*�9�z��W6��$����!���0"�9����D� �Zv�n�tMd��D��B>�`"�`;���*Rj4��3.|W
beW���J�@�0�X4�B�4������O�/SF5����������`D�i����(�FP�3��Ts����'P��J5����w�<
��I)�W�K�	�P�%~A@��o��4��5��7�gh��%���0�V�� c�I�""�X��pcs�o�<��-��F�M��-��b��8�5�RO���Cv�
��a�uG�8;��6>�����
4�r��K*g4d���<��H�T���'4�)�i�������,K$vR��������4wC�!2����d���e��`}��[������4��Y�����W�_��x��y����������x�R��O���1���WHh&�:������rR�1[5h��x��?�J�
�+$��z��be�������YqO�T������o=oA���3l����d�z�����^0j���R���gx����������M��nn��;
Z��#o����

�C����F0	��Ha�la�j��jl��#�Q[H���r����p4�f�y�DrW��PE�<�5���!A�[F!	������N.�d�"���ZwB*f��1�~���L�&35���*~\�TE���lQ�������^y��[����3����r6�,K;�=��P~`(].?R��RWM��cr�������kO~��l'��EJ�)����}�����)���nN���1��$�R�l�E����(��9���x
�<V\�"I����6����������5��YjY�o2�?������6=��&U��U%i�X,���v�1������qGO!������[���Q�V�����
��^z��U`��K(��,��B��hY���N.��i�7^P�������D?��,V�D������I��r[;��Jq���3�L�����y�%�yZ�0*6`�^�8����f�,�����tL�:����VTD�G!=J��CT�@*���|n[�������"/:T��;>��
Jp
b ��A7��
��q�;�B^`x�}<^G1y
�+���g��0���Y2)���D��%qs��R�
O�Dc��8�[q[��P������f+��U'�y��X��s�F���M�"`$^V�7�o��<o6���{.N^Q=����k���Nb0��a\P�A���j������E���� ]��CL
<�~���u?Vj9l(���1��o�;�_����
-3"�'����.iB0ofr�%U�ed�QYk�'����J�*�a�1w������q��rU�{G�3�z�w��}���6V�{N����]�������e���yaWiVr�������y�*��U�����3���S��E���
�z�]�Sx��hK�1[kd�E�A��E�@��Lg�|Q����	�6��(��L9�X5Kl&���V+���o��z�z�����X��-����<��nI�r�\w~4��r�����������8p�������.$Y���f
zm������G�U�U�_�]�����4b��i�PR�*��8���(���s'��-�^���V@���|\�P�����u����=Up�K�`2?���9��\0��QR
^��L�IQx�����2.%,��Vock}{�W�_���eo6���)',�%�1n���?T��#:��#Xd������:�����!���=�V`�%j��tsO�Q�p%�Sa�/��<��RW����D���A������Q2�x�%��4r�v�N/�P�4�SDmE|��r�?�4�Ac��m���|�H��]�wM��E�ZZ�T%|��"�����
Z����N�ugkgm��������nNn����NoKj!���_����.��N�TSm�-I�{���������A���������?�oZ6����G�p(�P�	P(�
�"
	�N�Et�����$���#9)���{�.1�O���q���u\�
���]Fa��H��=q1�q�����/��rX�xR�p��P@ ���O��[����6{�T�!�����o���(z�S��-�7�Ow��Fc0���FK�Tc��WS��E��7x7����$I��������F�5Xpi��d���:��������esz��m��������#���X�o�5*j���R�����o{�� ���X�<���V�}+��NR�=������I>���5��L���J���H8cr���m;'��i�2���$��`�dt�&��s���v%^�G��:E�ynii�pR��a�n>�t���-Q��Gz��yjs���2=���n~���e�27u�A�����o��c��O8a�T�bdP���>f(����>���6/7w��z/�
z����V?��\�W,�51 46��@�0e����Q�
=t����k�||���~�]����A{�Q�0Q&�������w?O#��kK��uc�����}�>�|��
���.���{�x�m9\0�&Jo{qC!����l�m���|cg��VL%]eI��1���c�,j
�����n-�lP�N��E�L�w��v���H��#��\nC��Dx}��N�������;@�}J�����$>���	��ev@b�1W����	��j�e^��do����%��`|��l�PwDf�ej�K�,W�����X�������Nq(+3
4V�*u5���E�Y���-��k�EY���c�>��dd����AV�0h����qJ����H����7�v
��/x�qV����c����#���VK-���A�i:��m��z*
�e��>k���/�{�oT�c�:��a���N9��-s�SK��J�������r���y��f�c���3���y_�������	����u���=(���{0�zg���w��N]5�/[�����A��#�g�����}�
�t~����1�*���?��9Srz,�H�:�6�n��6l�.�D��R��l�$�:����/zi[��C�y���fV�t��3�����x��e�9#%1z�B,�yF��fh(h�Z������y�Ir8�\��~�p��'��X�m!�/��l��
$�0uG~�#^�-U{S�	����]�9��b|����X���������l���S�:��e\
N�9��n���d���POV���K��;�[2C�s��>-=�L�,3�0��\�����,�QI!���D,f5y��Z����}����Y�w$�9H�P���&�g�gkk;��&~}������VG��h��m����7zk������Y/��,��eM3��c�_���#��]�E&�S��3����;�6�.�!`}0F��D�<&4��\�#��(�X}��;,C<!-��Ko�AVo:�"\�npsu=���w^�?b����#���{��g�c�������o��=�kS�gui$�YF�x�:j�_����"�pv�Wk��T!��������wd�mmr���u���T��a$KjXK��0�:�zQ�-^D�����Fi�������w3ZT�]S 6�L��\X������w
���/`���%d����/)�� >;����$�$�1�`C�(������������6>���L��Y�$5S���T�W�{���M�~.o����%��V������5���a�\��,R��G�8�@�'A����3G,��1�<i��Qm����,z'z��'�
�q�U�k���R���M���*.�d��!�
ro�O��z�r�,o�\�D���u�:;0M�W�+�X�*���u��1�������cbr�K�d�	"@P��T_��%�G��z>3/�����WG |��w��A���Yj=d8F�%�W�=��R>����z�]�������������.F�����-o��x�Y�]M-���e��EE�`���n�m���*�Y�c��0�;n_m���fX���UQXw2&7���<^ui��+&����e��;����w�����Q>_�t���Op�^���*�5_~���0O��C~QR������V������%.�����V	�;E��2a��C2��*��BJ8p7�1��C&���]�������O��  �j�9<����:Y�����j��w����
��(�Xa#L�������I5���?=��d�u�a@0���	Q��V
�& <s��g��uZ����'��^�}�S��������o�C+���z���D��(�O��%�T��
EQJkM����k��w}�B��Z($8�IYgn�`��/0��o��9��<��"�G���P�u��x;���67��
*#�0��[j$Ks���/U$q/�)����?q58|���������&������@8OP��g�kZ��JcS���3i���y����|a��,����,�	3��]���Le�>i�O������->i�O������->\[4���(z�(��-~%e�|����	��K�1��5`�(]�h>�`�4�?d�e��?d6cM�������Ickk�7������d���L�Ao_�d�PF�^�?�\�j;��dm#�\�Y��T�~�e&����JB�bGt������d!y��<YH�,$��������'��
����6�����k	��j*���Z�4q�����������.b1�m����>{������������h�P����l�k|ou�����M$���^�|%�XS|��2m�zs�Y�X@�^.[G�Bvq��6��ff'��e�9���� C�����t���w���u�%o
�<��C���`:�5�������e�����f�I���4���f��.��e�5��X�N;���>��t�a�)���x��X?��O*��
��B>��O*��
��S!gR��0���T������W{��H~o5�(��v��lG�]����)<�;�����y������*�_���Q����X/�<�p�i���>��^���6z��l�Q��3�G�k;;k����Q�{mi���:�u�)B|2�=��C��q��$�'��I#z���4�'��I#z���4���F��n���Wi����9��Y�W�y�'��3�Z��D.���e���������{���{�n�_U��}���/��N}sa��^fS�_
0007-Track-statistics-for-streaming-spilling.patch.gzapplication/gzip; name=0007-Track-statistics-for-streaming-spilling.patch.gzDownload
0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as_su.patch.gzapplication/gzip; name=0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as_su.patch.gzDownload
0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch.gzapplication/gzip; name=0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch.gzDownload
#38Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#33)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 03/02/2018 09:05 PM, Andres Freund wrote:

Hi,

On 2018-03-01 21:39:36 -0500, David Steele wrote:

On 3/1/18 9:33 PM, Tomas Vondra wrote:

On 03/02/2018 02:12 AM, Andres Freund wrote:

Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
but I don't see a newer version posted?

Ah, apologies - that's due to moving the patch from the last CF (it was
marked as RWF so I had to reopen it before moving it). I'll submit a new
version of the patch shortly, please mark it as WOA until then.

Marked as Waiting on Author.

Sorry to be the hard-ass, but given this patch hasn't been moved forward
since 2018-01-19, I'm not sure why it's eligible to be in this CF in the
first place?

That is somewhat misleading, I think. You're right the last version was
submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
right at the end of the CF. So it's not like the patch was sitting there
with unresolved issues. Based on that review the patch was marked as RWF
and thus not moved to 2018-03 automatically.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#39Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#38)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:

That is somewhat misleading, I think. You're right the last version was
submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
right at the end of the CF. So it's not like the patch was sitting there
with unresolved issues. Based on that review the patch was marked as RWF
and thus not moved to 2018-03 automatically.

I don't see how this changes anything.

- Andres

#40Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#39)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 03/03/2018 02:01 AM, Andres Freund wrote:

On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:

That is somewhat misleading, I think. You're right the last version
was submitted on 2018-01-19, but the next review arrived on
2018-01-31, i.e. right at the end of the CF. So it's not like the
patch was sitting there with unresolved issues. Based on that
review the patch was marked as RWF and thus not moved to 2018-03
automatically.

I don't see how this changes anything.

You've used "The patch hasn't moved forward since 2018-01-19," as an
argument why the patch is not eligible for 2018-03. I suggest that
argument is misleading, because patches generally do not move without
reviews, and it's difficult to respond to a review that arrives on the
last day of a commitfest.

Consider that without the review, the patch would end up with NR status,
and would be moved to the next CF automatically. Isn't that a bit weird?

kind regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#41Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#40)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2018-03-03 02:34:06 +0100, Tomas Vondra wrote:

On 03/03/2018 02:01 AM, Andres Freund wrote:

On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:

That is somewhat misleading, I think. You're right the last version
was submitted on 2018-01-19, but the next review arrived on
2018-01-31, i.e. right at the end of the CF. So it's not like the
patch was sitting there with unresolved issues. Based on that
review the patch was marked as RWF and thus not moved to 2018-03
automatically.

I don't see how this changes anything.

You've used "The patch hasn't moved forward since 2018-01-19," as an
argument why the patch is not eligible for 2018-03. I suggest that
argument is misleading, because patches generally do not move without
reviews, and it's difficult to respond to a review that arrives on the
last day of a commitfest.

Consider that without the review, the patch would end up with NR status,
and would be moved to the next CF automatically. Isn't that a bit weird?

Not sure I follow. The point is that nobody would have complained about
moving the patch into this fest if you'd updated it *before* it
started?

Greetings,

Andres Freund

#42David Steele
david@pgmasters.net
In reply to: Andres Freund (#39)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 3/2/18 8:01 PM, Andres Freund wrote:

On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:

That is somewhat misleading, I think. You're right the last version was
submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
right at the end of the CF. So it's not like the patch was sitting there
with unresolved issues. Based on that review the patch was marked as RWF
and thus not moved to 2018-03 automatically.

I don't see how this changes anything.

I agree that things could be clearer, and Andres has produced a great
document that we can build on. The old one had gotten a bit stale.

However, I think it's pretty obvious that a CF entry should be
accompanied with a patch. It sounds like the timing was awkward but you
still had 28 days to produce a new patch.

I also notice that you submitted 7 patches in this CF but are reviewing
zero.

--
-David
david@pgmasters.net

#43Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Steele (#42)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 03/03/2018 02:37 AM, David Steele wrote:

On 3/2/18 8:01 PM, Andres Freund wrote:

On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:

That is somewhat misleading, I think. You're right the last version was
submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
right at the end of the CF. So it's not like the patch was sitting there
with unresolved issues. Based on that review the patch was marked as RWF
and thus not moved to 2018-03 automatically.

I don't see how this changes anything.

I agree that things could be clearer, and Andres has produced a great
document that we can build on. The old one had gotten a bit stale.

However, I think it's pretty obvious that a CF entry should be
accompanied with a patch. It sounds like the timing was awkward but
you still had 28 days to produce a new patch.

Based on internal discussion I'm not so sure about the "pretty obvious"
part. It certainly wasn't that obvious to me, otherwise I'd have
submitted the revised patch earlier - hindsight is 20/20.

I also notice that you submitted 7 patches in this CF but are
reviewing zero.

I've volunteered to review a couple of patches at the FOSDEM Developer
Meeting - I thought Stephen was entering that into the CF app, not sure
where it got lost.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#44David Steele
david@pgmasters.net
In reply to: Tomas Vondra (#43)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 3/2/18 8:54 PM, Tomas Vondra wrote:

On 03/03/2018 02:37 AM, David Steele wrote:

On 3/2/18 8:01 PM, Andres Freund wrote:

On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:

That is somewhat misleading, I think. You're right the last version was
submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
right at the end of the CF. So it's not like the patch was sitting there
with unresolved issues. Based on that review the patch was marked as RWF
and thus not moved to 2018-03 automatically.

I don't see how this changes anything.

I agree that things could be clearer, and Andres has produced a great
document that we can build on. The old one had gotten a bit stale.

However, I think it's pretty obvious that a CF entry should be
accompanied with a patch. It sounds like the timing was awkward but
you still had 28 days to produce a new patch.

Based on internal discussion I'm not so sure about the "pretty obvious"
part. It certainly wasn't that obvious to me, otherwise I'd have
submitted the revised patch earlier - hindsight is 20/20.

Indeed it is. Be assured that nobody takes pleasure in pushing patches,
but we have limited resources and must make some choices.

I also notice that you submitted 7 patches in this CF but are
reviewing zero.

I've volunteered to review a couple of patches at the FOSDEM Developer
Meeting - I thought Stephen was entering that into the CF app, not sure
where it got lost.

There are plenty of patches that need review, so go for it.

Regards,
--
-David
david@pgmasters.net

#45Erik Rijkers
er@xs4all.nl
In reply to: Tomas Vondra (#37)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2018-03-03 01:55, Tomas Vondra wrote:

Hi there,

attached is an updated patch fixing all the reported issues (a bit more
about those below).

Hi,

0007-Track-statistics-for-streaming-spilling.patch won't apply. All
the other patches apply ok.

patch complains with:

patching file doc/src/sgml/monitoring.sgml
patching file src/backend/catalog/system_views.sql
Hunk #1 succeeded at 734 (offset 2 lines).
patching file src/backend/replication/logical/reorderbuffer.c
patching file src/backend/replication/walsender.c
patching file src/include/catalog/pg_proc.h
Hunk #1 FAILED at 2903.
1 out of 1 hunk FAILED -- saving rejects to file
src/include/catalog/pg_proc.h.rej
patching file src/include/replication/reorderbuffer.h
patching file src/include/replication/walsender_private.h
patching file src/test/regress/expected/rules.out
Hunk #1 succeeded at 1861 (offset 2 lines).

Attached the produced reject file.

thanks,

Erik Rijkers

Attachments:

pg_proc.h.rejtext/x-diff; name=pg_proc.h.rejDownload
--- src/include/catalog/pg_proc.h
+++ src/include/catalog/pg_proc.h
@@ -2903,7 +2903,7 @@
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25,20,20,20,20,20,20}" "{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
#46Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Erik Rijkers (#45)
9 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 03/03/2018 06:19 AM, Erik Rijkers wrote:

On 2018-03-03 01:55, Tomas Vondra wrote:

Hi there,

attached is an updated patch fixing all the reported issues (a bit more
about those below).

Hi,

0007-Track-statistics-for-streaming-spilling.patch won't apply. All
the other patches apply ok.

patch complains with:

patching file doc/src/sgml/monitoring.sgml
patching file src/backend/catalog/system_views.sql
Hunk #1 succeeded at 734 (offset 2 lines).
patching file src/backend/replication/logical/reorderbuffer.c
patching file src/backend/replication/walsender.c
patching file src/include/catalog/pg_proc.h
Hunk #1 FAILED at 2903.
1 out of 1 hunk FAILED -- saving rejects to file
src/include/catalog/pg_proc.h.rej
patching file src/include/replication/reorderbuffer.h
patching file src/include/replication/walsender_private.h
patching file src/test/regress/expected/rules.out
Hunk #1 succeeded at 1861 (offset 2 lines).

Attached the produced reject file.

Yeah, that's due to fd1a421fe66 which changed columns in pg_proc.h.
Attached is a rebased patch, fixing this.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer.patch.gzapplication/gzip; name=0001-Introduce-logical_work_mem-to-limit-ReorderBuffer.patch.gzDownload
0002-Immediatel-WAL-log-assignments.patch.gzapplication/gzip; name=0002-Immediatel-WAL-log-assignments.patch.gzDownload
0003-Issue-individual-invalidations-with-wal_level-logica.patch.gzapplication/gzip; name=0003-Issue-individual-invalidations-with-wal_level-logica.patch.gzDownload
0004-Extend-the-output-plugin-API-with-stream-methods.patch.gzapplication/gzip; name=0004-Extend-the-output-plugin-API-with-stream-methods.patch.gzDownload
0005-Implement-streaming-mode-in-ReorderBuffer.patch.gzapplication/gzip; name=0005-Implement-streaming-mode-in-ReorderBuffer.patch.gzDownload
0006-Add-support-for-streaming-to-built-in-replication.patch.gzapplication/gzip; name=0006-Add-support-for-streaming-to-built-in-replication.patch.gzDownload
0007-Track-statistics-for-streaming-spilling.patch.gzapplication/gzip; name=0007-Track-statistics-for-streaming-spilling.patch.gzDownload
0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as_su.patch.gzapplication/gzip; name=0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as_su.patch.gzDownload
0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch.gzapplication/gzip; name=0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch.gzDownload
#47Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Tomas Vondra (#46)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

I think this patch is not going to be ready for PG11.

- It depends on some work in the thread "logical decoding of two-phase
transactions", which is still in progress.

- Various details in the logical_work_mem patch (0001) are unresolved.

- This being partially a performance feature, we haven't seen any
performance tests (e.g., which settings result in which latencies under
which workloads).

That said, the feature seems useful and desirable, and the
implementation makes sense. There are documentation and tests. But
there is a significant amount of design and coding work still necessary.

Attached is a fixup patch that I needed to make it compile.

The last two patches in your series (0008, 0009) are labeled as bug
fixes. Would you like to argue that they should be applied
independently of the rest of the feature?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-fixup-Track-statistics-for-streaming-spilling.patchtext/plain; charset=UTF-8; name=0001-fixup-Track-statistics-for-streaming-spilling.patch; x-mac-creator=0; x-mac-type=0Download
From 7ac3c2b16f9976c75a0feea3131a36bdf50da2f8 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Fri, 9 Mar 2018 10:50:33 -0500
Subject: [PATCH] fixup! Track statistics for streaming/spilling

---
 src/include/catalog/pg_proc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 9d6c88f0c1..f1cea24379 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2901,7 +2901,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25,20,20,20,20,20,20}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25,20,20,20,20,20,20}" "{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
-- 
2.16.2

#48Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Peter Eisentraut (#18)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 11.01.2018 22:41, Peter Eisentraut wrote:

On 12/22/17 23:57, Tomas Vondra wrote:

PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

I would like to see some more discussion on this, but I think not a lot
of people understand the details, so I'll try to write up an explanation
here. This code is also somewhat new to me, so please correct me if
there are inaccuracies, while keeping in mind that I'm trying to simplify.

The data in the WAL is written as it happens, so the changes belonging
to different transactions are all mixed together. One of the jobs of
logical decoding is to reassemble the changes belonging to each
transaction. The top-level data structure for that is the infamous
ReorderBuffer. So as it reads the WAL and sees something about a
transaction, it keeps a copy of that change in memory, indexed by
transaction ID (ReorderBufferChange). When the transaction commits, the
accumulated changes are passed to the output plugin and then freed. If
the transaction aborts, then changes are just thrown away.

So when logical decoding is active, a copy of the changes for each
active transaction is kept in memory (once per walsender).

More precisely, the above happens for each subtransaction. When the
top-level transaction commits, it finds all its subtransactions in the
ReorderBuffer, reassembles everything in the right order, then invokes
the output plugin.

All this could end up using an unbounded amount of memory, so there is a
mechanism to spill changes to disk. The way this currently works is
hardcoded, and this patch proposes to change that.

Currently, when a transaction or subtransaction has accumulated 4096
changes, it is spilled to disk. When the top-level transaction commits,
things are read back from disk to do the final processing mentioned above.
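
(For reference, the pre-patch check looks roughly like the following,
paraphrased from reorderbuffer.c - treat the exact names as
approximate:)

    /*
     * Pre-patch behaviour, roughly: each (sub)transaction is spilled
     * on its own once it accumulates max_changes_in_memory (4096)
     * changes, regardless of how much memory those changes use.
     */
    static void
    ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
    {
        if (txn->nentries_mem >= max_changes_in_memory)
            ReorderBufferSerializeTXN(rb, txn); /* spill this txn to disk */
    }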

This all works mostly fine, but you can construct some more extreme
cases where this can blow up.

Here is a mundane example. Let's say a change entry takes 100 bytes (it
might contain a new row, or an update key and some new column values,
for example). If you have 100 concurrent active sessions and no
subtransactions, then logical decoding memory is bounded by 4096 * 100 *
100 = 40 MB (per walsender) before things spill to disk.

Now let's say you are using a lot of subtransactions, because you are
using PL functions, exception handling, triggers, doing batch updates.
If you have 200 subtransactions on average per concurrent session, the
memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
(per walsender). And so on. If you have more concurrent sessions or
larger changes or more subtransactions, you'll use much more than those
8 GB. And if you don't have those 8 GB, then you're stuck at this point.

That is the consideration when we record changes, but we also need
memory when we do the final processing at commit time. That is slightly
less problematic because we only process one top-level transaction at a
time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
(without the concurrent sessions factor).

So, this patch proposes to improve this as follows:

- We compute the actual size of each ReorderBufferChange and keep a
running tally for each transaction, instead of just counting the number
of changes.

- We have a configuration setting that allows us to change the limit
instead of the hardcoded 4096. The configuration setting is also in
terms of memory, not in number of changes.

- The configuration setting is for the total memory usage per decoding
session, not per subtransaction. (So we also keep a running tally for
the entire ReorderBuffer.)
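
(To illustrate, a minimal sketch of that accounting, with hypothetical
names - neither update_memory_accounting nor compute_change_size is the
patch's actual code:)

    /*
     * Sketch only: on every change queued into the ReorderBuffer,
     * update the per-transaction tally and the buffer-wide total,
     * then compare the total against the logical_work_mem limit.
     */
    static void
    update_memory_accounting(ReorderBuffer *rb, ReorderBufferTXN *txn,
                             ReorderBufferChange *change)
    {
        Size    sz = compute_change_size(change);   /* hypothetical */

        txn->size += sz;    /* running tally for this transaction */
        rb->size += sz;     /* running tally for the whole buffer */

        if (rb->size >= logical_work_mem * 1024L)
            evict_some_transaction(rb);     /* hypothetical, see below */
    }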

There are two open issues with this patch:

One, this mechanism only applies when recording changes. The processing
at commit time still uses the previous hardcoded mechanism. The reason
for this is, AFAIU, that as things currently work, you have to have all
subtransactions in memory to do the final processing. There are some
proposals to change this as well, but they are more involved. Arguably,
per my explanation above, memory use at commit time is less likely to be
a problem.

Two, what to do when the memory limit is reached. With the old
accounting, this was easy, because we'd decide for each subtransaction
independently whether to spill it to disk, when it has reached its 4096
limit. Now, we are looking at a global limit, so we have to find a
transaction to spill in some other way. The proposed patch searches
through the entire list of transactions to find the largest one. But as
the patch says:

"XXX With many subtransactions this might be quite slow, because we'll
have to walk through all of them. There are some options how we could
improve that: (a) maintain some secondary structure with transactions
sorted by amount of changes, (b) not looking for the entirely largest
transaction, but e.g. for transaction using at least some fraction of
the memory limit, and (c) evicting multiple transactions at once, e.g.
to free a given portion of the memory limit (e.g. 50%)."

(a) would create more overhead for the case where everything fits into
memory, so it seems unattractive. Some combination of (b) and (c) seems
useful, but we'd have to come up with something concrete.
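
(To make the trade-off concrete, here is a rough sketch of the linear
search described above; illustrative only, not the patch's actual
code:)

    /*
     * Sketch: walk all top-level transactions and return the one with
     * the largest running memory tally. With many transactions this
     * walk is exactly the overhead discussed above.
     */
    static ReorderBufferTXN *
    find_largest_txn(ReorderBuffer *rb)
    {
        ReorderBufferTXN *largest = NULL;
        dlist_iter  iter;

        dlist_foreach(iter, &rb->toplevel_by_lsn)
        {
            ReorderBufferTXN *txn =
                dlist_container(ReorderBufferTXN, node, iter.cur);

            if (largest == NULL || txn->size > largest->size)
                largest = txn;
        }

        return largest;
    }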

Thoughts?

I am very sorry that I did not notice this thread before.
Spilling to disk in the reorder buffer is the main factor limiting the
speed of importing data in multimaster and shardman (sharding based on
FDW with redundancy provided by LR).
This is why we have thought a lot about possible ways of addressing
this issue.
Right now the data of a huge transaction is written to disk three times
before it is applied at the replica, and obviously read three times as
well. First it is saved in the WAL, then spilled to disk by the reorder
buffer, and once again spilled to disk at the replica before being
assigned to a particular apply worker (the last step is specific to
multimaster, which can apply received transactions concurrently).

We considered three different approaches:
1. Streaming. It is similar to the proposed patch; the main difference
is that we do not want to spill the transaction into a temporary file
at the replica, but to apply it immediately in a separate backend and
abort the transaction if it is aborted at the master. Certainly it will
work only with 2PC.
2. Elimination of spilling by rescanning the WAL.
3. Bypassing the WAL: add hooks to heapam to buffer changes and
propagate them immediately to the replica, applying them in a dedicated
backend. I have implemented a prototype of such replication. With one
replica it shows about a 1.5x slowdown compared with standalone/async
LR and about a 2-3x improvement compared with sync LR. For two replicas
the result is 2x slower than async LR and 2-8 times faster than sync LR
(depending on the number of concurrent connections).

Approach 3 seems to be specific to multimaster/shardman, so most likely
it cannot be considered for general LR. So I want to compare 1 and 2.
Did you ever think about something like 2?

Right now the proposed patch just moves the spilling to a file from the
master to the replica. It can still make sense, to avoid memory
overflow and reduce disk IO at the master. But if we have just one huge
transaction (COPY) importing gigabytes of data into the database, then
performance will be almost the same with or without your patch. The
only difference is where we serialize the transaction: at the master or
at the replica side. In this sense the patch doesn't solve the problem
of slow loading of large bulks of data through LR.

Alternatively (approach 2), we can have a small in-memory buffer for
the decoded transaction and remember the LSN and snapshot of the start
of this transaction.
In case of buffer overflow we just continue the WAL traversal until we
reach the end of the transaction. After that we restart scanning the
WAL from the beginning of this transaction, and on this second pass
send the changes directly to the output plugin. So we have to scan the
WAL several times, but we do not need to spill anything to disk, either
at the publisher or at the subscriber side.
Certainly this approach will be inefficient if we have several long
interleaving transactions. But in most customer use cases we have
observed so far there is just one huge transaction performing a bulk
load.
Maybe I have missed something, but this approach seems easier to
implement than transaction streaming, and it doesn't require any
changes to the output plugin API.
I realize that it is a little bit late to ask this question when your
patch is almost ready, but what do you think about it? Are there some
pitfalls with this approach?
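
(A rough pseudocode sketch of approach 2, purely to illustrate the
idea; all names here are made up:)

    /* while decoding, on buffer overflow for a transaction */
    if (txn->in_memory_size > buffer_limit)
    {
        remember(txn->start_lsn, txn->base_snapshot);   /* hypothetical */
        discard_buffered_changes(txn);
        txn->overflowed = true;
    }

    /* at commit time, do a second pass over the WAL */
    if (txn->overflowed)
        rescan_wal_from(txn->start_lsn, txn->base_snapshot,
                        output_plugin);                 /* hypothetical */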

There is one more aspect and performance problem with LR that we have
faced in shardman: if there are several publications for different
subsets of tables at one instance, then the WAL senders have to do a
lot of useless work. They decode transactions which have no relation to
their publication, but a WAL sender doesn't know that until it reaches
the end of the transaction. What is worse: if a transaction is huge,
then all WAL senders will spill it to disk even though only one of them
actually needs it. So the data of a huge transaction is written not
three times, but N times, where N is the number of publications. The
only solution to this problem we can imagine is to let the backend
somehow inform the WAL sender (through a shared message queue?) about
the LSNs it should consider. In this case the WAL sender can skip large
portions of the WAL without decoding them. We would also like to know
2ndQuadrant's opinion about this idea.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#49Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Peter Eisentraut (#47)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

This patch set was not updated for the 2018-07 commitfest, so moved to -09.

On 09.03.18 17:07, Peter Eisentraut wrote:

I think this patch is not going to be ready for PG11.

- It depends on some work in the thread "logical decoding of two-phase
transactions", which is still in progress.

- Various details in the logical_work_mem patch (0001) are unresolved.

- This being partially a performance feature, we haven't seen any
performance tests (e.g., which settings result in which latencies under
which workloads).

That said, the feature seems useful and desirable, and the
implementation makes sense. There are documentation and tests. But
there is a significant amount of design and coding work still necessary.

Attached is a fixup patch that I needed to make it compile.

The last two patches in your series (0008, 0009) are labeled as bug
fixes. Would you like to argue that they should be applied
independently of the rest of the feature?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#50Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#46)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Mar 03, 2018 at 03:52:40PM +0100, Tomas Vondra wrote:

Yeah, that's due to fd1a421fe66 which changed columns in pg_proc.h.
Attached is a rebased patch, fixing this.

The latest patch set does not apply anymore, and had no activity for the
last two months, so I am marking it as returned with feedback.
--
Michael

#51Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Eisentraut (#49)
8 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi,

Attached is an updated version of this patch series. It's meant to be
applied on top of the 2pc decoding patch [1], because streaming of
in-progress transactions requires handling of concurrent aborts. So it
may or may not apply directly to master, I'm not sure - unfortunately
that's likely to confuse the cputube thing, but I don't want to include
the 2pc decoding bits here because that would be just confusing.

If needed, the part introducing logical_work_mem limit for ReorderBuffer
can be separated and committed independently, but I do expect this to be
committed after the 2pc decoding patch so I've left it like this.

This new version is mostly just a rebase to current master (or almost,
because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
but it also addresses the new stuff committed since last version (most
importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
subxact assignments, where the assignment was included in records with
XID=0, essentially failing to track the subxact properly.

For the logical_work_mem part, I think this is quite solid. The main
question is how to pick transactions for eviction. For now it uses the
same approach as master (i.e. picking the largest top-level transaction,
although measured by amount of memory and not just number of changes).

But I've realized that may not work that well with the Generation
context, because unlike AllocSet it does not reuse memory. That's nice
as it allows freeing old blocks (which AllocSet can't), but it means a
small transaction can have a change on an old block, preventing it from
being freed. That is something we have in pg11 already, because that's
where the Generation context got introduced - I haven't seen this issue
in practice, but we might need to do something about it.

In any case, I'm thinking we may need to pick a different eviction
algorithm - say, evicting the transaction with the oldest change (and
looping until we release at least one block in the Generation context),
or maybe looking for blocks mixing changes from the smallest number of
transactions, or something like that. Other ideas are welcome. I don't
think the exact algorithm is particularly critical, because it's meant
to be triggered only very rarely (i.e. pick logical_work_mem high
enough).
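
(Sketched as hypothetical pseudocode, the oldest-change variant might
look like this - again just an illustration of the loop, not real
code:)

    /*
     * Sketch: evict transactions holding the oldest changes until the
     * Generation context actually releases a block. A block is freed
     * only once all chunks on it are freed, so a single eviction may
     * reduce the accounted size without returning any memory.
     */
    do
    {
        ReorderBufferTXN *txn = txn_with_oldest_change(rb); /* hypothetical */

        if (txn == NULL)
            break;              /* nothing left to evict */

        ReorderBufferSerializeTXN(rb, txn); /* spill its changes to disk */
    } while (generation_blocks_released(rb->context) == 0); /* hypothetical */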

The in-progress streaming is mostly a mechanical extension of existing
functionality (new methods in various APIs, ...) and a refactoring of
ReorderBuffer to handle incremental decoding. I'm sure it'd benefit
from reviews, of course.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Add-logical_work_mem-to-limit-ReorderBuffer-20181216.patch.gz (application/gzip)
0002-Immediately-WAL-log-assignments-20181216.patch.gz (application/gzip)
0003-Issue-individual-invalidations-with-wal_lev-20181216.patch.gz (application/gzip)
0004-Extend-the-output-plugin-API-with-stream-me-20181216.patch.gz (application/gzip)
0005-Implement-streaming-mode-in-ReorderBuffer-20181216.patch.gz (application/gzip)
0006-Add-support-for-streaming-to-built-in-repli-20181216.patch.gz (application/gzip)
0007-Track-statistics-for-streaming-spilling-20181216.patch.gz (application/gzip)
0008-BUGFIX-set-final_lsn-for-subxacts-before-cl-20181216.patch.gz (application/gzip)
#52 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#51)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

FWIW the original CF entry in 2018-07 [1] was marked as Returned with
Feedback (RWF). I'm not sure what the right way to resubmit such
patches is, so I've created a new entry in 2019-01 [2] referencing the
same hackers thread (and with the same authors/reviewers metadata).

[1]: https://commitfest.postgresql.org/19/1429/
[2]: https://commitfest.postgresql.org/21/1927/

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#53 Alexey Kondratov
a.kondratov@postgrespro.ru
In reply to: Tomas Vondra (#51)
3 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi Tomas,

> This new version is mostly just a rebase to current master (or almost -
> because of minor bitrot, the 2pc decoding patch only applies to
> 29180e5d78), but it also addresses the new stuff committed since the
> last version (most importantly, decoding of TRUNCATE). It also fixes a
> bug in the WAL-logging of subxact assignments, where the assignment was
> included in records with XID=0, essentially failing to track the
> subxact properly.

I started reviewing your patch about a month ago and tried to do an
in-depth review, since I am very interested in this patch too. The new
version does not apply to master at 29180e5d78, but everything is OK
after applying the 2pc patch first. Anyway, I guess this may complicate
further testing and review, since any potential reviewer has to take
both patches into account at once. The previous version applied to
master and worked fine for me on its own (except for a few
patch-specific issues, which I try to explain below).

Patch review
========

First of all, I want to say thank you for such a huge amount of work.
Here are some problems which I have found and hopefully fixed with my
additional patch (please find it attached; it should apply on top of
the last commit of your newest patch version):

1) The most important issue is that your tap tests were broken: the
option "WITH (streaming=true)" was missing from the CREATE SUBSCRIPTION
statement, so the spilling mechanism was being tested rather than
streaming.

2) After fixing the tests, the first one (simple streaming) immediately
fails because of a segmentation fault in the logical replication
worker. It happens because the worker tries to call stream_cleanup_files
inside stream_open_file at stream start while nxids is zero; the
counter then goes negative and everything crashes. Something similar
may happen with the xids array, so I added two checks there (see the
sketch below).
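
In toy form, the added guards amount to something like this - toy state
of mine, not the actual stream_cleanup_files / stream_open_file code:

#include <stdlib.h>

/* Toy per-worker bookkeeping for streamed-transaction files. */
typedef struct ToyStreamState
{
    unsigned int *xids;     /* xids that have stream files open */
    int           nxids;
} ToyStreamState;

static void
toy_stream_cleanup(ToyStreamState *state)
{
    /*
     * The added check: at stream start nothing is registered yet, so
     * bail out instead of driving nxids negative and crashing later.
     */
    if (state->xids == NULL || state->nxids <= 0)
        return;

    if (--state->nxids == 0)
    {
        free(state->xids);
        state->xids = NULL;
    }
}

int
main(void)
{
    ToyStreamState state = {NULL, 0};

    toy_stream_cleanup(&state);     /* previously crashed; now a no-op */
    return 0;
}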

3) The next problem is much more critical and concerns the historic
MVCC visibility rules. Previously, the walsender started decoding a
transaction at commit time, so we were able to resolve all xmin/xmax
combocids to cmin/cmax, build the tuplecids hash and so on; but now we
start doing all these things on the fly.

Thus, a rather difficult situation arises: HeapTupleSatisfiesHistoricMVCC
tries to validate catalog tuples which are currently in the future
relative to the decoder's position inside the transaction. E.g. we may
want to resolve the cmin/cmax of a tuple which was created with cid 3
and deleted with cid 5 while we are currently at cid 4, so our
tuplecids hash is not complete enough to handle such a case.

I have updated the HeapTupleSatisfiesHistoricMVCC visibility rules with
two changes:

/*
 * If we accidentally see a tuple from our own transaction but cannot
 * resolve its cmin, it is probably from the future, so drop it.
 */
if (!resolved)
    return false;

and

/*
 * If we accidentally see a tuple from our own transaction but cannot
 * resolve its cmax, or cmax == InvalidCommandId, it is probably still
 * valid, so accept it.
 */
if (!resolved || cmax == InvalidCommandId)
    return true;
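
Putting the two branches together, a self-contained toy of the adjusted
logic might read as follows - the names and the "resolved" flags are my
stand-ins for the tuplecids lookup, not the actual
HeapTupleSatisfiesHistoricMVCC code:

#include <stdbool.h>

typedef unsigned int CommandId;
#define InvalidCommandId ((CommandId) 0xFFFFFFFF)

/*
 * Toy visibility test for a catalog tuple from our own transaction.
 * cmin_resolved/cmax_resolved model whether the (possibly incomplete)
 * tuplecids hash could map the combocid to a real cmin/cmax.
 */
static bool
toy_historic_visible(bool cmin_resolved, CommandId cmin,
                     bool cmax_resolved, CommandId cmax,
                     CommandId curcid)
{
    /* Unresolvable cmin: probably created in our future, so hide it. */
    if (!cmin_resolved)
        return false;
    if (cmin >= curcid)
        return false;       /* created at/after the current command */

    /* Unresolvable or invalid cmax: deletion not visible, show it. */
    if (!cmax_resolved || cmax == InvalidCommandId)
        return true;
    return cmax >= curcid;  /* deleted by a later command */
}

int
main(void)
{
    /* tuple created at cid 3, deleted at cid 5, decoder at cid 4 */
    return toy_historic_visible(true, 3, false, 0, 4) ? 0 : 1;
}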

4) There was a problem with marking the top-level transaction as having
catalog changes if one of its subtransactions does. It caused a problem
with DDL statements issued just after a subtransaction start
(savepoint): data from the new columns was not replicated.

5) A similar issue exists with schema sending. You send the schema only
once per (sub)transaction (IIRC), while we have to resend it after
every catalog change: invalidation execution, snapshot rebuild, adding
new tuple cids. So I ended up adding an is_schema_send flag to
ReorderBufferTXN, since it is easy to set inside the reorder buffer and
to read in the output plugin (see the sketch below). Probably we should
choose a better place for this flag.
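
In toy form, the flag works roughly like this - ToyRBTxn is my stand-in
for ReorderBufferTXN, not the actual struct:

#include <stdbool.h>

/* Toy stand-in for ReorderBufferTXN with the new flag. */
typedef struct ToyRBTxn
{
    unsigned int xid;
    bool         is_schema_send;  /* must the plugin (re)send schema? */
} ToyRBTxn;

/* Reorder buffer side: any catalog change forces a schema resend. */
static void
toy_on_catalog_change(ToyRBTxn *txn)
{
    txn->is_schema_send = true;
}

/* Output plugin side: send the schema at most once per request. */
static bool
toy_need_schema(ToyRBTxn *txn)
{
    if (!txn->is_schema_send)
        return false;
    txn->is_schema_send = false;    /* caller sends the schema now */
    return true;
}

int
main(void)
{
    ToyRBTxn txn = {1000, false};

    toy_on_catalog_change(&txn);            /* e.g. after a DDL */
    return toy_need_schema(&txn) ? 0 : 1;   /* schema resent once */
}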

6) To better handle all these tricky cases I added a new tap test -
014_stream_tough_ddl.pl - which consists of a really tough combination
of DDL, DML, savepoints and ROLLBACK/RELEASE in a single transaction.

I marked all my fixes and every questionable place with a comment and a
"TOCHECK:" label for easy search. Removing pretty much any of these
fixes makes the tests fail with a segmentation fault or a replication
mismatch. Though I mostly read and tested the old version of the patch,
after a quick look it seems that all these fixes apply to the new
version as well.

Performance
========

I have also performed a series of performance tests, and found that the
patch adds a huge overhead in the case of a large transaction
consisting of many small rows, e.g.:

CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double
precision);

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
SELECT round(random()*10), random(), random()*142
FROM generate_series(1, 1000000) s(i);

Execution Time: 2407.709 ms
Total Time: 11494.238 ms (00:11.494)

With synchronous_standby_names and 64 MB logical_work_mem it takes up
to x5 longer than plain execution, while without the patch it is about
x2. In other words, the replication overhead on top of the ~2.4 s
execution grows from roughly 2.4 s to roughly 9.1 s, so logical
replication with streaming is approximately x4 slower for similar
transactions.

However, dealing with large transactions consisting of a small number of
large rows is much better:

CREATE TABLE large_text (t TEXT);

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 125);

Execution Time: 3545.642 ms
Total Time: 7678.617 ms (00:07.679)

It is around the same x2 as without the patch. If anyone is interested,
I have also attached flame graphs of the walsender and the logical
replication worker taken during processing of the first (numeric)
transaction.

Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

Attachments:

walsender_new_perf.svg.zip (application/zip)
logical_repl_worker_new_perf.svg.zip (application/zip)
@�i
%�Z'J������������?~V��GW���ke����A}���/f&�|[�'���������Xn��~��U=����e���X�Za�q����h8�82�=D�g���NW�����8��QpfO�&���D"$��n����~(��2��!�kraF:#�����:���gDg�evg0�����h:`�
����*�Kg������5���
����{8`�2]�wI
+1������Y�_��*��-Q���AI�\���Q�P$��xw�	�*v��3b�]�{�������,q�����
�����=M��]{�p|}r���N���cg��t��;�!]L�����b�XKA`��!!����'
�)�����f��EY��/�K���q&���i��	I��������W����
2�'�1$LD�?���l����3���*
+�m����d����e��qTQ8L��`Z���X�`��'����O2 ���ZS�%������(� �Go6M����U�on�E$�H9q���7��P��1�1lk��[8]]��F�GQ4��+Q�R����oW�:}c�y4�$�q0b��=I��DUO������d�C��!15lS^�]ow�+3ulb/"$$����f��eU�5r[�ur��5A�R�N���	��$���#q��2�?6�g�>;	���w����*�H��*/�r��i�O;�M����6���"���,2���u]%:�\CdT��Jg����|����p$D�Xvd2T����{������I<2�A^yVg�Ww�
_d��}���h�$E�8�9S�x���|���.�~�e��F���5(��:�����%���.�e	�lt)�����*D��$��n�?8�q�9k���2�Ez�MW�AQ���7���3�$E	#��R����>�<s�5��c$	��'�mB��|��� �oBx�z�i6x�V��E�[5��������2V�<�Y"W�?��6��7��s��@>(��C��0��f����IJqOg��-�Q\�b�i�8��Ni0��	�;w���6+Y���J��G�`C*By����
?�H�&���!24u"�����r.�I3W����|��)�Q�&�u����N�p��-1;�bp�=�
U�w"�D�.i��\��c>�LQ���I���'���$G�&���2�O�bI&��teP*I�JD��I��4�be����n�xs��Q/2'��x/��"9�-���k���Q��Ia T��H<��C��wQ�l��$�4�GR�B�����x.���\��'��k1������o��J��)|�Z�JG@�b�U�]�Qt��C��)�-Q:�6�,�NS��������=]��#b�H�]l�W���oU�����l��A�����\:��x�Q�p99q�����C������'i��H.�f7Z�����(�r&,�9�3�<��I��_�#&Ng/�m=F�tnX�$Re�N}���N7F�?;��1�����X����<A��Uq�M��qi�M�W��V��}n3TD��m�	2�`/�q U|�&�^�>���E�b���y=�������Xe�E�q7���&"e����L�9pNrF�n����9����*C�z��N�A=(��F�f�|8@�}�������6�����G�)&qQ�.I�b2W\O��J�iUR�QS
m;���b�U�����q�~���P�d�:I�k�1|ZA�������������8�c�Qb`�;���VN���8� RVk�_��,�)�9�)��v�k���!���]�I�
Hw���������K)�����7��&+m(�%��5�*\�����u3a6�B�����2�O/#��]��q�9DY�����%#i+�����d��)��Z��b��1]�4-+.S��8$��?]���Y�[��Wg������HYHt�zb""��E���+��)�����!���;��C���:JWa��#��$�!1R����V��"]�Z���&O�QiGV�����ma��(
�����c`�U�sfE�%��`�#�@�����:G�G��(����p)it�>v����I:�r!�"�.��>��=���f������?@��{J��&!c�l�@c��Ek�l"��Ef���`u��B;�g������
+t�����2�Gb�pd>��^h�����T�y�����Q�m>.��*
{�lZgR0pN���DF[:<��hd�����m~.��2��
w1������
�K�����2r%eZ�~m�(����In�:�+���.��E�>XU��,)qR.�����'�lI���$����T���'e�d`k�g=�S�]��eC���Y$Ux1��=�7���h����ZBs
|l,_3Z����"K�������=�W��
��TON��&�]���'�(��e;���H�*��.��L�[!���ij9��g��dHC�xZ�Q���"o��_����I���1������.��S����(����xR�#��c��tZP�P�d����
��2
�,�~9��!jGr��!}9r�i��W1��HH�7�����6 ��������\m��7a2\�TV�d����(���=� |3��_I[%�8z�wh������r����c�b�i&18K�)��;���rJ�G)�������h�(�O�|x�����NOj��:����C���'y2Q^�����Y�=U��=��jH{M�P#����"RTH$Rl�H��l��l!�T����[�`�#l@�@$��z���J�{�r���^�P�4�9�(����8U�7�3�r���-pb��H�]?��/eu1��
Z�����h�(�>��b�����b!�������+���?�S���d�AK�0�Il�`X�V�y�������7�+�����c��"/�_^����������j�������(^�F.�Lcp�������\����o�kuxw���A$���g�8u��$�{�s*��[����D&���K����p���' ��������2��5���%�$�xA��M?��"��{H��0c��jyE!�9�"rCa^L�ug��oT�`���t�l�_V���y��F�^#k{��
Rby�s�x�@�������������DR6�saI34�a�nD_��	���2>U�k\|��}�+�vg�s���	O��,�lG<��i]/"�O��gu�����%b��h��|c��O�+U����))���&0{��V
�TdH��:��#��C[�Q&�Pds$��\s���j.�\��s��i������M����F��0')G��g�Zpx:�@+p=�q�B��@��<>b�0�"c��pB����>�]�{���~�a8�
4,P�QM%/6R��T>�Ai;3���S���o�B+�5u����I*+ +�.��8�+c�Y�����Q�!-�n^2	��_O�/��6
�uk���1n7�G�]D�)�^��|�
c_��d�H��p��s|#H���dF�uGnr�
��,GqC��oD{�p���b�oL�����Dsy��������[�!r����Y1}-G���E�����g�?���f���������$�`�>�����%�Y
��'�!!
��ynLf�5u^��}y(��]���~����9nsvk�g����L����Lnzu�����$DX���Q\�s;�o+�LW��%���h+�}�����WP��uO��~�Qt��n8��	X����`��S�7��Pu���<_�ST���
���	����*�X\����Cmi���H^��M�M �$!����8f�H����i�=�
����Lh�+�$����i��y?/��%l�'��br�dSx}���ui���hf4������c�5�>����6m������_����H;]��q����",7��0��l�2�+���~?�e���j����;r���BLW�%
�u��:����'^�����c��&������U�7�c	���Qo� 
n.8]�J2�_X����K@nZ��p���c|�@K�i���f�9Rp
�#I�?��z9?V3��#����< ��"�fbS$�b~N���U+A��G�I)J@��=: ��o5ZZE�L���TY2�����Ib�f�.�DA�
�Wb��@h�\QU��_�n�����pd7$��(�c}aY}�R^���8��n�qn[����C%^�G%
k	g~�v�P\���v��;�]h��-{�]����e_N�f��Et�C�"��>����|�%��v���3���?��W_�l���V\����Q�����Xp,?N�
I����������/�;�'�*�2���H�H���	�1QW[����	H~�����1}����X�v�vQ���Q����l�=!�/�c�'0i��xe����+�
*.
����|�o2��%�B�of�((�d,���n�cYP�4A	<��v���\�I/DE�W`�SJO[���"~z�s�3��n2�7�A�G�i��L�	��=q�-�6C�o"I]3�D-��%0�\_)�{.����v_�����P�#�6C�,�D���2N)��Z���������rw.Ahu�v&�*{�I�h�����$��aP��n��w�Z��R�����2��F8��T��0���0����,G�W\�:��7�t���T!N��i
aAE*RJ8�{0�./�v�l��62�5�4�L��HsS�U*���r�������.us���=S6�7�h��z�����^0��S��E*�~??���mW7��2�)#�����Ef�Ha���a���GQt����� ���jt8Z�r�9"����n~=7������6*�A��`|@au�E����T����6����r��}���z_�w��:�^(���f�
��,�����Fu0�D��T�����i�q�BF������S���Wb��j[GE}���`
w,d}�V8��3������q@�w���I}7���?�>4��.S�u�m4z!R�>.o�26A~�������o��e���2nz�xs�i��K�9w�Q��,���y����
���������������I���	���{s�QVD�3(�}���xFSYn�bSc8�����X��	*�!�c��"���c	R�!6��i�,��>�����M�>�4��l,����(��"'N�"b�
��	�R8$�#�c�X�JSx	V�9�H{w�<��4�`������E�=�	0�g������c�U�����|�gv��L����]����`A5�4H�z[wP����� ����^3�1��P#�kf�L�����[_��n&p�K�*3S��dp��Z������+��QQl�a��W�3����{�|��r�L�����������y&�-��q��l��0����W#��{4!�|��p�Q�si~�t��	@"6r�R��r������
��jB{��m����TN�t�����m�@����1�n�kw�<�-p+��.2��c����3����In���I�;1����=��	�up1�n�I��~c���K]Z{��������.-�a�����s�6��b����5 e����C��^�:�����\�@~H��s�>�����R:E5���)=`���Ji�
�tB��^��T?�y���Py��]mi��*��9�;,z��3�pZU���L�{h��3����t��j�
������l�-!���&
�q�nQ8���ErCh2����,�,y&
����0�C�'7��B:�UN�T��o�����8�~����D����C8��~5���R�P�N����2iT2��������/3��������^v�=B��7��A���9R��x���|��|r��f7���8�E����2r�zO`�l?�G�Z��4�*�0H5��M��&@Wx}m��1=�)V;�G|�/ B�u�K�V��_\�e{�������~_i�����q��Kt�'(��S��nIe��U�F���>QU/�d�L!�b����h�/���.-�m"���v��s��H��oium��5?^�6������1*��LP�|��[?72�5�����eik����,����w��\�"�����q�aBLgM&�r���PNz#n2�	Z�j�J�Jkdc?[k�P~6�,4"+��9���-�e��r���h���I]�B�~)�s�����������YypDMk�Mc3����Il{��������Qh���$�!����r�`�j����}_��	+6R���4���LC�������*R�����G��2IN�iZC��IA"�I���������R7����`���_&�os��Ae"9��s�\�����:{:�/;C���4fi�B#YZ�M.Si>m��}d����I����M
/R)=���3%�����FQ��0TPGv��:u��Er�9���:� <��Adk������?�l�e��Bm����K��{)!dH���s���O���m��
$�B�K���4��_(�x
o��q��E�)�!��T�����k���b��~W=F���p�Et�]n�!�d(��?O�������g3![��RS�����S��R���I>�'�)���"���M�a���\������8�I:��0�]!���;O�$�4u�p~���/�q����l}H��'84��c����sp�NFv	K
;��������)���p`D7�]����f����7�������P]�$8�����`����K���T��?�c�T�>A��O���������{V�A�?�f��iS�@��%�"�����F9�,i��
!s .`��D����d��P���������w�5���P1=I���"�V
��P��g�>�������_���z�������w� ��D	�sm��m}����S~
�{M���y��^��W�����\����F����c[}��p��*�����3�HnD2e��"c��m��{MH�&�:�3�[]��{�����Q����W���&���1M7E2�r��h��X	�Ha3A�&e*��$@(.1���n�����0$&�d�����hPe����F���?��(�����Ovw._��;r7}���h���L����������n���S�q����v<��Wd���k�~����C�;�jG��j��?x X�yF���G4�n�_}��CS�P��7��&4\#H]Pg��@�)K��o�������Q��$�7���"��u8y�4�w���������R\cRS�����m�@E���I@��LE'm�������Q�8�Q):
�`mI�y�ZI�����}��x�7q���,�u�I$�&>�$�����<��������|*��QQ����	���k#3@2T�as�����n��`��g�-oH����|Er/�A���G37���B����	�������m�1���[�Al�
n�<�����F������Q�8l�D*w�b���:��M}1��u��aacB0���_2L�_��M7{28���8rh3c:6��?G���c�=���2P�	���bLU[����_��w��u�Y5��G�y�G�k9�����R��O���%V�(W�[���&y�f�:C���"�@n�-�j��JFY���$H$����>�7�����{,���T���#���/r��Z'�n "cm/��L%�~��W���K
���6�B�<�rd�7�7+R�� K����a�;[��12'Q���UcPkfNT���bB9�Dx�|�c��i��i?��6M��{�<{���C����+�&����Y�B���5u������p2�@��`��"]�8��d��I�$�"��n�osR��)������L����I�}�Uc>��yM���8�L:tN��n�����n�?�������>�5y�y�a�$3
Z0����
{����	�_�����V6@�mW����
+���|aB� {�
�a*2���/���v����u���gS,'�1!����t��^�YJ���Q>z��5<�,�G�����(�X�|ps%�7��:8�m}����YA��zq4������f�dX��tS8���Os��;��`\������"������?�j[���u�!�L�\�����q�K,�����2r�����z���������F���k���[�L�F(��9@���C*�������'W����)I�L���2Q����(�n\��X,�161��(�T�����pne~
O�R�{�=B5n0��0A.������LXa�T�af��7���w��=���)�{!HWI}��Y$�r<aHyu��_~�������N-s��C��V`k���/mZK�h�$S�� 4�<y<��TgZgT��Q�O6�����~Yu������3��,��\v�6�',��h9��O����a��������J��0����@�B�}���LaD�3@��%T������s��������T<���kq!�L���df��5���#ei��d^���P�G��I.����23:�j#��d~j����* ��$��/��A�H�'I�� r9{L�e��y\���\���qi� ����H��<���������������T��}�+<eXXa��T���M����L�TA�W���M�L�������G2��(����X�����2sBNS�F�h��[�zxsA�HZ$�aZG�H�8����6u����3��F�p���~��|.��}�X����Or�U*��I��I��.��GY��8���C8w��n�MM�
��N�����S�(�5�M@�mr����?/��{h�6��
�q�5r����[c�~2����j�����D���d2���}�S�����ei����6��|�o8������ {�z��`����Q����v/��z|S���W�pv����_�g��A|�����Y�%U�K��Mi~
9?�}�Zqt�0�z�C��3.���������2jn��:���#���������X��KY}�BKs����/5~��������q���2/����[����4���������R��Z�����c�\4���r#�T-p�������nw\�����'ig��&nU�
��P�����O`Ttn��X�`5Q��������5b���6��6H��T��F�|��;��5+��OG�
�/�NH����nd��:��|�\�A@�@�38������1�6�?�WqY����}"�m.,�t�*~��o_����R����<������XG��������]8�RLw�$��1�c��wOz��Z�1����M�WEAr�8������A&�d�5A�M������o�eU���~`��<�����'��i0�D�2�8���pHsI~L%:���X�h��li�5����pt�i�c���!0�7���Om�\���$�V=����v�}��&/(	{\�>RY�D�H�j�o��C]����{d����(����SbinC�@u�$/�O���)hX��i�o�$Z@����n��!��������+r���g��6�N�=��G6k{�������A�������Z ei��1��S-��J-������2�!<��n�\)��.LA����$��7��'���nz���E,N��g�z>x����:�����8?��N?\��7l�5�.O%�*4XB��p������jlK+���,�d����5E���K3�w������PZN�73�%�8��jE7y2��?������5Ov3h�lZ���0����������Y�R��T��'���?������f��P7?��6*�4�#��x���<��[\���*�����,b�����[f��C>P �
�t��8�[V���l���	�cJ������*.��rr����m()��M����^;7��%#VU��A	�({Qi{w�:�,c��#�/���\c���o*"\^�����x=u=��m�"��Z[a-�q�����N'�!���6�b�T5����1^&"��-Z���<3W<:H��Z�q^>����xl�;H<\�.��*����m�nd,�j;����y]�����/�����$<gm-H�vd�sQ�B�����}�F	r����x���i�q��N0��S��l���^1O$��'��'RI�����2��G��H���~����L�*C��I�I��E���P=F�yx�E~�w����dH�,��k1��6���z�2�[N�BS4}�(����'��D�N�\�,�L)�O���mY�?��'��j���_���M�G���[��M�������D�YD���IJZ���f�wg�~*��F���������wU�}�$�����S���p�Gn>Q2\�b��hY~�>���bQ�8E)�n����6e�l@,�r�T*o�C}����R�����!:�����q#pLQ`&a�I��>����
�%@�Gn�h�Q���(��P���lF��O��%�,0��\d8���02��-A�6X���W�|�0xj���{f-�pJ>�-%����s����gVo������|���K���4\-S��v��A]��5C��F|������Js�5H�t��PM����{�����������G��c�*�J|b3��h�M����|�1�-����7U�C��� �Q��%�E\�]���u��Q��2��,�R�$1*]L�����>4���^	�������m��cR�Tv=�x�7����v:D�����<�U�z���PQ��L��������^7�OUW���/���l��ar�v��!�5��bs�%�������~�x�5����M�M}�H������&��Z�����&�Hq�����t-��F'����t����m��O���}�Nr�;57��y���:�����L����}������!e}e��e[Ou}l�����l�r�]%��(O����&J�D~�������s�}afD��E3��^����a����DR��������j�'�/�'�1����;_(�4+���������{9��g?�)G��������|�FC��,�l���'Y4.�k��6z�
m�9I��t����s�D�awS
det\R��<��n}�r{m��EJ�b�a����-�*H���Y�����L���>�������(�N�He���K��z���s�������CtJA�|���	���%m`��U�`zCL��/_���������i�#hQ��s��f4Gj��$�������}������������$��f3$t�`���8�	����  ����+A���:�q��-�<�?���/�v<�\�d@}a3��N2�I�H�������n3(�uI�OT�l1���s�����!�+���<��������W���|KT���-� -�'�*	�F��s�������]8f�r�wnf�/���q�@���i���|����;!k�Wc<��e?BA��b$����p~���/������b����tVx��$�p��T���!�F�L�0�V�49���`�a^�L�����*�#�-��L�	(5��D��'a{��e���C�4�K��tM�3�q�5?��P���W�K}��y����r���#��wL��r��Q�
�:��������&����FA����8����u������k���J?����C�����6�7J��!�"SWm�������,t��f��n���Y���m�h��,��g�Q���i	����5q�I���"�������c[�d�x�'�R��d?}#���jz7���n��� ��3�3�*"9Z�3�n����xP��j�5�����|z�U]��Y�r��?r]|m�Jj^0�"���NW��k3*���2X�Y&A��z�33I����=�����n/f~�r���`��Y��=�[�X9.O>=��M������8��|���>y0���^M��*������o��}�Y.��UWE}�Jx]Y�5[����?5������i����g���'9��j�w)
�q$���&i�����N�"�7�5����G���|���y��M���H���?]���0+�'Ek)
�:X�����W_�{K�������+�f�+�3�P��L��0b��F$6�]6f�����#��q��H��4���q��PdI)=�i����|���1Sjf��5�4+����F�o�R���9����$3��W=��/x��$AN��H��_����*��v��}l�0qKcMo��� ��bhe�IV`m�$��T����B)������}e~�{���7��
��������z�U}������=sv[J�)2M�5��t�|����!��a=���P
��,��)��1	l��|u�p7<�W'���l%��Y������]y�D������4"8����3O`P��O���zy{C����i;^��vVg'q���)��'������}h�Z�J��%�z�C���Z���t��'-,����G��2w���G���S�{�C�&�:8
AG�m8.�2�PL����}w���?+��U������D����9n+@/OFQy��.���2�-���qcL�=C��6�T���&8M�(Gb�l��~q���4�#��$�)NhK
��@*��~�+����OR��,|�$!���M�M���0u�~E
2��od2���G3xo�lq��@Nm�#N3����
T��_��4�f�b
Pz�7SN�9]�������L�H(T*������N������KW�z}�r�.�!�I�	���$9��['4|C���Sui�������NK;%d�
3 ]���Rs�P���:��o1F4�z,ewfj�.�fBPH2�������������CD�>>�<�sU�Y�d��)�����|�4���f�)-A���H82��H���R��^K
i��2Hv�;����l�q�y(n�	p�^������o�jgQP������*��(�M���.��c����>��<������B�o��'IRL��8�-����j�W�4�c��t����_=FXaq�f�pA��i���������`;b^��zh��cF���3���
���,�������L�������*�m� ��F�hK�/��k!��|��m��;P�P��5A��ML!�B�.�0�
�������%�?
��0�sd��|#�A��tG�<�A������[OD;�����P��I�r�3/F3Z>��3������������#wp��<Wf.=E��BQu&��6���������ks�bM��H���(����r�����N��EZ�O@��R|��kf����$�t�)>5�����}��������M�C���6����<������$x��B��?R���2&�H=���p�I�����0�����8�8��{"����<g
�[�j�s��R�>#�2�T����3�Py2�������L���z���?�2�;�3�����1/��>`�m����+4
2�Aa�����9������a�v����q�BS�L���L�H�oBL'?�d��m�x��ek=X{X�M��fn�FQ0�q�,�R��o&�TS��2f�L�C|K������2|�K�z�fs�m����.�� 2L$3V,�X![������w��8vw�Hc�us������$t�2<�A��FQ��(<!�aj��/�BJ������ekz��8�.�a�|�3D�yC��W���i��I#=��(�n]���#F����	=�u}e<fsM=#N m��i@��j=!�5�9_�(W�����C�2-ql����<�n�|<�O��s�3(���G�q`O`"v�b�q��J��%������|uz��!3.J�����/�H�q���Xw�!����GC��m@,�,c(��J��m�j�Y����T����|�
�P�U��H������jwv��������V���j�
�q���&����'����&�":sx����#���w���M+C��y���&�@GT�{���f�������)�]o�#���4�:����;��n ��O7=<�
A<E?��U��M8����f����m��d��Y���u���C���l����u�{
���(��4�g��k�roN6j�(c_~i��7/��G���nX2YH�j�����;��(����<�8mFJ����3��:�c>�'��=�y�Rav�`��j���17��r�����>O�������.���s������(l�����L)��5��$g����H��5�:���i�q���,>��U�i/�,�[�Xm%�p�F-g>����"Ks7���|�Ouc
�����WX���)4��)��orb~$IE=��BS*�	�N�W�P�S���^Q�!(��(���������h��k���^V�]U���<�=�*��:KE8�qa�����?�7����~?�/���D1����}�)�.tj�$�t��j
�'��S�h��;���?����.d���2�LH!*(] Iw��8M�`N�J=����p�ag�-=��SX�<�I�bk ��$9]�]��\K�� &��PaN�6���������r��n���wQ-$������;>9�(����8���U�'OB��`~����H[��!���2\���l!���H��3����#�_{m'O���-)����c~������jP������������$��J����`���\�E������E+Y�K�7.����,]��
{a���F�����B.�4}%6D@��x�F�x��ZH��5���i��J��(���������>��(�O_������zW��1�����S��!	,|�����I���~r�ZF��Z:#>ZD�m2�I��I*�?D,�u{���|�R�p
���Lg��O����4=jw��G��}���i|!�\�
'F�l�K�]�����aV;3.���2������<+��sj�N��<WO���e>�������p�Y�!�|��s�k������l~s6�	`�����~u�\�W��3�.&c7�H8�P'N�a_@��H/|���G����<�����-R����,�S�a��SdB�tg��o�U��}����="��(�G�G#�8�M��6���oY@�77��|�o8�8��N�E���t�����%��e�A����8�Mg���X�J�N8�)��4p$�.�O�
������]}��1#"6����
�Y��Y,�<� T��^l?^�M������f���7�y|L		��@X��e
3��d&�����%L��h�D��r8B��/��<y�����i?����k��D�L���a������nN�TF����iX��Y�[�5���qn���Q7�s2�p'v�6=�o�I����#��4�����)���l�N��,��P�Y�����jw���f��4vmf�tt2V��^�;���=�w��c�Y�|����z�3Iq�vn����~��?/������V��Yl�����\	�����,{P5�7,��u�#T:�����R�����f�G��I(w��/Q��G1T�)��P��[t����������l��8��;���l2TF��I��B�B���������e�y�G5,������=�X����K$*�
�QE*�>����`�����r�����6�}n�Bd�<��<j�)S��N�2�����t�`3�����as��B.�aBM�C�B�e��,�y��}On9�Hn�������G����HM�'�!<G��2�	�02?������s�������DQ�d��G9�`�m�9��}���L��&���r�����AFsyF����f��V�jb�����$Ho��3�S3�X�O�O�r
/���<L����&$���D	�#JSy�G0�w��axVl�W3�;��iK��Jb�s$q�-$���4������UW��7�y�� l�2���x�#Iu*��'��s���eO����������C�7\�f������3�u�@��h{X��,��JPP��3�J@�c}���j����\\�|�6D,��Ttl4�%&�j������l���"%z;At�r]6�$��h������y�;���e����r�cyY��Gd�"��?�N����2���k�u��i��k
�����E���t�g���c10Z�o��n�}9����*�������+l"�s��
����8��?���F\�9D�{q�I$t9��r+��v�;�At�����N�7i��&��dC�i)���d��`�q��=��:�@�sd#�#n�3���'Pp�F�c�7d .�JIn�p������<d�������H���X��[��6Scm���vA���������|�L�x0�"��-��,?��38�@�y��U��L6�/��`#�69M�`)�2��`f�J���?���ky����T��g��t������q�bB����u!M3�����/��q�������d�	�E���r ����d�jf��o-��9�3�Zx�����C���Gk=�q��hV�4�'�|s*��3},�{*c
�-�ToM��`�P^���F0�\T����������Y��=�<@vE���s����}���x���<�t���i�D
��z���l���t��������&���I�L���@.�D�h�3q�(��p����zWvMuj�	�HG���X��
�'n^7��m����O�<�1��?�m�z�vi����(d�������;�����O,����������"=K`mDa��t���dg�CG���?�i5��<�8�5U#������O;S����:*��i�<���
/$�x�C[���N}���������/%/+X���?�7��@��DbJ���
V*��PD�9���G6XI�bS?�7����uDqm�B��<�
�
�L�L��y��{0��s��Z(�H�8�rS�TL�<�O�k�|�#��J�6�	�6i	t��Z��ou�s�]�Qv�9U�"�N���E��/��L�Oj���Wo��L���I�M0:4����	i����u~S��}�/P���o5:`�4(��r�n�#���H@��;�����x�~�1��kh����%�n�k{2Z)�3r�����n��6N/&�uF�f0e$L��N�[:�[=���y&s��n�������l����{Z�e�E��U8R<��T��F��pgB�\�K^d�-���BP��^����o�w�sT�a���v�T����
��f��.����-0s��lp�_��9��n1>��_���S�Mw��.ni(���:R���M�����p���u�5��z7��Gp���0���	�*�C�&�U�T����L����cWGe'��=k� �(E>�f�d�/-�����gd�>3�	����Z���Nz�����K���
��s$^3���i\	������*�k����z�O{��^~��N�8���s�s�K��Z��l����v-�-w�}�M��Z�Lo���W.$:�ss�yH��,�+h�c�LU|���%��=������4%O��$����A�����>�����vY�	���pf����\b�;�N��qA����s�#W����u�`UM$Xj�EX�c���`7t���W�|��/�:vO�d�3��i��&����-��?}�#s���K��$u����	'5��M�m�d�+u��oN>q>NX�!QB����9�c��h������j�8o�|�v�F��@�������E�7o������.����m�r��HC(�|�@+'cw���t
��9���2G^o��F�Tz�ii����p\��}���d�+����P72����4#86h�\��J^>�x��>��u�N�7��-���8N.�+�	�������1g��H�f�6���X1���nC=�����.)�n{Z������De%�:1s���s3���-����@�+�]�}�R!b/*��4�����*��,p�+�>��#��#E�n�$s����>��"���B����������n��PnN����\�(�o��[�Tg�_��"tf��J2����.��c9����=���n�
A���������R��	���F�G~&�!H�����+�e�
*<���0-ha:z����:����c�|0��yW��}���z2�+S|3{����7�]�0����i�r��ZF�s����"2���8���Lk�>�������	��Z����������	a�&?+H{c6)���mJ������9��������D
�nt��Y�F~9�U���BA(1r�oq|�oAa����O��ZG1�$���Z�b���!%R�pj�VA;�
>�0a����w���z�Ix\
��`Yn��-.������8������1�b���$4���y�Uq���3C���7h.E�P��Ir��!9 W�����O��fE�{s�
nN�b�v x\�uY	,1f�39���!��YMwE����V���;�a\�xq:;@�;@����Y�#�3�D����,'��[���,�m����;���X�?&���*��C�	��NC_����$��j����F���R��'a�+GS��{"���B�9^J;�������k
c�r�xX�'����>9��d��W����;�1�2�&�
�MY�5��T�a��� ��^�|U�~6��M��K���<-0�:)�<�����[3��G��k�{Q c��T��]������f���GP_��X��_�9��R|L���t8�A�%'j<�9���$#��Nss��0)+8�
��z���S�+�����������������/1U����f�m*n��9��C�a�����or�_�t�i��z�S�s6�M�cE�=UA�v}:^�"�G�C"�"#�,�]�'�M�S��c���4X{r{�H����JK�cp�|��4�s��3�4{��k�t��+w����r[Q3�d^���P�6����2!���A�[]�' E1M�`�xEG���`�f,�Y*j������4������Fz�_�n�!�P"�YY~�k~'=�9�����9���hw�g��/�<�"�H	yD�3� >;E�
/���%�������	3M�d^L,�M�v������)�������O`T�/��l� ���*0�PK���@jT@���/���������S�-������)jy���oT��|b�7'X��(���84B��q���%�����J��x9?�`���`\CDa'����S����~�����Q_^L5T!���3��B^��o�k�V������������!j3��J`C1If	A2�y=���L�!�9�'�Ai�~�������M�}�A��,����1��*OGs��W������������+S���hz�(#`����������]�r.i"����M�O���=���I����dj������*�KG���x���j�|���	� \�9��K4���F%c������|y�<�\uUoW�u*���6@�*�4�H�c�h*~eh�q��J)0�@���0�(��e�g+1�r���+��f����s�J`���Twf6����Y�/;w��>j����F���3M��a�Lg���:�,��8���>
p�E���2(���G�ey������]T����
f�z���ylN�:q���������o�#?�����2X��%��,]h-e2���:DQH����Mr���Bm(M� }-x3�u-�3m&��g+����(�(�&|�W�����]5Oj�\�9��C�H�Rd�U�5��%c�e����~R�;w�
v�$H�+�cy�[eiC��@AXv�!d\�I2�C�%�Y���1�{����@�F�|�p����?D}�`�c�v$vl�s�P}{Y���s��/8
���i��}��� �g�\�L��u���qa��M��)��/�T9��q'9��!��<e&�{iwu3��E������y��
w�"7��L����K�}�������]�x���@4�qk���?����,q�If�]��g:3���S����l+M��>����n+%	;�q�i��2�
:F[|�����>���e�U������T����zeV��b��p��Mg�y��^�T��c�'��������]=���W�xcQ ���A���c���}k���c����1�j�ZNI,�H9![�g�%�=7o�f_����;�/��{���^EI�!���2\���$|�����F$��Q��5_^��lg�@Hl��b�1�	�n�_?���x,��d_��Q
x���Rm�G�������M��J�7�}�H]����[Ox���0����$���T}�9�]u�E@��LNr�Q���ifz
���Y�*����8��9izteiE~�x������n�m�:./��SA���N����������i�����y��g�#	���p���$8�9�t
���M���x�����Q���{����"t��Td���("�N�]���2L�	8��r��&�Ct����!��,������iu�U!��N��D���{S_���z��C�~�B�z6�Z�
R.*l���
L�^w=�lq��J���6��J�2�e5H*V$��[-���h^�g�xc.��u��L���)�b�I�m��T���7W�?��	���c���w����E�����y�hY�����Y������b���o�_F�����$���j���M���`"�6<�B�{��_��'=)�d��VD����������7�TZ�l:\�����s�H:'�3���Xj��:(n��[i�����O�
X�����8�J;DG����<�"�?+��G�!5d�~oC�>��	q�kN2�r�W���!���)�@�.*��#�S~��b�3�ec9Ik��^�g�5���x���"��r8�{�J�Q,�6�5�i�j���i����3�8���1E�
 ��cm�r�����D��)
zchC3�����I����~�'W�l�.����F'���EFd���!9����5�k��,������r\�	-��J�d��(e~��J)c���>�����O�Ag���|j�Ks����gm�O�R�1�oqV�oB�}��3���T��w��yL7E���pk��_R��M3�BP�)+���N�.�k�K��������3�P�yGs���AM���Lh�������,-�eV9��-Xh���(	����0�q����+�h�q�"��4��iFQ0��J��	8O���l�Zy_] n�{K(S����5P`Lt��g=�E��?������oK�}���-�H9�Gv��[Gzo�%}��W>���N���0�QX�)S��H��k��z���p�u��:� J����9����f*���
o&G�NE��_[NW�IE�\b�|�\��A�4
�3pg�����v���]�M��b�Tl$�L�����x��F&����X^�����h3cf*x���O�����M����:?��{>������k�����L������c2����_��M�IG�L��F����9����������ap	X���n�/oL#dJ��T0��:�?=�	�7��X�@Q���`��|�������P�C���t�FrB�L������� ���7�
Q�<��W�<�NQ��-�*�	K��iZ#�#�$�*:R3�������R�)6�|F4�y��S��Z"ss�����b�m~'K�yh�,���$p��G��:�bg���qn@�>����q�J]*�]q�3)�],}by���]f
g��
�X���g�SQ�V3T�����B�|�h2��]���~����`�.m\*�F�a�cu�_l������5'O�I$��+�yVp����@�e�=Po��o����G����>"��@��F�� v@~XH����������Y���9G���~	����s�/e�C1��
���@�K�^�@�����/�
%��Q��9��
� ��TXM�6���p0�+{C��D�BiO�
����9�I��|��X�*������������z�r��9�0��o�?g�Q���Y%���c�=��~���|��U���3��P�[uRk�o��r=l!����Hg��jzM3KH*�����7���������WM\������z�8sS~L��l4�Oa���>���|��/����Q�E�����h�����>_�����x���F���>H�pv�22�����-�)�����8�O�w��0�9�6g��M������yX�=��E�U
������L
$�EB~�L�{��pm��G��'����_�?w��|�G�������������2�'��P���
�ZMJ!�{\5+�cN�������s���	g�j"��&���p�l�1�1\���7*S�������}�h�sw�'0�k�M�T&P�5�[�T`��k��\/�>q)��	�������Z�)r�1�I���������0�����a���1K$�6�0�g���f�>@�i�	����o^����|[w�L��N��e<�9������F
���3��'�
}S��r�s����h�G}����&�����_�6=(5u��Y������L�X����N0_�O:�����=�`J��%n�cG�cF�gdU�l�L(;t�
Qo�n�@�%5e��P\f���]q�;��V<���l���4A��+N���I�'
��p���n�!iy�_dY��+]��(�����2P:��W��hO����e�x)�O_hy<��z'=npi�;i*80�qsxQ�V��'j�������X�2����@�S��Hf�`�J�����:�;>I�z�yj�����9���r {�?uPE��cD����a[`FG�t%��0�����m�{u�\�_��|Y����7c6=���%[�J����ZhI������o�gY��Bi~=7�w�E����]��\�2ix~�"������)E��}.<��1W*���9�u.��Op�~fB�*���c�|�H�*�zBaL��dD�g	AS��V|���9X����������
�1��t2�h!�6)���b�t*:�����r���D����U���$Xv���'1�{�i��j���/��@�*�\Du���l�l����a�y��A�c����6z��wNX
c�r�������������LI�*��w������JS��������Xn����6|��\?��l#,EW�\g��Op���c�xW��{y����|���*���B�����49�>	aliP��)�����;�O�_�L�������4�3����8y��\bB$���@��N����u�Rl��,6�D�1�L��Z���b�������&�s,��_������/�f��#3�l�5��>�^��z_`c��4�i2�}Y��'}E���5��.u.�_��%�F�Iof���w�Vw��L�7������q��P>�^�0��e�#�r�<�B������`�g����*��W�����O���O���T���T7��G��A�G=�� �������jCY*�m�)\�SG�q��u^d��������h��&fQ���� �{��G��bB2�M.R��yvH����������E��if9����c����I�"2>q�+)�����>�I|��K����c������u{��.��������B�K
a���7����fl���5��p%�e�'7�{(�'��Z�nu�����A����
)��Z��u�_*4ZN>K��tW�����?<[�������]@L�p�X���>�g�I�p���E<�/�r������H8�(�z�<�H�tC`*Y�c��h������j�jPQ��x��\gJ!��$���<�A��s�>��.*51�X<)D�����p���L;,�:�����W/||$F�O�4<C�����+s���z�����Y*�/_����u��K����gB*��������L?R#�	H*��3;�X�B�a"��vB3E2��
�t�	���Z��p��[��I����l�Oe���|y|{�.�{2;�.�o���3����<��N3kq�z2�y����X��m�L!��/u,�fsj
��
f2	�D��lm4��{�\�����{'�%���^zIM#E*���}������{�r:�{���D�	�N��l&�	w�g���	��}�v������a����fG��cO�i�
.d��WrY|���c}4-�\q��]�\���uJ�{G���N�����j���r���q��V2�f(&��������g�H�G�}�Tlh2�?��}�v�� ���Qv�d�j���A.�0���*Q�	�Q"�mn�T����?���5j�7MA����Y���pG��F/\{�1�����������b�9�AlXd�#I)`|"��m������bx��:=���2.N-���}�5��8Q�+s�]}�B�\�E�>��-�����4w��7��#3�=�k��t����B0���3�:����^�m�`
���������(�i,��|},
ag�w��^|a�����M�e�Qi��8,sbG���$[���bGPcF>&�
��v��!�7>&Y�l��9.�����X>E&��:t��+�%��A][��"m��O�VB�qO��
7�����KnY���#�F$#��	U���C�{O�e�-�~]XKb���&���c�Vw���?��;K6��UiW�g�K��6�&�O�d���6����|��KI^H��Y��
���)6Z�2���>�������^��m>����T!��)s�,���ipgD��G�=���4E{��d���`i�p�+�e�T�d
C~,�[�4���\�z�w\�A\'�-n�%�.�������m�7�oc�T�lsc����V+	 ��C��,�d���s��C���~~xS?�[s������ #�g���RY�n���f1(n��H���X��G�Giy=U������uslcJ���r���@��3����&����j��RvM]����L��s����{�=��,z�#�����E�TFq�2R������RT�cbOsL��l�<����2��~�����X_g2n�����#2�@��N�s,M)�������������CB9���Xx�`"�<��|S$F�%���������1�N20{E�����1���-�l���J�G����r���������+�|vv{k�����(����~�A����&$'�S�@34�@�o�)�O`	e�(��l��@^�I}
S��A�^���~=7o�fo����7s������:��'1;W_��(�'H$7I��u��}�~na���h�Z�`�T��=um,�C
�%M�d>U\����d���uK����s&vP��i�%���L��=��\�$���V�F�\�2u(���l���8
����D����� �����a$_�����M}97�����)h����0.M@L�a+B�gR!�K���4��D�P�g��������{S��M}���Ho�����>y�.'���"��
r���4�u�R�@w�9���l+X�y����.64O�lkpq%-qo���A\Id ��E5�|����(y�3g�}����=���OD�d����(�����>���8E���?I���A�h�	(�Q}�M0��	�a�4����M���N������w�??v��s���r�!1}!��Pl�����D"�'r�����w���&>z��[�E�D[Rs�E���K�d\�,��~������a(S�
G���`}�8!BR�����:�v�n(?�Q�5��S��#��`��`y����U��g�%%6�d.3��<�%e���NA2H�O�`3�mr?;x8G5�D����-X�R�L�U����S���g2'A����sQ��sZ�H�]��~�Q����1����T8�o���M���N��\�<#L���4�������o��l�V��y����h�b��n>����u�S����jI&��>f*��R�^����>���Yj���L@�Y���GH,&���Z�i(���Q<�?�h[Z����zE��@8�?�������K��
�+����i�vj)��3gg<���X�a$���\�������c��[�����d�i 13����Z�,�����{S������W�{3\��ZH{
�k	o|���`���2u!7��B���;Y^�K�J�~{��3�m!4lE2v{���f������t]?�I�J���J$���m���z��K�����	���T����W��Uc:�����%]6�2u�x�AsR��r�$7e)OF�[�
�9���Cv����s�G��if&�D9
\����d���v�{l�;�F"�y
I|u>�u������X����O�(���^��������oqw��F4hm�m*2���k��p��auJ���e�\�:�{�m�MF�����z�=�?�(q���$�s�f`BkZ5Ke������W�j�Bs���N��a���3�E�Y��6��mW���n7��w+7"��Q\��4g�Sk@�� �)V�w��L�[�m���9s�=��Pw�3�.Tx�GN�B;�z��`vdz��$� ��'v���a�Kau�H��f�R�^�^�*f�����WX��-���,^p$�.s:�8[e�i�XujS~���B��B�A/g�&
��Ul8I���I�_w�g�-�1 BM>��XP����p�9���b?���Nmese_�^���wu��M�����@R���c�qp���������]v�#���@E4w63�:�\��<�p��9s&�	�7v9���n��q�o�j#d��uw}���Z+��&5-�L�N����7��\����h�z!BX��m�OM������bh��>���z��u�3�[�g
�\�+�yOa�����gb2��s/���n���9Jl
�S�3��Eax���Zf>liV���Mq,x�+��@�F:l~\��������Th��oL_��C��c����f�K!�@0��{[w/��V���(b�s��^�c��4�3�T<h��R���7gg0�WZ�RT�������g��E�\�����(f��A�����I�L������~ ��Pf��;|�t)<�&	t����9�:]�
O&P��\�me*z���0��Em�e��V����������n'�m�et[�m��9���N|�"�;��i�u��+�F��\qg@��	�����M�����-���[��n�m���l��C��dE��G�����&2��H@�;d�[;����,�
M���L���p�]O��F�TtX�|����'���u��L�� �Yd��(j��|���=�=�-���P��E����8=V2`���~���a�z�`��bm�}L���f����)�� ����o���	��n��U1��M�r�	��m-o���x���8���AAil�n��s$UsbBa��������;��9��h�����;	K�s3�K����yo���|�
@��#�(_.B�s�����r!3�S����El�x*����G����1�Rn<C�C�l2u����t���-�I�_*7}�z��?T��7������'t��Wkp;���0n'=���4R�������}>��������+@���T.}�8_Le����,}L1���C^S�O�����xCq��D���=����"0�?�����,�������������( 	����>�X,�����>����j��m]5���y6������:\�N�(���+H��3���J�T2��e{����sW?�O�+G�w�K��#�M\~x�y�9��t�����N��k���wM����
�����j?��z� 8l��to��zM'�	0f��!?������,S
[�oK�X��>����\r2������\p����8�����r�S	 �ei�����LE=��l�~2�L�%���M0� ����<Nf��~����+
�s>}�8�`!g�4�#��|�j���5����]��4�����l��=����n�I��%�k5���uy��1�L��}����1�"E�&# 
������!��w;:y$�?����K|W��cV�E���X�H$�])Y�@_��aJtFo��_�F:��$����d�|�� ��M�$��ygW��X�F���}�b��?��V��������������RFh�p:A-��H�������ZMfW=��2��.h�IFqd��x`����e�U���K���gKK����"�@F4l�JK�k��i�
d�G�8F\>�	�@@	���I�z���L�5uu�wew���c��I��F�V����)���]K$��p��3�����p��w���y��|bo��_�&v�6�/_���]o��~!|<��r�3�<vJ���F��I��N)�w�0���TN5��Y�:CA�)�eu+E�o�[0�����ros3LY2��5������Sr����M$�q�#��������Y��g��3������9��
a�p�9��Xr<��B=��,��xmr�X������%~�J=��F��^=����Y��-?f����>K���6�X�Y����/�>[9��D1���)����CY�\Lz%��;"i�X'U;���*��?�Y�'�rSgEp�c���K�9�U�����z��OQ���9;/���}��,X�a��� ��t��>Tr5�cWC\g����PL7C��;�bZ�s��:��<c������E�4��^��W���C��n�]�������� ���}��"��Kp0u���+��)
��.�Y�N�]�
�!)��^�V�s���i	�3�ns1���qt.��bv���T�B'dE����4&�������PRd����>��&���U�7�����i<���yq�����lG�b�=�fC0(${����=^?�i���	?�E@�D�E��=:���z���:p=��wDfV�|l�����EF��J�g�^�`��@z�yo04�2�Wr�9����T���R7��5����a���A8�% �R��d6�+$X��� sQ!m���
�\RB��J��Z����|�A���X�/������
k�\��R��?<���-�����/~��������)��o*H�|B�6/��G_�!r�O�"�N�!�H;t�C�����L��C��PP����)�@pY�:Z�!-�4��d��?q�������l��<6s$�� f�!���&"lX(�.~1����������d��0V��wA-�G���:9u~��c(��m���pf�)k���2���d�ak8c����?^�6
�$�IZ��r�eJb!3
�	��lw���n���BM/��J�c^����`7��7V��������&r��a����+����K]lX2������
��G���t��:V@`w�H���"�@������"����}Z��`��s0-�1�����������0���� ��L����I5�K^���������E2� `�k�������:v��u[I��fT�`!��i���>�S]���jS+c�oZybh��7����\���"���_���o��r`r�x��1������WMx	������I���5-��>u�Mo���ARS/�0���La7CR�1Fl�]��T������G��8��!�DL%':�s�#�4C��LxLG)9�S,�J��x.0�r&��x��r�T���1UG�P�C����@���z��i�_�wq���T��%%t�k~"��$R1�po����������zeAY�{�
�y"���:/�MS=�r:T�C�w)��)mt����I*��W�8��9wp��E�Ifb^�}�l��e{���c�;s�M
�8;r�l;�k� Ag��\.R��[����z�+B|�_�m�=�������q,��o"����n��)
,0�@�m%#��<��&�~x$Q�ZU:g�
��6}(b����}O!��}�}��+
~06H���<)�T�0�����T};���t���Cx!�n�.{4k1���.�>�NwmE� �1��	-���m�I�)J�s~�4yc[����������-N�XLn"��i��v�`(;�:CIK�����\DiLt��>�S������m���.�2�4��,�9m�z%m�x@���&����9�YJ$���idxD��y��x!U����7�2�<��\9�VA2���+P�m����c+g�o'�����N�f*���)�O�"���5Z;�KK���y{Vj��w��S�[���ro�'>�������X|}���6X��6��th����)�D�3�\�LG�'����A.j���e�	�m��u}=�@�b�.�X�H����C`]�=)�Ma��]��<���._{������<���L-gS����� �X����Aih[��~ ���|&gI�2�hHK��6p���l�"���3Xj�p��I�h����p��?��$�i��91����HIl����C��h�|�f�)�P"��c�G���2�I�{����]�����\�}��������j`���Z�s����s��<nm8[O��	�
X���LO����mq�<;��xZ�n?)��"�E�h+�f���jX9���Y���D������)z�NN��CD�3����P^��*�E�M(��0�?����y�����UN��'n��Er�����������4���W�z(�����b��vA)��<�bZ%��T�l�����������X����!���8}�<.���!���X�G�����f�Gv�C}6�O�q,���a�U�����x6�V���y����dn����������$��\��n��'���'������Y�z�q���l�|&[�75�����N��e+�^d+��h�F
��%�X��E�:{���G���'D�����AY#rn���A���V�0����I(�%���dS��p�l���������`�l�n�PWU5�����J(m�7�P>���B�����9�m9���LP!�xg��#58��F�#��&P9�}z��]�j���O�C*�`��(��2���l�	�����(��h��v�&�>�b�G��8���]d�g�����S\N��.��%�*��T\ot0-e����|�����Q�
���Ma~�������8�������0�;@l��1�pj�T�<L���_�mP
�s������k[�c^?f���uM:j
w+���)	N�I
��f
*�����KV��
�hSY�)�����~n���
SX����p���D�� �D1�j��\�v��x-Q{Ii\��#�`�!R�X�!c�r��~������	�Sq�s���GQW����s
��������D(�����4���e8�c�G'��F,�R=h�p��������U9��;�[������t�"��~��Ou/��z�B�1;�(�E<���0����dN�=�m�����<��F�>��w�O6=�+��Y����
�j���1]���f��6	d����E��� �������~vaK�|��
�s
�5���4SW�M�}����
F���K:�	�K�������'����r��5����JD`��R��i�)4�_������k���|[]O�;`���st������	���������@`�F�P�P�[K���,�e�������(DQ-��
7��ju8B>Z��I���/�����t$��~�`p��07��1�h�l������to@�FQG�#hk�����0����6+�������_�Tu��.-��B�FO8�<�I,MEwMA�$��������w�m_����z�-�T[�p�:Z|� 
'�~�81���1k�+N{������Qo>���]�Nj�tl��"���z*��������F����2��o�~2H�������7a"�Po7Vy����*���#8�7��r�n%����,M�FUQ���\M(
�������������J�����w����G�7i0>J�Cu=��q�$T*G�e�(��^E[�r�Dt\���A��[�<�l�'O�l�i�wM,�e�$f�s�U�c��>�\�>�c�l(�b������`��}b���;�e�&��b&��u�����9�������C�,�P�
>5DY���B@��g}��v�s�8P��I�
�.0�0I����0[�>�$|*lL���C%�4I�K��`&2�������~Ll���#�&�b�$��$��)�}�g�0�v=��F�z�T��l���zI��3�r�/Q:�����l4r����/�(m9�CQg��?���6}�G��0��bR���$F�Y6Yu��%J���^����9F�����_rv���,*Q� S)h����)6���Ad0�����8�?W�,�O~,.�|��K���t����Q�"�%�m}������O
 �����k&rR�l��|VhR�����Y��g�
 ���b_����~�=e}D}����$��3���P����Y�k��E���c�M��C��X��u�T�&Sjc�~��\��/PKBx`��%PK���MBx`��%  ��logical_repl_worker_new_perf.svgUT
u�\��\u�\ux��PKnd�
0009-Fix-worker-historic-MVCC-visibility-rules-subxacts-s.patchtext/x-patch; name=0009-Fix-worker-historic-MVCC-visibility-rules-subxacts-s.patchDownload
From 0a7cbaeaa179bde9dd5146e60f7b5c52b6c34899 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov <alex.lumir@gmail.com>
Date: Mon, 17 Dec 2018 15:43:13 +0300
Subject: [PATCH] Fix worker, historic MVCC visibility rules, subxacts, schema
 send, tests

---
 doc/src/sgml/logicaldecoding.sgml             |  2 +-
 .../replication/logical/reorderbuffer.c       | 61 +++++++++---
 src/backend/replication/logical/worker.c      | 25 ++---
 src/backend/replication/pgoutput/pgoutput.c   | 20 +++-
 src/backend/replication/walsender.c           |  2 +-
 src/backend/utils/time/tqual.c                | 16 ++-
 src/include/replication/reorderbuffer.h       |  5 +
 ..._stream_simple.pl => 011_stream_simple.pl} |  2 +-
 ...tream_subxact.pl => 012_stream_subxact.pl} |  2 +-
 .../{011_stream_ddl.pl => 013_stream_ddl.pl}  |  2 +-
 .../subscription/t/014_stream_tough_ddl.pl    | 98 +++++++++++++++++++
 ...t_abort.pl => 015_stream_subxact_abort.pl} |  2 +-
 ...ort.pl => 016_stream_subxact_ddl_abort.pl} |  2 +-
 13 files changed, 200 insertions(+), 39 deletions(-)
 rename src/test/subscription/t/{009_stream_simple.pl => 011_stream_simple.pl} (98%)
 rename src/test/subscription/t/{010_stream_subxact.pl => 012_stream_subxact.pl} (98%)
 rename src/test/subscription/t/{011_stream_ddl.pl => 013_stream_ddl.pl} (98%)
 create mode 100644 src/test/subscription/t/014_stream_tough_ddl.pl
 rename src/test/subscription/t/{012_stream_subxact_abort.pl => 015_stream_subxact_abort.pl} (97%)
 rename src/test/subscription/t/{013_stream_subxact_ddl_abort.pl => 016_stream_subxact_ddl_abort.pl} (97%)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 3571f96a8d..dbec2e4ef7 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1057,7 +1057,7 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
    </para>
 
    <para>
-    Similarly to spill-to-disk behavior, sStreaming is triggered when the total
+    Similarly to spill-to-disk behavior, streaming is triggered when the total
     amount of changes decoded from the WAL (for all in-progress transactions)
     exceeds limit defined by <varname>logical_work_mem</varname> setting. At
     that point the largest toplevel transaction (measured by amount of memory
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1c394296ac..2200332999 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1744,9 +1744,10 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		}
 		else
 		{
+			// TOCHECK: Is the second assert actually necessary?
 			Assert(ent->cmin == change->data.tuplecid.cmin);
-			Assert(ent->cmax == InvalidCommandId ||
-				   ent->cmax == change->data.tuplecid.cmax);
+			// Assert(ent->cmax == InvalidCommandId ||
+			// 	   ent->cmax == change->data.tuplecid.cmax);
 
 			/*
 			 * if the tuple got valid in this transaction and now got deleted
@@ -2881,6 +2882,9 @@ ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	for (i = 0; i < txn->ninvalidations; i++)
 		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+
+	/* Invalidate current schema as well */
+	txn->is_schema_sent = false;
 }
 
 /*
@@ -2895,6 +2899,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * We read catalog changes from WAL, which are not yet sent, so
+	 * invalidate current schema in order output plugin can resend
+	 * schema again.
+	 */
+	txn->is_schema_sent = false;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+	{
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		txn->toptxn->is_schema_sent = false;
+	}
 }
 
 /*
@@ -3476,7 +3497,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * using snapshot half-way through the subxact.
 		 */
 		command_id = txn->command_id;
-		snapshot_now = txn->snapshot_now;
+
+		/*
+		 * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+		 * information about subtransactions, which could arrive after streaming start.
+		 */
+		if (!txn->is_schema_sent)
+			snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+												txn, command_id);
+		// snapshot_now = txn->snapshot_now;
 	}
 
 	/*
@@ -3522,7 +3551,7 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			/*
 			 * Enforce correct ordering of changes, merged from multiple
 			 * subtransactions. The changes may have the same LSN due to
-			 * MULTI_INSERT xllog records.
+			 * MULTI_INSERT xlog records.
 			 */
 			if (prev_lsn != InvalidXLogRecPtr)
 				Assert(prev_lsn <= change->lsn);
@@ -3731,6 +3760,11 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 						snapshot_now = change->data.snapshot;
 					}
 
+					/*
+					 * TOCHECK: Snapshot changed, then invalidate current schema to reflect
+					 * possible catalog changes.
+					 */
+					txn->is_schema_sent = false;
 
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
@@ -3868,7 +3902,7 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 1 : 0;
 	rb->streamBytes += txn->size;
 
-	elog(WARNING, "updating stream stats %p %ld %ld %ld",
+	elog(INFO, "updating stream stats %p %ld %ld %ld",
 		 rb, rb->streamCount, rb->streamTxns, txn->size);
 
 	/*
@@ -4950,7 +4984,7 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 							  CommandId *cmin, CommandId *cmax)
 {
 	ReorderBufferTupleCidKey key;
-	ReorderBufferTupleCidEnt *ent;
+	ReorderBufferTupleCidEnt *ent = NULL;
 	ForkNumber	forkno;
 	BlockNumber blockno;
 	bool		updated_mapping = false;
@@ -4974,11 +5008,16 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 					&key.tid);
 
 restart:
-	ent = (ReorderBufferTupleCidEnt *)
-		hash_search(tuplecid_data,
-					(void *) &key,
-					HASH_FIND,
-					NULL);
+	/*
+	 * TOCHECK: If tuplecid_data is NULL, then we are not able to resolve cmin/cmax,
+	 * so try to update mappings and return false.
+	 */
+	if (tuplecid_data != NULL)
+		ent = (ReorderBufferTupleCidEnt *)
+			hash_search(tuplecid_data,
+						(void *) &key,
+						HASH_FIND,
+						NULL);
 
 	/*
 	 * failed to find a mapping, check whether the table was rewritten and
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index eaefba2049..adf69a5a38 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2444,10 +2444,21 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 	{
 		MemoryContext	oldcxt;
 
-		stream_cleanup_files(subid, xid);
+		/*
+		 * TOCHECK: If nxids=0, then we have nothing to clean up.
+		 */
+		if (nxids > 0)
+			stream_cleanup_files(subid, xid);
 
 		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
 
+		/* TOCHECK: Initialize xids array if it is the first run. */
+		if (xids == NULL)
+		{
+			maxnxids = 64;
+			xids = palloc(maxnxids * sizeof(TransactionId));
+		}
+
 		/*
 		 * We need to remember the XIDs we spilled to files, so that we can
 		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
@@ -2462,16 +2473,8 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 		 */
 		if (nxids == maxnxids)	/* array of XIDs is full */
 		{
-			if (!xids)
-			{
-				maxnxids = 64;
-				xids = palloc(maxnxids * sizeof(TransactionId));
-			}
-			else
-			{
-				maxnxids = 2 * maxnxids;
-				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
-			}
+			maxnxids = 2 * maxnxids;
+			xids = repalloc(xids, maxnxids * sizeof(TransactionId));
 		}
 
 		xids[nxids++] = xid;
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 04432ffb57..d5a3cf6308 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -370,7 +370,7 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
 				  TransactionId topxid, TransactionId xid,
-				  Relation relation, RelationSyncEntry *relentry)
+				  Relation relation, RelationSyncEntry *relentry, ReorderBufferTXN *txn)
 {
 	bool	schema_sent = relentry->schema_sent;
 
@@ -381,7 +381,15 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 	 * that we don't know at this point.
 	 */
 	if (in_streaming)
-		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	{
+		/*
+		 * TOCHECK: We have to send schema after each catalog change and it may 
+		 * occur when streaming already started, so we have to track new catalog 
+		 * changes somehow.
+		 */
+		schema_sent = txn->is_schema_sent;
+		// schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
 
 	if (!schema_sent)
 	{
@@ -415,7 +423,9 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->xid = xid;
 
 		if (in_streaming)
-			set_schema_sent_in_streamed_txn(relentry, topxid);
+			/* TOCHECK: Maybe change flag location? */
+			txn->is_schema_sent = true;
+			// set_schema_sent_in_streamed_txn(relentry, topxid);
 		else
 			relentry->schema_sent = true;
 	}
@@ -479,7 +489,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, topxid, xid, relation, relentry);
+	maybe_send_schema(ctx, topxid, xid, relation, relentry, txn);
 
 	/* Send the data */
 	switch (change->action)
@@ -569,7 +579,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, topxid, xid, relation, relentry);
+		maybe_send_schema(ctx, topxid, xid, relation, relentry, txn);
 	}
 
 	if (nrelids > 0)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index be92f3e5f5..de709250c0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3621,7 +3621,7 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->streamCount = rb->streamCount;
 	MyWalSnd->streamBytes = rb->streamBytes;
 
-	elog(WARNING, "UpdateSpillStats: updating stats %p %ld %ld %ld %ld %ld %ld",
+	elog(INFO, "UpdateSpillStats: updating stats %p %ld %ld %ld %ld %ld %ld",
 		 rb, rb->spillTxns, rb->spillCount, rb->spillBytes,
 			 rb->streamTxns, rb->streamCount, rb->streamBytes);
 
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index f7c4c9188c..b905077164 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -1692,8 +1692,12 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * TOCHECK: If we accidentally see a tuple from our transaction, but cannot resolve its
+		 * cmin, so probably it is from the future, thus drop it.
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1763,10 +1767,12 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * TOCHECK: If we accidentally see a tuple from our transaction, but cannot resolve its
+		 * cmax or cmax == InvalidCommandId, so probably it is still valid, thus accept it.
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bf5766c1e4..29dba6673c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,11 @@ typedef struct ReorderBufferTXN
 	/* In case of 2PC we need to pass GID to output plugin */
 	char		 *gid;
 
+	/*
+	 * Do we need to send schema for this transaction in output plugin?
+	 */
+	bool		is_schema_sent;
+
 	/*
 	 * Toplevel transaction for this subxact (NULL for top-level).
 	 */
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/011_stream_simple.pl
similarity index 98%
rename from src/test/subscription/t/009_stream_simple.pl
rename to src/test/subscription/t/011_stream_simple.pl
index 4d01f7e5ec..f0aae1041a 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/011_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/012_stream_subxact.pl
similarity index 98%
rename from src/test/subscription/t/010_stream_subxact.pl
rename to src/test/subscription/t/012_stream_subxact.pl
index 1a8b8ffe9e..00dd60b91f 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/012_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/013_stream_ddl.pl
similarity index 98%
rename from src/test/subscription/t/011_stream_ddl.pl
rename to src/test/subscription/t/013_stream_ddl.pl
index 04af0900ac..ecaf4383b1 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/013_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/014_stream_tough_ddl.pl b/src/test/subscription/t/014_stream_tough_ddl.pl
new file mode 100644
index 0000000000..02969c7260
--- /dev/null
+++ b/src/test/subscription/t/014_stream_tough_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/015_stream_subxact_abort.pl
similarity index 97%
rename from src/test/subscription/t/012_stream_subxact_abort.pl
rename to src/test/subscription/t/015_stream_subxact_abort.pl
index 6fecfe6fe7..dcec081fa1 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/015_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/016_stream_subxact_ddl_abort.pl
similarity index 97%
rename from src/test/subscription/t/013_stream_subxact_ddl_abort.pl
rename to src/test/subscription/t/016_stream_subxact_ddl_abort.pl
index 50990c170c..41ad2b668c 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/016_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
-- 
2.17.1

#54Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexey Kondratov (#53)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi Alexey,

Thanks for the thorough and extremely valuable review!

On 12/17/18 5:23 PM, Alexey Kondratov wrote:

Hi Tomas,

This new version is mostly just a rebase to current master (or almost,
because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
but it also addresses the new stuff committed since last version (most
importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
subxact assignments, where the assignment was included in records with
XID=0, essentially failing to track the subxact properly.

I started reviewing your patch about a month ago and tried to do an
in-depth review, since I am very interested in this patch too. The new
version does not apply to master at 29180e5d78, but everything is OK
after applying the 2pc patch first. Anyway, I guess this may complicate
further testing and review, since any potential reviewer has to take
both patches into account at once. The previous version applied to
master and worked fine for me on its own (except for a few
patch-specific issues, which I try to explain below).

I agree it's somewhat annoying, but I don't think there's a better way,
unfortunately. Decoding in-progress transactions does require safe
handling of concurrent aborts, so it has to be committed after the 2pc
decoding patch (which makes that possible). But the 2pc patch also
touches the same places as this patch series (it reworks the reorder
buffer for example).

Patch review
========

First of all, I want to say thank you for the huge amount of work done.
Here are some problems I have found and hopefully fixed with my
additional patch (attached; it should apply on top of the last commit
of your newest patch version):

1) The most important issue is that your tap tests were broken: the
"WITH (streaming=true)" option was missing from the CREATE SUBSCRIPTION
statement, so the tests exercised the spilling mechanism rather than
streaming.
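
For reference, the fixed statement from the attached patch looks like
this (the connection string is abbreviated here for readability):

    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=... dbname=postgres application_name=tap_sub'
        PUBLICATION tap_pub
        WITH (streaming=true);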

D'oh!

2) After fixing the tests, the first one (simple streaming) immediately
fails because of a segmentation fault in the logical replication
worker. At stream start the worker calls stream_cleanup_files inside
stream_open_file while nxids is still zero, so nxids goes negative and
everything crashes. Something similar may happen with the xids array,
so I added two checks there.

3) The next problem is much more critical and is dedicated to historic
MVCC visibility rules. Previously, walsender was starting to decode
transaction on commit and we were able to resolve all xmin, xmax,
combocids to cmin/cmax, build tuplecids hash and so on, but now we start
doing all these things on the fly.

Thus, a rather difficult situation arises: HeapTupleSatisfiesHistoricMVCC
tries to validate catalog tuples that are still in the future relative
to the decoder's current position inside the transaction. E.g. we may
want to resolve the cmin/cmax of a tuple that was created with cid 3
and deleted with cid 5, while we are currently at cid 4, so our
tuplecids hash is not complete enough to handle such a case.
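
A minimal sketch of the kind of command-id sequence described above,
assuming for simplicity that each statement consumes one command id
(the table and column names are illustrative):

BEGIN;
INSERT INTO t VALUES (1);          -- cid 0
INSERT INTO t VALUES (2);          -- cid 1
INSERT INTO t VALUES (3);          -- cid 2
ALTER TABLE t ADD COLUMN b int;    -- catalog tuple created: cmin = 3
INSERT INTO t VALUES (4, 4);       -- cid 4: decoding this change needs
                                   -- the catalog tuple from cid 3
ALTER TABLE t DROP COLUMN b;       -- same catalog tuple deleted: cmax = 5
COMMIT;

If the decoder processes the cid 4 change mid-transaction, the cmax set
at cid 5 is not in the tuplecids hash yet.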

Damn it! I ran into those two issues some time ago and fixed them, but
I forgot to merge the fixes into the patch. I'll merge those fixes,
compare them to your proposed fix, and send a new version tomorrow.

4) There was a problem with marking the top-level transaction as having
catalog changes if one of its subtransactions has them. It was causing
a problem with DDL statements just after a subtransaction start
(savepoint), so data from new columns was not replicated.

5) A similar issue with schema send. You send the schema only once per
sub/transaction (IIRC), while we have to update the schema on each
catalog change: invalidation execution, snapshot rebuild, adding new
tuple cids. So I ended up adding an is_schema_send flag to
ReorderBufferTXN, since it is easy to set it inside the RB and read it
in the output plugin. Probably we have to choose a better place for
this flag.

Hmm. Can you share an example of how to trigger these issues?

6) To better handle all these tricky cases I added a new TAP test,
014_stream_tough_ddl.pl, which consists of a really tough combination
of DDL, DML, savepoints and ROLLBACK/RELEASE in a single transaction.

Thanks!

I marked all my fixes and every questionable place with a comment and a
"TOCHECK:" label for easy search. Removing pretty much any of these
fixes makes the tests fail due to a segmentation fault or a replication
mismatch. Though I mostly read and tested the old version of the patch,
after a quick look it seems that all these fixes are applicable to the
new version as well.

Thanks. I'll go through your patch tomorrow.

Performance
========

I have also performed a series of performance tests and found that the
patch adds a huge overhead in the case of a large transaction
consisting of many small rows, e.g.:

CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double
precision);

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
SELECT round(random()*10), random(), random()*142
FROM generate_series(1, 1000000) s(i);

Execution Time: 2407.709 ms
Total Time: 11494.238 ms (00:11.494)

With synchronous_standby_names and a 64 MB logical_work_mem it takes up
to 5x longer, while without the patch it is about 2x. Thus, logical
replication streaming is roughly 4x slower for similar transactions.

However, dealing with large transactions consisting of a small number of
large rows is much better:

CREATE TABLE large_text (t TEXT);

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 125);

Execution Time: 3545.642 ms
Total Time: 7678.617 ms (00:07.679)

It is around the same 2x as without the patch. If someone is
interested, I have also attached flame graphs of the walsender and the
logical replication worker taken while processing the first (numeric)
transaction.

Interesting. Any idea where the extra overhead in this particular case
comes from? It's hard to deduce that from a single flame graph when I
don't have anything to compare it with (i.e. the flame graph for the
"normal" case).

I'll investigate this (probably not this week), but in general it's good
to keep in mind a couple of things:

1) Some overhead is expected, due to doing things incrementally.

2) The memory limit should be set to a sufficiently high value so that
it is hit only infrequently (see the configuration sketch after this
list).

3) And when the limit is actually hit, it's an alternative to spilling
large amounts of data locally (to disk) or incurring significant
replication lag later.
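
To make that concrete, here is a minimal configuration sketch, assuming
the patched server: logical_work_mem is the GUC added by this patch
series, streaming=true is the subscription option from the TAP test
diff above, and the connection string is illustrative:

-- publisher: allow up to 64MB of decoded changes in the reorder buffer
-- before streaming (or spilling) kicks in; assumes the GUC can be set
-- via ALTER SYSTEM and picked up on reload
ALTER SYSTEM SET logical_work_mem = '64MB';
SELECT pg_reload_conf();

-- subscriber: opt in to streaming of in-progress transactions
CREATE SUBSCRIPTION tap_sub
  CONNECTION 'host=publisher dbname=postgres application_name=tap_sub'
  PUBLICATION tap_pub
  WITH (streaming = true);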

So I'm not particularly worried, but I'll look into that. I'd be much
more worried if there was measurable overhead in cases when there's no
streaming happening (either because it's disabled or the memory limit
was not hit).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#55Alexey Kondratov
a.kondratov@postgrespro.ru
In reply to: Tomas Vondra (#54)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 18.12.2018 1:28, Tomas Vondra wrote:

4) There was a problem with marking the top-level transaction as having
catalog changes if one of its subtransactions has them. It was causing
a problem with DDL statements just after a subtransaction start
(savepoint), so data from new columns was not replicated.

5) A similar issue with schema send. You send the schema only once per
sub/transaction (IIRC), while we have to update the schema on each
catalog change: invalidation execution, snapshot rebuild, adding new
tuple cids. So I ended up adding an is_schema_send flag to
ReorderBufferTXN, since it is easy to set it inside the RB and read it
in the output plugin. Probably we have to choose a better place for
this flag.

Hmm. Can you share an example of how to trigger these issues?

Test cases inside 014_stream_tough_ddl.pl and the old ones (with the
streaming=true option added) should reproduce all these issues. In
general, it happens in a txn like:

INSERT
SAVEPOINT
ALTER TABLE ... ADD COLUMN
INSERT

then the second insert may see an old version of the catalog.
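
A concrete, runnable version of that schematic might look like this
(the table and column names are illustrative, and the table is assumed
to already be part of the publication):

BEGIN;
INSERT INTO stream_test (a) VALUES (1);
SAVEPOINT s1;
ALTER TABLE stream_test ADD COLUMN b int;
-- with the bug, this change may be decoded against the pre-ALTER
-- catalog, so the value of b never reaches the subscriber
INSERT INTO stream_test (a, b) VALUES (2, 2);
RELEASE SAVEPOINT s1;
COMMIT;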

Interesting. Any idea where does the extra overhead in this particular
case come from? It's hard to deduce that from the single flame graph,
when I don't have anything to compare it with (i.e. the flame graph for
the "normal" case).

I guess the bottleneck is in disk operations. You can check the
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in total. For comparison,
please see the attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);

Execution Time: 44519.816 ms
Time: 98333.642 ms (01:38.334)

where disk IO is only ~7-8% in total. So we get very roughly the same
~4-5x performance drop here. JFYI, I am using a machine with an SSD for
these tests.

Therefore, you could probably write changes on the receiver in bigger
chunks, rather than each change separately.

So I'm not particularly worried, but I'll look into that. I'd be much
more worried if there was measurable overhead in cases when there's no
streaming happening (either because it's disabled or the memory limit
was not hit).

What I have also just found is that if a table row is large enough to
be TOASTed, e.g.:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

then the logical_work_mem limit is not hit and we neither stream nor
spill this transaction to disk, even though it is still large. In
contrast, the transaction above (with 1000000 smaller rows), being
comparable in size, is streamed. I am not sure that it is easy to add
proper accounting for TOAST-able columns, but it is worth it.
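
A quick way to see the imbalance, using only stock PostgreSQL
functions: for a value pushed out of line, the heap tuple itself
carries just an ~18-byte TOAST pointer, which is presumably all the
memory accounting currently sees, while the real payload lives in the
TOAST table:

SELECT pg_column_size(t) AS stored_bytes,  -- (possibly compressed)
                                           -- TOAST storage size
       octet_length(t)   AS logical_bytes  -- full detoasted length
FROM large_text
LIMIT 1;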

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

Attachments:

logical_repl_worker_text_stream_perf.zipapplication/zip; name=logical_repl_worker_text_stream_perf.zipDownload
#56Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexey Kondratov (#55)
8 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi Alexey,

Attached is an updated version of the patches, with all the fixes I've
made in the meantime. I believe it should fix at least some of the
issues you reported - certainly the problem with stream_cleanup_files,
but perhaps some of the other issues too.

I'm a bit confused by the changes to the TAP tests. Per the patch
summary, some .pl files get renamed (not sure why), a new one is added,
etc. So I've instead enabled streaming subscriptions in all tests,
which with this patch produces two failures:

Test Summary Report
-------------------
t/004_sync.pl (Wstat: 7424 Tests: 1 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 7 tests but ran 1.
t/011_stream_ddl.pl (Wstat: 256 Tests: 2 Failed: 1)
Failed test: 2
Non-zero exit status: 1

So yeah, there's more stuff to fix. But I can't directly apply your
fixes because the updated patches are somewhat different.

On 12/18/18 3:07 PM, Alexey Kondratov wrote:

On 18.12.2018 1:28, Tomas Vondra wrote:

4) There was a problem with marking the top-level transaction as having
catalog changes if one of its subtransactions has them. It was causing
a problem with DDL statements just after a subtransaction start
(savepoint), so data from new columns was not replicated.

5) A similar issue with schema send. You send the schema only once per
sub/transaction (IIRC), while we have to update the schema on each
catalog change: invalidation execution, snapshot rebuild, adding new
tuple cids. So I ended up adding an is_schema_send flag to
ReorderBufferTXN, since it is easy to set it inside the RB and read it
in the output plugin. Probably we have to choose a better place for
this flag.

Hmm. Can you share an example of how to trigger these issues?

Test cases inside 014_stream_tough_ddl.pl and the old ones (with the
streaming=true option added) should reproduce all these issues. In
general, it happens in a txn like:

INSERT
SAVEPOINT
ALTER TABLE ... ADD COLUMN
INSERT

then the second insert may see an old version of the catalog.

Yeah, that's the issue I discovered before and thought had been fixed.

Interesting. Any idea where the extra overhead in this particular case
comes from? It's hard to deduce that from a single flame graph when I
don't have anything to compare it with (i.e. the flame graph for the
"normal" case).

I guess the bottleneck is in disk operations. You can check the
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in total. For comparison,
please see the attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);

Execution Time: 44519.816 ms
Time: 98333.642 ms (01:38.334)

where disk IO is only ~7-8% in total. So we get very roughly the same
~4-5x performance drop here. JFYI, I am using a machine with an SSD for
these tests.

Therefore, you could probably write changes on the receiver in bigger
chunks, rather than each change separately.

Possibly. I/O is certainly a possible culprit, although we should be
using buffered I/O and there certainly are no fsyncs here. So I'm not
sure why it would be cheaper to do the writes in batches.

BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, so it's difficult to tell?

So I'm not particularly worried, but I'll look into that. I'd be much
more worried if there was measurable overhead in cases when there's no
streaming happening (either because it's disabled or the memory limit
was not hit).

What I have also just found is that if a table row is large enough to
be TOASTed, e.g.:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

then the logical_work_mem limit is not hit and we neither stream nor
spill this transaction to disk, even though it is still large. In
contrast, the transaction above (with 1000000 smaller rows), being
comparable in size, is streamed. I am not sure that it is easy to add
proper accounting for TOAST-able columns, but it is worth it.

That's certainly strange and possibly a bug in the memory accounting
code. I'm not sure why that would happen, though, because TOAST data
look just like regular INSERT changes. Interesting. I wonder if it's
already fixed in this updated version, but it's a bit too late to
investigate that today.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Add-logical_work_mem-to-limit-ReorderBuffer-20181219.patch.gzapplication/gzip; name=0001-Add-logical_work_mem-to-limit-ReorderBuffer-20181219.patch.gzDownload
0002-Immediately-WAL-log-assignments-20181219.patch.gzapplication/gzip; name=0002-Immediately-WAL-log-assignments-20181219.patch.gzDownload
0003-Issue-individual-invalidations-with-wal_lev-20181219.patch.gzapplication/gzip; name=0003-Issue-individual-invalidations-with-wal_lev-20181219.patch.gzDownload
0004-Extend-the-output-plugin-API-with-stream-me-20181219.patch.gzapplication/gzip; name=0004-Extend-the-output-plugin-API-with-stream-me-20181219.patch.gzDownload
0005-Implement-streaming-mode-in-ReorderBuffer-20181219.patch.gzapplication/gzip; name=0005-Implement-streaming-mode-in-ReorderBuffer-20181219.patch.gzDownload
�1��9�J!�v"�1���FU�l�z���Epy��sN�S�1�qF�/U�Q&U$���D_7=�+�8V*� #@e��9����&��?���0���3J��<�x�y�g�+c��k%�K!v0�%q,�'H@p I��K����<jp/)�8W��\}��nQ�s!�����!x_39��g�'$���j/�c�-��V��������y.�����%�a��{��*1�M���n~GVZG����F�.�����?d�j-�B��i��+��h,�����7�_��^��?�c/�^Vc���\f�n�c'������������}�Jl��5+"�G�y�n��Z�R��y�v_���
O�]��j�w������������~�����Uj4�*O�t�������'�H������^�#?�����?p��|�����V�'g��w�z����$\����_������O�����O��kpd?���}����s�|�����I��_J�#��v������]���Yxr��$�����X���Aom��w�$����j}���������M}�k����z����8��	�-9�C��A*]r� �+����7��mee�7_���@W�{N�����m/d�
'���H4����h�]=^*>����u����fc���F_eYp�)�0�[f_!�$���6pR�Sh���l�k\
�:	c����!���7;k�����r��4@/:����i����B�$�sQ�W1��I��	{X$?�R�����Q_�,���;����N�eH_�J�{��9�k�+W&���af2��& ���'��R���T�#^��������T��aPIrE&fZk<p�VL�;�^���=�el'��
�h�8��f�d���>U
*�kF�s_��M,��-��M�|���MM0J_�-@f�m�*����D����qNKU����C;���3|8>����Z9j?�`�\i��~.C�q��J���1�?
���������+�]x��b�@X�(.�G���"PV=K�$M�S���I�5=d���&�{�T�_��
XR��+���#)Td���4D��sp�4����JND�e�����[X<����W��Y'����&K�h[p��RbL� 7)�'c�Yd�E58�[��j���KS����X���Q�w ��&����J5@�h�,�hm�����Y[�7���DR�
0006-Add-support-for-streaming-to-built-in-repli-20181219.patch.gzapplication/gzip; name=0006-Add-support-for-streaming-to-built-in-repli-20181219.patch.gzDownload
0007-Track-statistics-for-streaming-spilling-20181219.patch.gzapplication/gzip; name=0007-Track-statistics-for-streaming-spilling-20181219.patch.gzDownload
0008-BUGFIX-set-final_lsn-for-subxacts-before-cl-20181219.patch.gzapplication/gzip; name=0008-BUGFIX-set-final_lsn-for-subxacts-before-cl-20181219.patch.gzDownload
#57Alexey Kondratov
a.kondratov@postgrespro.ru
In reply to: Tomas Vondra (#56)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi Tomas,

I'm a bit confused by the changes to TAP tests. Per the patch summary,
some .pl files get renamed (not sure why), a new one is added, etc.

I added a new TAP test case, added the streaming=true option inside the
old stream_* tests, and incremented the streaming test numbers (+2)
because of the collision between 009_matviews.pl / 009_stream_simple.pl
and 010_truncate.pl / 010_stream_subxact.pl. At least in the previous
version of the patch they were under the same numbers. Nothing special,
but for simplicity, please find my new TAP test attached separately.

So I've instead enabled streaming subscriptions in all tests, which with
this patch produces two failures:

Test Summary Report
-------------------
t/004_sync.pl (Wstat: 7424 Tests: 1 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 7 tests but ran 1.
t/011_stream_ddl.pl (Wstat: 256 Tests: 2 Failed: 1)
Failed test: 2
Non-zero exit status: 1

So yeah, there's more stuff to fix. But I can't directly apply your
fixes because the updated patches are somewhat different.

Fixes should apply cleanly to the previous version of your patch. Also,
I am not sure that it is a good idea to simply enable streaming
subscriptions in all tests (e.g. the pre-streaming-patch t/004_sync.pl),
since then they do not exercise the non-streaming code path.

Interesting. Any idea where the extra overhead in this particular case
comes from? It's hard to deduce that from a single flame graph when I
don't have anything to compare it with (i.e. the flame graph for the
"normal" case).

I guess the bottleneck is in disk operations. You can check the
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) together take around 35% of CPU time. For comparison,
please see the attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk I/O is only ~7-8% in total. So we get very roughly the same
~x4-5 performance drop here. JFYI, I am using a machine with an SSD for tests.

Therefore, you could probably write changes on the receiver in bigger
chunks, rather than writing each change separately.
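
Just to illustrate what I mean by bigger chunks (a minimal sketch of the
hypothesis, not actual patch code; BATCH_SIZE and the function name are
made up, and each change is assumed to fit in the buffer):

/*
 * Sketch of the batching idea: accumulate changes in a local buffer and
 * issue one write() per filled buffer instead of one per change.
 */
#include <string.h>
#include <unistd.h>

#define BATCH_SIZE (64 * 1024)

static char   batch[BATCH_SIZE];
static size_t batch_used = 0;

static void
write_change_batched(int fd, const char *change, size_t len)
{
    /* Flush the accumulated batch once the next change would overflow it. */
    if (batch_used + len > BATCH_SIZE)
    {
        write(fd, batch, batch_used);   /* error handling omitted */
        batch_used = 0;
    }
    memcpy(batch + batch_used, change, len);
    batch_used += len;
}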

Possibly. I/O is certainly a possible culprit, although we should be
using buffered I/O and there certainly are no fsyncs here. So I'm not
sure why it would be cheaper to do the writes in batches.

BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?

I run this on a single machine, but the walsender and worker each
utilize almost 100% of a CPU all the time, and on the apply side I/O
syscalls take about 1/3 of CPU time. I am still not sure, but to me this
result links the performance drop to problems on the receiver side.

Writing in batches was just a hypothesis, and to validate it I performed
a test with a large transaction consisting of a smaller number of wide
rows. That test does not exhibit any significant performance drop even
though it was streamed too, so the hypothesis seems valid. Anyway, I do
not have other reasonable ideas besides that right now.

Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

Attachments:

0xx_stream_tough_ddl.plapplication/x-perl; name=0xx_stream_tough_ddl.plDownload
#58Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexey Kondratov (#57)
10 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi,

Attached is an updated patch series, merging fixes and changes to TAP
tests proposed by Alexey. I've merged the fixes into the appropriate
patches, and I've kept the TAP changes / new tests as separate patches
towards the end of the series.

I'm a bit unhappy with two aspects of the current patch series:

1) We now track schema changes in two ways - using the pre-existing
schema_sent flag in RelationSyncEntry, and the (newly added) flag in
ReorderBuffer. While those flags are used for regular vs. streamed
transactions, fundamentally they track the same thing, so having two
competing mechanisms seems like a bad idea. I'm not sure what the best
way to resolve this is, though.

2) We've removed quite a few asserts, particularly ones ensuring the
sanity of cmin/cmax values. To some extent that's expected, because
allowing decoding of in-progress transactions relaxes some of those
rules. But I'd be much happier if some of those asserts could be
reinstated, even if only in a weaker form.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Add-logical_work_mem-to-limit-ReorderBuffer-20190114.patch.gzapplication/gzip; name=0001-Add-logical_work_mem-to-limit-ReorderBuffer-20190114.patch.gzDownload
0002-Immediately-WAL-log-assignments-20190114.patch.gzapplication/gzip; name=0002-Immediately-WAL-log-assignments-20190114.patch.gzDownload
0003-Issue-individual-invalidations-with-wal_lev-20190114.patch.gzapplication/gzip; name=0003-Issue-individual-invalidations-with-wal_lev-20190114.patch.gzDownload
0004-Extend-the-output-plugin-API-with-stream-me-20190114.patch.gzapplication/gzip; name=0004-Extend-the-output-plugin-API-with-stream-me-20190114.patch.gzDownload
0005-Implement-streaming-mode-in-ReorderBuffer-20190114.patch.gzapplication/gzip; name=0005-Implement-streaming-mode-in-ReorderBuffer-20190114.patch.gzDownload
0006-Add-support-for-streaming-to-built-in-repli-20190114.patch.gzapplication/gzip; name=0006-Add-support-for-streaming-to-built-in-repli-20190114.patch.gzDownload
0007-Track-statistics-for-streaming-spilling-20190114.patch.gzapplication/gzip; name=0007-Track-statistics-for-streaming-spilling-20190114.patch.gzDownload
0008-Enable-streaming-for-all-subscription-TAP-t-20190114.patch.gzapplication/gzip; name=0008-Enable-streaming-for-all-subscription-TAP-t-20190114.patch.gzDownload
0009-BUGFIX-set-final_lsn-for-subxacts-before-cl-20190114.patch.gzapplication/gzip; name=0009-BUGFIX-set-final_lsn-for-subxacts-before-cl-20190114.patch.gzDownload
0010-Add-TAP-test-for-streaming-vs.-DDL-20190114.patch.gzapplication/gzip; name=0010-Add-TAP-test-for-streaming-vs.-DDL-20190114.patch.gzDownload
#59Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#58)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jan 14, 2019 at 07:23:31PM +0100, Tomas Vondra wrote:

Attached is an updated patch series, merging fixes and changes to TAP
tests proposed by Alexey. I've merged the fixes into the appropriate
patches, and I've kept the TAP changes / new tests as separate patches
towards the end of the series.

Patch 4 of the latest set fails to apply, so I have moved the patch to
the next CF, waiting on author.
--
Michael

#60Alexey Kondratov
a.kondratov@postgrespro.ru
In reply to: Tomas Vondra (#58)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi Tomas,

On 14.01.2019 21:23, Tomas Vondra wrote:

Attached is an updated patch series, merging fixes and changes to TAP
tests proposed by Alexey. I've merged the fixes into the appropriate
patches, and I've kept the TAP changes / new tests as separate patches
towards the end of the series.

I had problems applying this patch along with the 2PC streaming one to
the current master, but everything applied cleanly on 97c39498e5.
Regression tests pass. What I personally do not like in the current TAP
test set is that you have added "WITH (streaming=on)" to all tests,
including the old non-streaming ones. It is unclear which mechanism is
tested there: streaming, but those transactions probably do not hit the
memory limit, so it depends on default server parameters; or
non-streaming, but then what is the need for (streaming=on)? I would
prefer to add (streaming=on) only to the new tests, where it is clearly
necessary.

I'm a bit unhappy with two aspects of the current patch series:

1) We now track schema changes in two ways - using the pre-existing
schema_sent flag in RelationSyncEntry, and the (newly added) flag in
ReorderBuffer. While those flags are used for regular vs. streamed
transactions, fundamentally they track the same thing, so having two
competing mechanisms seems like a bad idea. I'm not sure what the best
way to resolve this is, though.

Yes, sure; when I found problems with streaming of extensive DDL, I
added the new flag in the simplest way, and it worked. The old
schema_sent flag is per relation, while the new one - is_schema_sent -
is per top-level transaction. If I get it correctly, the former seems to
be more thrifty, since a new schema is sent only if we are streaming a
change for a relation whose schema is outdated. In contrast, in the
latter case we will send the new schema even if there are no new changes
belonging to that relation.
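
Schematically, the contrast looks like this (a simplified sketch; the
real structs have many more fields, this only shows where the two flags
live):

typedef struct RelationSyncEntry
{
    Oid     relid;          /* relation this entry describes */
    bool    schema_sent;    /* per-relation: resend the schema only when
                             * this relation's schema is outdated */
} RelationSyncEntry;

typedef struct ReorderBufferTXN
{
    TransactionId xid;      /* top-level transaction */
    bool    is_schema_sent; /* per-transaction: one flag for everything,
                             * so schemas may be resent even for relations
                             * with no new changes */
    /* ... */
} ReorderBufferTXN;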

I guess it would be better to stick to the old behavior. I will try to
investigate how to use it better in streaming mode as well.

2) We've removed quite a few asserts, particularly ones ensuring the
sanity of cmin/cmax values. To some extent that's expected, because
allowing decoding of in-progress transactions relaxes some of those
rules. But I'd be much happier if some of those asserts could be
reinstated, even if only in a weaker form.

Asserts have been removed from two places: (1)
HeapTupleSatisfiesHistoricMVCC, which seems inevitable, since we are
touching the essence of the MVCC visibility rules when trying to decode
an in-progress transaction, and (2) ReorderBufferBuildTupleCidHash,
which is probably not directly related to the topic of the ongoing
patch, since Arseny Sher recently faced the same issue with simple
repetitive DDL decoding [1].

There are not many, but I agree that replacing them with softer asserts
would be better than just removing them, especially for point (1).

[1]: /messages/by-id/874l9p8hyw.fsf@ars-thinkpad

Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

#61Alexey Kondratov
a.kondratov@postgrespro.ru
In reply to: Alexey Kondratov (#57)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi Tomas,

Interesting. Any idea where the extra overhead in this particular case
comes from? It's hard to deduce that from a single flame graph when I
don't have anything to compare it with (i.e. the flame graph for the
"normal" case).

I guess the bottleneck is in disk operations. You can check the
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) together take around 35% of CPU time. For comparison,
please see the attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk I/O is only ~7-8% in total. So we get very roughly the same
~x4-5 performance drop here. JFYI, I am using a machine with an SSD for
tests.

Therefore, you could probably write changes on the receiver in bigger
chunks, rather than writing each change separately.

Possibly. I/O is certainly a possible culprit, although we should be
using buffered I/O and there certainly are no fsyncs here. So I'm not
sure why it would be cheaper to do the writes in batches.

BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?

I run this on a single machine, but the walsender and worker each
utilize almost 100% of a CPU all the time, and on the apply side I/O
syscalls take about 1/3 of CPU time. I am still not sure, but to me this
result links the performance drop to problems on the receiver side.

Writing in batches was just a hypothesis, and to validate it I performed
a test with a large transaction consisting of a smaller number of wide
rows. That test does not exhibit any significant performance drop even
though it was streamed too, so the hypothesis seems valid. Anyway, I do
not have other reasonable ideas besides that right now.

I've recently checked this patch again and tried to improve it in terms
of performance. As a result, I've implemented a new POC version of the
applier (attached). Almost everything in the streaming logic stayed
intact, but the apply worker is significantly different.

As I wrote earlier, I still claim that spilling changes to disk on the
applier side adds overhead, but it is possible to get rid of it. In my
additional patch I do the following:

1) Maintain a pool of additional background workers (bgworkers) that are
connected to the main logical apply worker via shm_mq's. Each worker is
dedicated to processing a specific streamed transaction.

2) When we receive a streamed change for some transaction, we check
whether there is already a dedicated bgworker for it in the HTAB (xid ->
bgworker); otherwise we take one from the idle list or spawn a new one
(a rough sketch follows this list).

3) We pass all changes (between STREAM START/STOP) to that bgworker via
shm_mq_send without intermediate waiting. However, at STREAM STOP we
wait for the bgworker to apply the entire chunk of changes, since we
don't want transaction reordering.

4) When the transaction is committed/aborted, the worker is added to the
idle list and waits for a reassignment message.

5) I have used the same apply_dispatch machinery in the bgworkers, since
most of the actions are practically identical.
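
To sketch how steps 2 and 3 fit together (illustrative only -
StreamWorkerEntry, stream_workers and lookup_or_spawn_worker() are names
I made up for this example, not the actual POC code):

/*
 * Route a streamed change to its dedicated bgworker via shm_mq.
 */
#include "postgres.h"
#include "storage/shm_mq.h"
#include "utils/hsearch.h"

typedef struct StreamWorkerEntry
{
    TransactionId xid;          /* hash key: top-level xid of streamed xact */
    shm_mq_handle *mq;          /* queue to the dedicated bgworker */
} StreamWorkerEntry;

static HTAB *stream_workers;    /* xid -> StreamWorkerEntry */

/* Take a worker from the idle list or spawn a new one (invented helper). */
extern shm_mq_handle *lookup_or_spawn_worker(TransactionId xid);

static void
route_streamed_change(TransactionId xid, const char *change, Size len)
{
    bool        found;
    StreamWorkerEntry *entry;

    /* Step 2: find the dedicated bgworker, or grab an idle/new one. */
    entry = hash_search(stream_workers, &xid, HASH_ENTER, &found);
    if (!found)
        entry->mq = lookup_or_spawn_worker(xid);

    /*
     * Step 3: hand the change over; this blocks only if the queue is
     * full, not until the worker has actually applied the change.
     */
    if (shm_mq_send(entry->mq, len, change, false) != SHM_MQ_SUCCESS)
        ereport(ERROR,
                (errmsg("stream apply worker exited unexpectedly")));
}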

Thus, we do not spill anything on the applier side; transaction changes
are processed by bgworkers just as normal backends process them. At the
same time, change processing is strictly serial, which prevents
transaction reordering and possible conflicts/anomalies. Even though we
trade off performance in favor of stability, the result is rather
impressive. I have used a testing query similar to the one before:

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
    SELECT round(random()*10), random(), random()*142
    FROM generate_series(1, 1000000) s(i);

with 1kk (1000000), 3kk and 5kk rows; logical_work_mem = 64MB and
synchronous_standby_names = 'FIRST 1 (large_sub)'. The table schema is
as follows:

CREATE TABLE large_test (
    id serial primary key,
    num1 bigint,
    num2 double precision,
    num3 double precision
);

Here are the results:

-------------------------------------------------------------------
| N | Time on master, sec | Total xact time, sec |     Ratio      |
-------------------------------------------------------------------
|                        On commit (master, v13)                  |
-------------------------------------------------------------------
| 1kk | 6.5               | 17.6                 | x2.74          |
-------------------------------------------------------------------
| 3kk | 21                | 55.4                 | x2.64          |
-------------------------------------------------------------------
| 5kk | 38.3              | 91.5                 | x2.39          |
-------------------------------------------------------------------
|                        Stream + spill                           |
-------------------------------------------------------------------
| 1kk | 5.9               | 18                   | x3             |
-------------------------------------------------------------------
| 3kk | 19.5              | 52.4                 | x2.7           |
-------------------------------------------------------------------
| 5kk | 33.3              | 86.7                 | x2.86          |
-------------------------------------------------------------------
|                        Stream + BGW pool                        |
-------------------------------------------------------------------
| 1kk | 6                 | 12                   | x2             |
-------------------------------------------------------------------
| 3kk | 18.5              | 30.5                 | x1.65          |
-------------------------------------------------------------------
| 5kk | 35.6              | 53.9                 | x1.51          |
-------------------------------------------------------------------

It seems that the overhead added by the synchronous replica is 2-3
times lower than with Postgres master or with streaming plus spilling.
Thus, the original patch eliminated the delay before the sender starts
processing a large transaction, while this additional patch speeds up
the applier side.

Although the overall speed-up is surely measurable, there is still room
for improvement:

1) Currently bgworkers are only spawned on demand, without any initial
pool, and are never stopped. Maybe we should create a small pool at
replication start and shut down some idle bgworkers once they exceed
some limit?

2) Probably we can somehow track whether an incoming change conflicts
with any of the xacts currently being processed, so that we wait for
specific bgworkers only in that case?

3) Since the communication between the main logical apply worker and
each bgworker from the pool is a 'single producer --- single consumer'
problem, it is probably possible to wait and set/check flags without
locks, using just atomics; a minimal sketch follows below.
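As a rough illustration of point 3, here is a minimal
single-producer/single-consumer flag done with C11 atomics (standalone
demo under assumed names, not patch code; the real applier would wait
on a latch rather than busy-spin):

/*
 * SPSC "chunk applied" flag with C11 atomics instead of a spinlock.
 * Build: cc -O2 -pthread spsc.c -o spsc
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool chunk_applied = false;

/* Consumer side: the bgworker applying one streamed chunk. */
static void *
bgworker(void *arg)
{
	(void) arg;
	/* ... apply all changes between STREAM START/STOP here ... */
	atomic_store_explicit(&chunk_applied, true, memory_order_release);
	return NULL;
}

int
main(void)
{
	pthread_t	worker;

	pthread_create(&worker, NULL, bgworker, NULL);

	/*
	 * Producer side (main apply worker) at STREAM STOP: wait for the
	 * chunk to be fully applied without taking any lock.
	 */
	while (!atomic_load_explicit(&chunk_applied, memory_order_acquire))
		;

	printf("chunk applied, safe to start the next streamed chunk\n");
	pthread_join(worker, NULL);
	return 0;
}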

What do you think about this concept in general? Any concerns and
criticism are welcome!

Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

P.S. This patch should be applicable on top of your last patch set. I would rebase it against master, but it depends on the 2PC patch, which I don't know well enough.

Attachments:

0011-BGWorkers-pool-for-streamed-transactions-apply-witho.patch (text/x-patch)
From 11c7549d2732f2f983d4548a81cd509dd7e41ec4 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov <kondratov.aleksey@gmail.com>
Date: Wed, 28 Aug 2019 15:26:50 +0300
Subject: [PATCH 11/11] BGWorkers pool for streamed transactions apply without
 spilling on disk

---
 src/backend/postmaster/bgworker.c        |    3 +
 src/backend/postmaster/pgstat.c          |    3 +
 src/backend/replication/logical/proto.c  |   17 +-
 src/backend/replication/logical/worker.c | 1780 +++++++++++-----------
 src/include/pgstat.h                     |    1 +
 src/include/replication/logicalproto.h   |    4 +-
 src/include/replication/logicalworker.h  |    1 +
 7 files changed, 933 insertions(+), 876 deletions(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index f5db5a8c4a..6860df07ca 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -129,6 +129,9 @@ static const struct
 	},
 	{
 		"ApplyWorkerMain", ApplyWorkerMain
+	},
+	{
+		"LogicalApplyBgwMain", LogicalApplyBgwMain
 	}
 };
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e5a4d147a7..b32994784f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3637,6 +3637,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
 			event_name = "Hash/GrowBuckets/Reinserting";
 			break;
+		case WAIT_EVENT_LOGICAL_APPLY_WORKER_READY:
+			event_name = "LogicalApplyWorkerReady";
+			break;
 		case WAIT_EVENT_LOGICAL_SYNC_DATA:
 			event_name = "LogicalSyncData";
 			break;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 4bec9fe8b5..954ce7343a 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -789,14 +789,11 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendint64(out, txn->commit_time);
 }
 
-TransactionId
+void
 logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
 {
-	TransactionId	xid;
 	uint8			flags;
 
-	xid = pq_getmsgint(in, 4);
-
 	/* read flags (unused for now) */
 	flags = pq_getmsgbyte(in);
 
@@ -807,8 +804,6 @@ logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	commit_data->commit_lsn = pq_getmsgint64(in);
 	commit_data->end_lsn = pq_getmsgint64(in);
 	commit_data->committime = pq_getmsgint64(in);
-
-	return xid;
 }
 
 void
@@ -823,13 +818,3 @@ logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 	pq_sendint32(out, xid);
 	pq_sendint32(out, subxid);
 }
-
-void
-logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
-							 TransactionId *subxid)
-{
-	Assert(xid && subxid);
-
-	*xid = pq_getmsgint(in, 4);
-	*subxid = pq_getmsgint(in, 4);
-}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ca632b7dc4..dc6c895fca 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -92,11 +92,16 @@
 #include "rewrite/rewriteHandler.h"
 
 #include "storage/bufmgr.h"
+// #include "storage/condition_variable.h"
+#include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/shm_mq.h"
+#include "storage/shm_toc.h"
+#include "storage/spin.h"
 
 #include "tcop/tcopprot.h"
 
@@ -115,6 +120,54 @@
 #include "utils/syscache.h"
 
 #define NAPTIME_PER_CYCLE 1000	/* max sleep time between cycles (1s) */
+#define PG_LOGICAL_APPLY_SHM_MAGIC 0x79fb2447 // TODO Consider change
+
+typedef struct ParallelState
+{
+	slock_t	mutex;
+	// ConditionVariable cv;
+	bool	attached;
+	bool	ready;
+	bool	finished;
+	Oid		database_id;
+	Oid		authenticated_user_id;
+	Oid		subid;
+	Oid		stream_xid;
+	uint32	n;
+} ParallelState;
+
+typedef struct WorkerState
+{
+	TransactionId			 xid;
+	BackgroundWorkerHandle	*handle;
+	shm_mq_handle			*mq_handle;
+	dsm_segment				*dsm_seg;
+	ParallelState volatile	*pstate;
+} WorkerState;
+
+/* Apply workers hash table (initialized on first use) */
+static HTAB *ApplyWorkersHash = NULL;
+static WorkerState **ApplyWorkersIdleList = NULL;
+static uint32 pool_size = 10; /* MaxConnections default? */
+static uint32 nworkers = 0;
+static uint32 nfreeworkers = 0;
+
+/* Fields valid only for apply background workers */
+bool isLogicalApplyWorker = false;
+volatile ParallelState *MyParallelState = NULL;
+
+/* Worker setup and interactions */
+static void setup_dsm(WorkerState *wstate);
+static void setup_background_worker(WorkerState *wstate);
+static void cleanup_background_worker(dsm_segment *seg, Datum arg);
+static void handle_sigterm(SIGNAL_ARGS);
+
+static bool check_worker_status(WorkerState *wstate);
+static void wait_for_worker(WorkerState *wstate);
+static void wait_for_worker_to_finish(WorkerState *wstate);
+
+static WorkerState * find_or_start_worker(TransactionId xid, bool start);
+static void stop_worker(WorkerState *wstate);
 
 typedef struct FlushPosition
 {
@@ -143,47 +196,13 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
-/* fields valid only when processing streamed transaction */
+/* Fields valid only when processing streamed transaction */
 bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
-static int	stream_fd = -1;
-
-typedef struct SubXactInfo
-{
-	TransactionId xid;			/* XID of the subxact */
-	off_t		offset;			/* offset in the file */
-}			SubXactInfo;
-
-static uint32 nsubxacts = 0;
-static uint32 nsubxacts_max = 0;
-static SubXactInfo * subxacts = NULL;
-static TransactionId subxact_last = InvalidTransactionId;
-
-static void subxact_filename(char *path, Oid subid, TransactionId xid);
-static void changes_filename(char *path, Oid subid, TransactionId xid);
-
-/*
- * Information about subtransactions of a given toplevel transaction.
- */
-static void subxact_info_write(Oid subid, TransactionId xid);
-static void subxact_info_read(Oid subid, TransactionId xid);
-static void subxact_info_add(TransactionId xid);
-
-/*
- * Serialize and deserialize changes for a toplevel transaction.
- */
-static void stream_cleanup_files(Oid subid, TransactionId xid);
-static void stream_open_file(Oid subid, TransactionId xid, bool first);
-static void stream_write_change(char action, StringInfo s);
-static void stream_close_file(void);
-
-/*
- * Array of serialized XIDs.
- */
-static int	nxids = 0;
-static int	maxnxids = 0;
-static TransactionId	*xids = NULL;
+static TransactionId current_xid = InvalidTransactionId;
+static TransactionId prev_xid = InvalidTransactionId;
+static uint32 nchanges = 0;
 
 static bool handle_streamed_transaction(const char action, StringInfo s);
 
@@ -199,6 +218,16 @@ static volatile sig_atomic_t got_SIGHUP = false;
 /* prototype needed because of stream_commit */
 static void apply_dispatch(StringInfo s);
 
+// /* Debug only */
+// static void
+// iter_sleep(int seconds)
+// {
+// 	for (int i = 0; i < seconds; i++)
+// 	{
+// 		pg_usleep(1 * 1000L * 1000L);
+// 	}
+// }
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -250,6 +279,107 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Look up worker inside ApplyWorkersHash for requested xid.
+ * Throw error if not found or start a new one if start=true is passed.
+ */
+static WorkerState *
+find_or_start_worker(TransactionId xid, bool start)
+{
+	bool found;
+	WorkerState *entry = NULL;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* First time through, initialize apply workers hashtable */
+	if (ApplyWorkersHash == NULL)
+	{
+		HASHCTL		ctl;
+
+		MemSet(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(TransactionId);
+		ctl.entrysize = sizeof(WorkerState);
+		ctl.hcxt = ApplyContext; /* Allocate ApplyWorkersHash in the ApplyContext */
+		ApplyWorkersHash = hash_create("logical apply workers hash", 8,
+									 &ctl,
+									 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	Assert(ApplyWorkersHash != NULL);
+
+	/*
+	 * Find entry for requested transaction.
+	 */
+	entry = hash_search(ApplyWorkersHash, &xid, HASH_FIND, &found);
+
+	if (!found && start)
+	{
+		/* If there is at least one worker in the idle list, then take one. */
+		if (nfreeworkers > 0)
+		{
+			char action = 'R';
+
+			Assert(ApplyWorkersIdleList != NULL);
+
+			entry = ApplyWorkersIdleList[nfreeworkers - 1];
+			if (!hash_update_hash_key(ApplyWorkersHash,
+									  (void *) entry,
+									  (void *) &xid))
+				elog(ERROR, "could not reassign apply worker #%u entry from xid %u to xid %u",
+													entry->pstate->n, entry->xid, xid);
+
+			entry->xid = xid;
+			entry->pstate->finished = false;
+			entry->pstate->stream_xid = xid;
+			shm_mq_send(entry->mq_handle, 1, &action, false);
+
+			ApplyWorkersIdleList[--nfreeworkers] = NULL;
+		}
+		else
+		{
+			/* No entry in hash and no idle workers. Create a new one. */
+			entry = hash_search(ApplyWorkersHash, &xid, HASH_ENTER, &found);
+			entry->xid = xid;
+			setup_background_worker(entry);
+
+			if (nworkers == pool_size)
+			{
+				ApplyWorkersIdleList = repalloc(ApplyWorkersIdleList, sizeof(WorkerState *) * (pool_size + 10));
+				pool_size += 10;
+			}
+		}
+	}
+	else if (!found && !start)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				errmsg("could not find logical apply worker for xid %u", xid)));
+	else
+		elog(DEBUG5, "there is an existing logical apply worker for xid %u", xid);
+
+	Assert(entry != NULL);
+
+	return entry;
+}
+
+/*
+ * Gracefully teardown apply worker.
+ */
+static void
+stop_worker(WorkerState *wstate)
+{
+	/*
+	 * Sending zero-length data to worker in order to stop it.
+	 */
+	shm_mq_send(wstate->mq_handle, 0, NULL, false);
+
+	elog(LOG, "detaching DSM of apply worker #%u for xid %u",
+									wstate->pstate->n, wstate->xid);
+	dsm_detach(wstate->dsm_seg);
+
+	/* Delete worker entry */
+	(void) hash_search(ApplyWorkersHash, &wstate->xid, HASH_REMOVE, NULL);
+}
+
 /*
  * Handle streamed transactions.
  *
@@ -262,12 +392,12 @@ static bool
 handle_streamed_transaction(const char action, StringInfo s)
 {
 	TransactionId xid;
+	WorkerState *entry;
 
 	/* not in streaming mode */
-	if (!in_streamed_transaction)
+	if (!in_streamed_transaction || isLogicalApplyWorker)
 		return false;
 
-	Assert(stream_fd != -1);
 	Assert(TransactionIdIsValid(stream_xid));
 
 	/*
@@ -278,11 +408,16 @@ handle_streamed_transaction(const char action, StringInfo s)
 
 	Assert(TransactionIdIsValid(xid));
 
-	/* Add the new subxact to the array (unless already there). */
-	subxact_info_add(xid);
+	/*
+	 * Find worker for requested xid.
+	 */
+	entry = find_or_start_worker(stream_xid, false);
 
-	/* write the change to the current file */
-	stream_write_change(action, s);
+	// elog(LOG, "sending message of length=%d and raw=%s, action=%s", s->len, s->data, (char *) &action);
+	shm_mq_send(entry->mq_handle, s->len, s->data, false);
+	nchanges += 1;
+
+	// iter_sleep(3600);
 
 	return true;
 }
@@ -643,7 +778,8 @@ apply_handle_origin(StringInfo s)
 static void
 apply_handle_stream_start(StringInfo s)
 {
-	bool		first_segment;
+	bool		 first_segment;
+	WorkerState *entry;
 
 	Assert(!in_streamed_transaction);
 
@@ -652,17 +788,16 @@ apply_handle_stream_start(StringInfo s)
 
 	/* extract XID of the top-level transaction */
 	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+	nchanges = 0;
 
-	/* open the spool file for this transaction */
-	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+	/* Find worker for requested xid */
+	entry = find_or_start_worker(stream_xid, true);
 
-	/*
-	 * if this is not the first segment, open existing file
-	 *
-	 * XXX Note that the cleanup is performed by stream_open_file.
-	 */
-	if (!first_segment)
-		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+	SpinLockAcquire(&entry->pstate->mutex);
+	entry->pstate->ready = false;
+	SpinLockRelease(&entry->pstate->mutex);
+
+	elog(LOG, "starting streaming of xid %u", stream_xid);
 
 	pgstat_report_activity(STATE_RUNNING, NULL);
 }
@@ -673,16 +808,19 @@ apply_handle_stream_start(StringInfo s)
 static void
 apply_handle_stream_stop(StringInfo s)
 {
+	WorkerState *entry;
+	char action = 'E';
+
 	Assert(in_streamed_transaction);
 
-	/*
-	 * Close the file with serialized changes, and serialize information about
-	 * subxacts for the toplevel transaction.
-	 */
-	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
-	stream_close_file();
+	/* Find worker for requested xid */
+	entry = find_or_start_worker(stream_xid, false);
+
+	shm_mq_send(entry->mq_handle, 1, &action, false);
+	wait_for_worker(entry);
 
 	in_streamed_transaction = false;
+	elog(LOG, "stopped streaming of xid %u, %u changes streamed", stream_xid, nchanges);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
@@ -695,96 +833,67 @@ apply_handle_stream_abort(StringInfo s)
 {
 	TransactionId xid;
 	TransactionId subxid;
+	WorkerState *entry;
 
 	Assert(!in_streamed_transaction);
 
-	logicalrep_read_stream_abort(s, &xid, &subxid);
-
-	/*
-	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
-	 * just delete the files with serialized info.
-	 */
-	if (xid == subxid)
+	if(isLogicalApplyWorker)
 	{
-		char		path[MAXPGPATH];
+		subxid = pq_getmsgint(s, 4);
 
-		/*
-		 * XXX Maybe this should be an error instead? Can we receive abort for
-		 * a toplevel transaction we haven't received?
-		 */
+		ereport(LOG,
+				(errcode_for_file_access(),
+				errmsg("[Apply BGW #%u] aborting current transaction xid=%u, subxid=%u",
+				MyParallelState->n, GetCurrentTransactionIdIfAny(), GetCurrentSubTransactionId())));
 
-		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		if (subxid == stream_xid)
+			AbortCurrentTransaction();
+		else
+		{
+			char *spname = (char *) palloc(64 * sizeof(char));
+			sprintf(spname, "savepoint_for_xid_%u", subxid);
 
-		if (unlink(path) < 0)
-			ereport(ERROR,
+			ereport(LOG,
 					(errcode_for_file_access(),
-					 errmsg("could not remove file \"%s\": %m", path)));
+					errmsg("[Apply BGW #%u] rolling back to savepoint %s", MyParallelState->n, spname)));
 
-		subxact_filename(path, MyLogicalRepWorker->subid, xid);
-
-		if (unlink(path) < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not remove file \"%s\": %m", path)));
+			RollbackToSavepoint(spname);
+			CommitTransactionCommand();
+			// RollbackAndReleaseCurrentSubTransaction();
 
-		return;
+			pfree(spname);
+		}
 	}
 	else
 	{
-		/*
-		 * OK, so it's a subxact. We need to read the subxact file for the
-		 * toplevel transaction, determine the offset tracked for the subxact,
-		 * and truncate the file with changes. We also remove the subxacts
-		 * with higher offsets (or rather higher XIDs).
-		 *
-		 * We intentionally scan the array from the tail, because we're likely
-		 * aborting a change for the most recent subtransactions.
-		 *
-		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
-		 * would allow us to use binary search here.
-		 *
-		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
-		 * order, i.e. from the inner-most subxact (when nested)? In which
-		 * case we could simply check the last element.
-		 */
+		xid = pq_getmsgint(s, 4);
+		subxid = pq_getmsgint(s, 4);
 
-		int64		i;
-		int64		subidx;
-		bool		found = false;
-		char		path[MAXPGPATH];
+		/* Find worker for requested xid */
+		entry = find_or_start_worker(stream_xid, false);
 
-		subidx = -1;
-		subxact_info_read(MyLogicalRepWorker->subid, xid);
+		elog(LOG, "processing abort request of streamed transaction xid %u, subxid %u",
+			xid, subxid);
+		shm_mq_send(entry->mq_handle, s->len, s->data, false);
 
-		/* FIXME optimize the search by bsearch on sorted data */
-		for (i = nsubxacts; i > 0; i--)
+		if (subxid == stream_xid)
 		{
-			if (subxacts[i - 1].xid == subxid)
-			{
-				subidx = (i - 1);
-				found = true;
-				break;
-			}
-		}
-
-		/* We should not receive aborts for unknown subtransactions. */
-		Assert(found);
+			char action = 'F';
+			shm_mq_send(entry->mq_handle, 1, &action, false);
+			// shm_mq_send(entry->mq_handle, 0, NULL, false);
 
-		/* OK, truncate the file at the right offset. */
-		Assert((subidx >= 0) && (subidx < nsubxacts));
+			wait_for_worker_to_finish(entry);
 
-		changes_filename(path, MyLogicalRepWorker->subid, xid);
+			elog(LOG, "adding finished apply worker #%u for xid %u to the idle list",
+												entry->pstate->n, entry->xid);
+			ApplyWorkersIdleList[nfreeworkers++] = entry;
 
-		if (truncate(path, subxacts[subidx].offset))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not truncate file \"%s\": %m", path)));
+			// elog(LOG, "detaching DSM of apply worker for xid=%u\n", entry->xid);
+			// dsm_detach(entry->dsm_seg);
 
-		/* discard the subxacts added later */
-		nsubxacts = subidx;
-
-		/* write the updated subxact list */
-		subxact_info_write(MyLogicalRepWorker->subid, xid);
+			// /* Delete worker entry */
+			// (void) hash_search(ApplyWorkersHash, &xid, HASH_REMOVE, NULL);
+		}
 	}
 }
 
@@ -794,159 +903,56 @@ apply_handle_stream_abort(StringInfo s)
 static void
 apply_handle_stream_commit(StringInfo s)
 {
-	int			fd;
 	TransactionId xid;
-	StringInfoData s2;
-	int			nchanges;
-
-	char		path[MAXPGPATH];
-	char	   *buffer = NULL;
+	WorkerState *entry;
 	LogicalRepCommitData commit_data;
 
-	MemoryContext oldcxt;
-
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	/* open the spool file for the committed transaction */
-	changes_filename(path, MyLogicalRepWorker->subid, xid);
-
-	elog(DEBUG1, "replaying changes from file '%s'", path);
-
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
+	if (isLogicalApplyWorker)
 	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-	}
-
-	/* XXX Should this be allocated in another memory context? */
+		// logicalrep_read_stream_commit(s, &commit_data);
 
-	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
-
-	buffer = palloc(8192);
-	initStringInfo(&s2);
-
-	MemoryContextSwitchTo(oldcxt);
-
-	ensure_transaction();
-
-	/*
-	 * Make sure the handle apply_dispatch methods are aware we're in a remote
-	 * transaction.
-	 */
-	in_remote_transaction = true;
-	pgstat_report_activity(STATE_RUNNING, NULL);
-
-	/*
-	 * Read the entries one by one and pass them through the same logic as in
-	 * apply_dispatch.
-	 */
-	nchanges = 0;
-	while (true)
+		CommitTransactionCommand();
+	}
+	else
 	{
-		int			nbytes;
-		int			len;
-
-		/* read length of the on-disk record */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		nbytes = read(fd, &len, sizeof(len));
-		pgstat_report_wait_end();
-
-		/* have we reached end of the file? */
-		if (nbytes == 0)
-			break;
-
-		/* do we have a correct length? */
-		if (nbytes != sizeof(len))
-		{
-			int			save_errno = errno;
-
-			CloseTransientFile(fd);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file: %m")));
-			return;
-		}
-
-		Assert(len > 0);
+		char action = 'F';
 
-		/* make sure we have sufficiently large buffer */
-		buffer = repalloc(buffer, len);
-
-		/* and finally read the data into the buffer */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		if (read(fd, buffer, len) != len)
-		{
-			int			save_errno = errno;
-
-			CloseTransientFile(fd);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file: %m")));
-			return;
-		}
-		pgstat_report_wait_end();
+		Assert(!in_streamed_transaction);
 
-		/* copy the buffer to the stringinfo and call apply_dispatch */
-		resetStringInfo(&s2);
-		appendBinaryStringInfo(&s2, buffer, len);
+		xid = pq_getmsgint(s, 4);
+		logicalrep_read_stream_commit(s, &commit_data);
 
-		/* Ensure we are reading the data into our memory context. */
-		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+		elog(DEBUG1, "received commit for streamed transaction %u", xid);
 
-		apply_dispatch(&s2);
+		/* Find worker for requested xid */
+		entry = find_or_start_worker(xid, false);
 
-		MemoryContextReset(ApplyMessageContext);
+		/* Send commit message */
+		shm_mq_send(entry->mq_handle, s->len, s->data, false);
 
-		MemoryContextSwitchTo(oldcxt);
+		/* Notify worker, that we are done with this xact */
+		shm_mq_send(entry->mq_handle, 1, &action, false);
 
-		nchanges++;
+		wait_for_worker_to_finish(entry);
 
-		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
-				 nchanges, path);
+		elog(LOG, "adding finished apply worker #%u for xid %u to the idle list",
+											entry->pstate->n, entry->xid);
+		ApplyWorkersIdleList[nfreeworkers++] = entry;
 
 		/*
-		 * send feedback to upstream
-		 *
-		 * XXX Probably should send a valid LSN. But which one?
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
 		 */
-		send_feedback(InvalidXLogRecPtr, false, false);
-	}
-
-	CloseTransientFile(fd);
-
-	/*
-	 * Update origin state so we can restart streaming from correct
-	 * position in case of crash.
-	 */
-	replorigin_session_origin_lsn = commit_data.end_lsn;
-	replorigin_session_origin_timestamp = commit_data.committime;
-
-	CommitTransactionCommand();
-	pgstat_report_stat(false);
-
-	store_flush_position(commit_data.end_lsn);
-
-	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
-		 nchanges, path);
+		replorigin_session_origin_lsn = commit_data.end_lsn;
+		replorigin_session_origin_timestamp = commit_data.committime;
 
-	in_remote_transaction = false;
-	pgstat_report_activity(STATE_IDLE, NULL);
+		pgstat_report_stat(false);
 
-	/* unlink the files with serialized changes and subxact info */
-	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+		store_flush_position(commit_data.end_lsn);
 
-	pfree(buffer);
-	pfree(s2.data);
+		in_remote_transaction = false;
+		pgstat_report_activity(STATE_IDLE, NULL);
+	}
 }
 
 /*
@@ -965,6 +971,8 @@ apply_handle_relation(StringInfo s)
 	if (handle_streamed_transaction('R', s))
 		return;
 
+	// iter_sleep(3600);
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -1407,6 +1415,35 @@ apply_dispatch(StringInfo s)
 {
 	char		action = pq_getmsgbyte(s);
 
+	if (isLogicalApplyWorker)
+	{
+		/*
+		 * Inside a logical apply worker we can figure out that a new
+		 * subtransaction was started if a change arrives with a different xid.
+		 * In that case we define a named savepoint, so that we are able to
+		 * commit/rollback it separately later.
+		 */
+		current_xid = pq_getmsgint(s, 4);
+
+		if (prev_xid == InvalidTransactionId)
+			prev_xid = current_xid;
+		else if (current_xid != prev_xid && current_xid != stream_xid)
+		{
+			char *spname = (char *) palloc(64 * sizeof(char));
+			sprintf(spname, "savepoint_for_xid_%u", current_xid);
+
+			elog(LOG, "[Apply BGW #%u] defining savepoint %s", MyParallelState->n, spname);
+
+			DefineSavepoint(spname);
+			CommitTransactionCommand();
+			// BeginInternalSubTransaction(NULL);
+		}
+
+		prev_xid = current_xid;
+	}
+	// else
+	// 	elog(LOG, "Logical worker: applying dispatch for action=%s", (char *) &action);
+
 	switch (action)
 	{
 			/* BEGIN */
@@ -1435,6 +1472,7 @@ apply_dispatch(StringInfo s)
 			break;
 			/* RELATION */
 		case 'R':
+			// elog(LOG, "%s worker: applying dispatch for action=R", isLogicalApplyWorker ? "Apply" : "Logical");
 			apply_handle_relation(s);
 			break;
 			/* TYPE */
@@ -1565,12 +1603,18 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 static void
 worker_onexit(int code, Datum arg)
 {
-	int	i;
+	HASH_SEQ_STATUS status;
+	WorkerState *entry;
 
-	elog(LOG, "cleanup files for %d transactions", nxids);
-
-	for (i = nxids-1; i >= 0; i--)
-		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+	if (ApplyWorkersHash != NULL)
+	{
+		hash_seq_init(&status, ApplyWorkersHash);
+		while ((entry = (WorkerState *) hash_seq_search(&status)) != NULL)
+		{
+			stop_worker(entry);
+		}
+		hash_seq_term(&status);
+	}
 }
 
 /*
@@ -1593,6 +1637,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
+	ApplyWorkersIdleList = palloc(sizeof(WorkerState *) * pool_size);
+
 	for (;;)
 	{
 		pgsocket	fd = PGINVALID_SOCKET;
@@ -1904,8 +1950,9 @@ maybe_reread_subscription(void)
 	Subscription *newsub;
 	bool		started_tx = false;
 
+	// TODO Probably we have to handle subscription reread in apply workers too.
 	/* When cache state is valid there is nothing to do here. */
-	if (MySubscriptionValid)
+	if (MySubscriptionValid || isLogicalApplyWorker)
 		return;
 
 	/* This function might be called inside or outside of transaction. */
@@ -2039,608 +2086,50 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
-/*
- * subxact_info_write
- *	  Store information about subxacts for a toplevel transaction.
- *
- * For each subxact we store offset of it's first change in the main file.
- * The file is always over-written as a whole, and we also include CRC32C
- * checksum of the information.
- *
- * XXX We should only store subxacts that were not aborted yet.
- *
- * XXX Maybe we should only include the checksum when the cluster is
- * initialized with checksums?
- *
- * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
- */
+/* SIGHUP: set flag to reload configuration at next convenient time */
 static void
-subxact_info_write(Oid subid, TransactionId xid)
+logicalrep_worker_sighup(SIGNAL_ARGS)
 {
-	int			fd;
-	char		path[MAXPGPATH];
-	uint32		checksum;
-	Size		len;
-
-	Assert(TransactionIdIsValid(xid));
-
-	subxact_filename(path, subid, xid);
-
-	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	len = sizeof(SubXactInfo) * nsubxacts;
-
-	/* compute the checksum */
-	INIT_CRC32C(checksum);
-	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
-	COMP_CRC32C(checksum, (char *) subxacts, len);
-	FIN_CRC32C(checksum);
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
-
-	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
-	{
-		int			save_errno = errno;
+	int			save_errno = errno;
 
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
-		return;
-	}
+	got_SIGHUP = true;
 
-	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
-	{
-		int			save_errno = errno;
+	/* Waken anything waiting on the process latch */
+	SetLatch(MyLatch);
 
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
-		return;
-	}
+	errno = save_errno;
+}
 
-	if ((len > 0) && (write(fd, subxacts, len) != len))
-	{
-		int			save_errno = errno;
+/* Logical Replication Apply worker entry point */
+void
+ApplyWorkerMain(Datum main_arg)
+{
+	int			worker_slot = DatumGetInt32(main_arg);
+	MemoryContext oldctx;
+	char		originname[NAMEDATALEN];
+	XLogRecPtr	origin_startpos;
+	char	   *myslotname;
+	WalRcvStreamOptions options;
 
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
-		return;
-	}
+	/* Attach to slot */
+	logicalrep_worker_attach(worker_slot);
 
-	pgstat_report_wait_end();
+	/* Setup signal handling */
+	pqsignal(SIGHUP, logicalrep_worker_sighup);
+	pqsignal(SIGTERM, die);
+	BackgroundWorkerUnblockSignals();
 
 	/*
-	 * We don't need to fsync or anything, as we'll recreate the files after a
-	 * crash from scratch. So just close the file.
+	 * We don't currently need any ResourceOwner in a walreceiver process, but
+	 * if we did, we could call CreateAuxProcessResourceOwner here.
 	 */
-	CloseTransientFile(fd);
 
-	/*
-	 * But we free the memory allocated for subxact info. There might be one
-	 * exceptional transaction with many subxacts, and we don't want to keep
-	 * the memory allocated forewer.
-	 *
-	 */
-	if (subxacts)
-		pfree(subxacts);
+	/* Initialise stats to a sanish value */
+	MyLogicalRepWorker->last_send_time = MyLogicalRepWorker->last_recv_time =
+		MyLogicalRepWorker->reply_time = GetCurrentTimestamp();
 
-	subxacts = NULL;
-	subxact_last = InvalidTransactionId;
-	nsubxacts = 0;
-	nsubxacts_max = 0;
-}
-
-/*
- * subxact_info_read
- *	  Restore information about subxacts of a streamed transaction.
- *
- * Read information about subxacts into the global variables, and while
- * reading the information verify the checksum.
- *
- * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
- *
- * XXX Do we need to allocate it in TopMemoryContext?
- */
-static void
-subxact_info_read(Oid subid, TransactionId xid)
-{
-	int			fd;
-	char		path[MAXPGPATH];
-	uint32		checksum;
-	uint32		checksum_new;
-	Size		len;
-	MemoryContext oldctx;
-
-	Assert(TransactionIdIsValid(xid));
-	Assert(!subxacts);
-	Assert(nsubxacts == 0);
-	Assert(nsubxacts_max == 0);
-
-	subxact_filename(path, subid, xid);
-
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
-
-	/* read the checksum */
-	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	/* read number of subxact items */
-	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	pgstat_report_wait_end();
-
-	len = sizeof(SubXactInfo) * nsubxacts;
-
-	/* we keep the maximum as a power of 2 */
-	nsubxacts_max = 1 << my_log2(nsubxacts);
-
-	/* subxacts are long-lived */
-	oldctx = MemoryContextSwitchTo(TopMemoryContext);
-	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
-	MemoryContextSwitchTo(oldctx);
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
-
-	if ((len > 0) && ((read(fd, subxacts, len)) != len))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	pgstat_report_wait_end();
-
-	/* recompute the checksum */
-	INIT_CRC32C(checksum_new);
-	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
-	COMP_CRC32C(checksum_new, (char *) subxacts, len);
-	FIN_CRC32C(checksum_new);
-
-	if (checksum_new != checksum)
-		ereport(ERROR,
-				(errmsg("checksum failure when reading subxacts")));
-
-	CloseTransientFile(fd);
-}
-
-/*
- * subxact_info_add
- *	  Add information about a subxact (offset in the main file).
- *
- * XXX Do we need to allocate it in TopMemoryContext?
- */
-static void
-subxact_info_add(TransactionId xid)
-{
-	int64		i;
-
-	/*
-	 * If the XID matches the toplevel transaction, we don't want to add it.
-	 */
-	if (stream_xid == xid)
-		return;
-
-	/*
-	 * In most cases we're checking the same subxact as we've already seen in
-	 * the last call, so make ure just ignore it (this change comes later).
-	 */
-	if (subxact_last == xid)
-		return;
-
-	/* OK, remember we're processing this XID. */
-	subxact_last = xid;
-
-	/*
-	 * Check if the transaction is already present in the array of subxact. We
-	 * intentionally scan the array from the tail, because we're likely adding
-	 * a change for the most recent subtransactions.
-	 *
-	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
-	 * would allow us to use binary search here.
-	 */
-	for (i = nsubxacts; i > 0; i--)
-	{
-		/* found, so we're done */
-		if (subxacts[i - 1].xid == xid)
-			return;
-	}
-
-	/* This is a new subxact, so we need to add it to the array. */
-
-	if (nsubxacts == 0)
-	{
-		MemoryContext oldctx;
-
-		nsubxacts_max = 128;
-		oldctx = MemoryContextSwitchTo(TopMemoryContext);
-		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
-		MemoryContextSwitchTo(oldctx);
-	}
-	else if (nsubxacts == nsubxacts_max)
-	{
-		nsubxacts_max *= 2;
-		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
-	}
-
-	subxacts[nsubxacts].xid = xid;
-	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
-
-	nsubxacts++;
-}
-
-/* format filename for file containing the info about subxacts */
-static void
-subxact_filename(char *path, Oid subid, TransactionId xid)
-{
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 *
-	 * Don't check for error from mkdir; it could fail if the directory
-	 * already exists (maybe someone else just did the same thing).  If
-	 * it doesn't work then we'll bomb out when opening the file
-	 */
-	mkdir(tempdirpath, S_IRWXU);
-
-	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
-			 tempdirpath, subid, xid);
-}
-
-/* format filename for file containing serialized changes */
-static void
-changes_filename(char *path, Oid subid, TransactionId xid)
-{
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 *
-	 * Don't check for error from mkdir; it could fail if the directory
-	 * already exists (maybe someone else just did the same thing).  If
-	 * it doesn't work then we'll bomb out when opening the file
-	 */
-	mkdir(tempdirpath, S_IRWXU);
-
-	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
-			 tempdirpath, subid, xid);
-}
-
-/*
- * stream_cleanup_files
- *	  Cleanup files for a subscription / toplevel transaction.
- *
- * Remove files with serialized changes and subxact info for a particular
- * toplevel transaction. Each subscription has a separate set of files.
- *
- * Note: The files may not exists, so handle ENOENT as non-error.
- *
- * TODO: Add missing_ok flag to specify in which cases it's OK not to
- * find the files, and when it's an error.
- */
-static void
-stream_cleanup_files(Oid subid, TransactionId xid)
-{
-	int			i;
-	char		path[MAXPGPATH];
-	bool		found = false;
-
-	subxact_filename(path, subid, xid);
-
-	if ((unlink(path) < 0) && (errno != ENOENT))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
-
-	changes_filename(path, subid, xid);
-
-	if ((unlink(path) < 0) && (errno != ENOENT))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
-
-	/*
-	 * Cleanup the XID from the array - find the XID in the array and
-	 * remove it by shifting all the remaining elements. The array is
-	 * bound to be fairly small (maximum number of in-progress xacts,
-	 * so max_connections + max_prepared_transactions) so simply loop
-	 * through the array and find index of the XID. Then move the rest
-	 * of the array by one element to the left.
-	 *
-	 * Notice we also call this from stream_open_file for first segment
-	 * of each transaction, to deal with possible left-overs after a
-	 * crash, so it's entirely possible not to find the XID in the
-	 * array here. In that case we don't remove anything.
-	 *
-	 * XXX Perhaps it'd be better to handle this automatically after a
-	 * restart, instead of doing it over and over for each transaction.
-	 */
-	for (i = 0; i < nxids; i++)
-	{
-		if (xids[i] == xid)
-		{
-			found = true;
-			break;
-		}
-	}
-
-	if (!found)
-		return;
-
-	/*
-	 * Move the last entry from the array to the place. We don't keep
-	 * the streamed transactions sorted or anything - we only expect 
-	 * a few of them in progress (max_connections + max_prepared_xacts)
-	 * so linear search is just fine.
-	 */
-	xids[i] = xids[nxids-1];
-	nxids--;
-}
-
-/*
- * stream_open_file
- *	  Open file we'll use to serialize changes for a toplevel transaction.
- *
- * Open a file for streamed changes from a toplevel transaction identified
- * by stream_xid (global variable). If it's the first chunk of streamed
- * changes for this transaction, perform cleanup by removing existing
- * files after a possible previous crash.
- *
- * This can only be called at the beginning of a "streaming" block, i.e.
- * between stream_start/stream_stop messages from the upstream.
- */
-static void
-stream_open_file(Oid subid, TransactionId xid, bool first_segment)
-{
-	char		path[MAXPGPATH];
-	int			flags;
-
-	Assert(in_streamed_transaction);
-	Assert(OidIsValid(subid));
-	Assert(TransactionIdIsValid(xid));
-	Assert(stream_fd == -1);
-
-	/*
-	 * If this is the first segment for this transaction, try removing
-	 * existing files (if there are any, possibly after a crash).
-	 */
-	if (first_segment)
-	{
-		MemoryContext	oldcxt;
-
-		/* XXX make sure there are no previous files for this transaction */
-		stream_cleanup_files(subid, xid);
-
-		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
-
-		/*
-		 * We need to remember the XIDs we spilled to files, so that we can
-		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
-		 *
-		 * The number of XIDs we may need to track is fairly small, because
-		 * we can only stream toplevel xacts (so limited by max_connections
-		 * and max_prepared_transactions), and we only stream the large ones.
-		 * So we simply keep the XIDs in an unsorted array. If the number of
-		 * xacts gets large for some reason (e.g. very high max_connections),
-		 * a more elaborate approach might be better - e.g. sorted array, to
-		 * speed-up the lookups.
-		 */
-		if (nxids == maxnxids)	/* array of XIDs is full */
-		{
-			if (!xids)
-			{
-				maxnxids = 64;
-				xids = palloc(maxnxids * sizeof(TransactionId));
-			}
-			else
-			{
-				maxnxids = 2 * maxnxids;
-				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
-			}
-		}
-
-		xids[nxids++] = xid;
-
-		MemoryContextSwitchTo(oldcxt);
-	}
-
-	changes_filename(path, subid, xid);
-
-	elog(DEBUG1, "opening file '%s' for streamed changes", path);
-
-	/*
-	 * If this is the first streamed segment, the file must not exist, so
-	 * make sure we're the ones creating it. Otherwise just open the file
-	 * for writing, in append mode.
-	 */
-	if (first_segment)
-		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
-	else
-		flags = (O_WRONLY | O_APPEND | PG_BINARY);
-
-	stream_fd = OpenTransientFile(path, flags);
-
-	if (stream_fd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-}
-
-/*
- * stream_close_file
- *	  Close the currently open file with streamed changes.
- *
- * This can only be called at the beginning of a "streaming" block, i.e.
- * between stream_start/stream_stop messages from the upstream.
- */
-static void
-stream_close_file(void)
-{
-	Assert(in_streamed_transaction);
-	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
-
-	CloseTransientFile(stream_fd);
-
-	stream_xid = InvalidTransactionId;
-	stream_fd = -1;
-}
-
-/*
- * stream_write_change
- *	  Serialize a change to a file for the current toplevel transaction.
- *
- * The change is serialied in a simple format, with length (not including
- * the length), action code (identifying the message type) and message
- * contents (without the subxact TransactionId value).
- *
- * XXX The subxact file includes CRC32C of the contents. Maybe we should
- * include something like that here too, but doing so will not be as
- * straighforward, because we write the file in chunks.
- */
-static void
-stream_write_change(char action, StringInfo s)
-{
-	int			len;
-
-	Assert(in_streamed_transaction);
-	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
-
-	/* total on-disk size, including the action type character */
-	len = (s->len - s->cursor) + sizeof(char);
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
-
-	/* first write the size */
-	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
-
-	/* then the action */
-	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
-
-	/* and finally the remaining part of the buffer (after the XID) */
-	len = (s->len - s->cursor);
-
-	if (write(stream_fd, &s->data[s->cursor], len) != len)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
-
-	pgstat_report_wait_end();
-}
-
-/* SIGHUP: set flag to reload configuration at next convenient time */
-static void
-logicalrep_worker_sighup(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	got_SIGHUP = true;
-
-	/* Waken anything waiting on the process latch */
-	SetLatch(MyLatch);
-
-	errno = save_errno;
-}
-
-/* Logical Replication Apply worker entry point */
-void
-ApplyWorkerMain(Datum main_arg)
-{
-	int			worker_slot = DatumGetInt32(main_arg);
-	MemoryContext oldctx;
-	char		originname[NAMEDATALEN];
-	XLogRecPtr	origin_startpos;
-	char	   *myslotname;
-	WalRcvStreamOptions options;
-
-	/* Attach to slot */
-	logicalrep_worker_attach(worker_slot);
-
-	/* Setup signal handling */
-	pqsignal(SIGHUP, logicalrep_worker_sighup);
-	pqsignal(SIGTERM, die);
-	BackgroundWorkerUnblockSignals();
-
-	/*
-	 * We don't currently need any ResourceOwner in a walreceiver process, but
-	 * if we did, we could call CreateAuxProcessResourceOwner here.
-	 */
-
-	/* Initialise stats to a sanish value */
-	MyLogicalRepWorker->last_send_time = MyLogicalRepWorker->last_recv_time =
-		MyLogicalRepWorker->reply_time = GetCurrentTimestamp();
-
-	/* Load the libpq-specific functions */
-	load_file("libpqwalreceiver", false);
+	/* Load the libpq-specific functions */
+	load_file("libpqwalreceiver", false);
 
 	/* Run as replica session replication role. */
 	SetConfigOption("session_replication_role", "replica",
@@ -2798,3 +2287,580 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Apply Background Worker main loop.
+ */
+void
+LogicalApplyBgwMain(Datum main_arg)
+{
+	volatile ParallelState *pst;
+
+	dsm_segment			*seg;
+	shm_toc				*toc;
+	PGPROC				*registrant;
+	shm_mq				*mq;
+	shm_mq_handle		*mqh;
+	shm_mq_result		 shmq_res;
+	// ConditionVariable	 cv;
+	LogicalRepWorker	 lrw;
+	MemoryContext		 oldcontext;
+
+	MemoryContextSwitchTo(TopMemoryContext);
+
+	/* Load the subscription into persistent memory context. */
+	ApplyContext = AllocSetContextCreate(TopMemoryContext,
+										 "ApplyContext",
+										 ALLOCSET_DEFAULT_SIZES);
+
+	oldcontext = MemoryContextSwitchTo(ApplyContext);
+
+	/*
+	 * Init the ApplyMessageContext which we clean up after each replication
+	 * protocol message.
+	 */
+	ApplyMessageContext = AllocSetContextCreate(ApplyContext,
+												"ApplyMessageContext",
+												ALLOCSET_DEFAULT_SIZES);
+
+	isLogicalApplyWorker = true;
+
+	/*
+	 * Establish signal handlers.
+	 *
+	 * We want CHECK_FOR_INTERRUPTS() to kill off this worker process just as
+	 * it would a normal user backend.  To make that happen, we establish a
+	 * signal handler that is a stripped-down version of die().
+	 */
+	pqsignal(SIGTERM, handle_sigterm);
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Connect to the dynamic shared memory segment.
+	 *
+	 * The backend that registered this worker passed us the ID of a shared
+	 * memory segment to which we must attach for further instructions.  In
+	 * order to attach to dynamic shared memory, we need a resource owner.
+	 * Once we've mapped the segment in our address space, attach to the table
+	 * of contents so we can locate the various data structures we'll need to
+	 * find within the segment.
+	 */
+	CurrentResourceOwner = ResourceOwnerCreate(NULL, "Logical apply worker");
+	seg = dsm_attach(DatumGetInt32(main_arg));
+	if (seg == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to map dynamic shared memory segment")));
+	toc = shm_toc_attach(PG_LOGICAL_APPLY_SHM_MAGIC, dsm_segment_address(seg));
+	if (toc == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("bad magic number in dynamic shared memory segment")));
+
+	/*
+	 * Acquire a worker number.
+	 *
+	 * By convention, the process registering this background worker should
+	 * have stored the control structure at key 0.  We look up that key to
+	 * find it.  Our worker number gives our identity: there may be just one
+	 * worker involved in this parallel operation, or there may be many.
+	 */
+	pst = shm_toc_lookup(toc, 0, false);
+	MyParallelState = pst;
+
+	SpinLockAcquire(&pst->mutex);
+	pst->attached = true;
+	SpinLockRelease(&pst->mutex);
+
+	/*
+	 * Attach to the message queue.
+	 */
+	mq = shm_toc_lookup(toc, 1, false);
+	shm_mq_set_receiver(mq, MyProc);
+	mqh = shm_mq_attach(mq, seg, NULL);
+
+	/* Restore database connection. */
+	BackgroundWorkerInitializeConnectionByOid(pst->database_id,
+											  pst->authenticated_user_id, 0);
+
+	/*
+	 * Set the client encoding to the database encoding, since that is what
+	 * the leader will expect.
+	 */
+	SetClientEncoding(GetDatabaseEncoding());
+
+	lrw.subid = pst->subid;
+	MyLogicalRepWorker = &lrw;
+
+	stream_xid = pst->stream_xid;
+
+	StartTransactionCommand();
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	// PushActiveSnapshot(GetTransactionSnapshot());
+
+	MySubscription = GetSubscription(MyLogicalRepWorker->subid, true);
+
+	/*
+	 * Indicate that we're fully initialized and ready to begin the main part
+	 * of the parallel operation.
+	 *
+	 * Once we signal that we're ready, the user backend is entitled to assume
+	 * that our on_dsm_detach callbacks will fire before we disconnect from
+	 * the shared memory segment and exit.  Generally, that means we must have
+	 * attached to all relevant dynamic shared memory data structures by now.
+	 */
+	SpinLockAcquire(&pst->mutex);
+	pst->ready = true;
+	// cv = pst->cv;
+	// if (pst->workers_ready == pst->workers_total)
+	// {
+	//	 registrant = BackendPidGetProc(MyBgworkerEntry->bgw_notify_pid);
+	//	 if (registrant == NULL)
+	//	 {
+	//		 elog(DEBUG1, "registrant backend has exited prematurely");
+	//		 proc_exit(1);
+	//	 }
+	//	 SetLatch(&registrant->procLatch);
+	// }
+	SpinLockRelease(&pst->mutex);
+	elog(LOG, "[Apply BGW #%u] started", pst->n);
+
+	registrant = BackendPidGetProc(MyBgworkerEntry->bgw_notify_pid);
+	SetLatch(&registrant->procLatch);
+
+	for (;;)
+	{
+		void *data;
+		Size  len;
+		StringInfoData s;
+		MemoryContext	oldctx;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx = MemoryContextSwitchTo(ApplyMessageContext);
+
+		shmq_res = shm_mq_receive(mqh, &len, &data, false);
+
+		if (shmq_res != SHM_MQ_SUCCESS)
+			break;
+
+		if (len == 0)
+		{
+			elog(LOG, "[Apply BGW #%u] got zero-length message, stopping", pst->n);
+			break;
+		}
+		else
+		{
+			s.cursor = 0;
+			s.maxlen = -1;
+			s.data = (char *) data;
+			s.len = len;
+
+			/*
+			 * We use the first byte of the message for additional communication
+			 * between the main logical replication worker and apply BGWorkers,
+			 * so if it differs from 'w', process it first.
+			 */
+			switch (pq_getmsgbyte(&s))
+			{
+				/* Stream stop */
+				case 'E':
+				{
+					in_remote_transaction = false;
+
+					SpinLockAcquire(&pst->mutex);
+					pst->ready = true;
+					SpinLockRelease(&pst->mutex);
+					SetLatch(&registrant->procLatch);
+
+					elog(LOG, "[Apply BGW #%u] ended processing streaming chunk, waiting on shm_mq_receive", pst->n);
+
+					continue;
+				}
+				/* Reassign to the new transaction */
+				case 'R':
+				{
+					elog(LOG, "[Apply BGW #%u] switching from processing xid %u to xid %u",
+											pst->n, stream_xid, pst->stream_xid);
+					stream_xid = pst->stream_xid;
+
+					StartTransactionCommand();
+					BeginTransactionBlock();
+					CommitTransactionCommand();
+					StartTransactionCommand();
+
+					MySubscription = GetSubscription(MyLogicalRepWorker->subid, true);
+
+					continue;
+				}
+				/* Finished processing xact */
+				case 'F':
+				{
+					elog(LOG, "[Apply BGW #%u] finished processing xact %u", pst->n, stream_xid);
+
+					MemoryContextSwitchTo(ApplyContext);
+
+					CommitTransactionCommand();
+					EndTransactionBlock();
+					CommitTransactionCommand();
+
+					SpinLockAcquire(&pst->mutex);
+					pst->finished = true;
+					SpinLockRelease(&pst->mutex);
+
+					continue;
+				}
+				default:
+					break;
+			}
+
+			pq_getmsgint64(&s); // Read LSN info
+			pq_getmsgint64(&s); // TODO Do we need to process it here again somehow?
+			pq_getmsgint64(&s);
+
+			/*
+			 * Make sure the handle apply_dispatch methods are aware we're in a remote
+			 * transaction.
+			 */
+			in_remote_transaction = true;
+			pgstat_report_activity(STATE_RUNNING, NULL);
+
+			elog(DEBUG5, "[Apply BGW #%u] applying dispatch for action=%s",
+									pst->n, (char *) &s.data[s.cursor]);
+			apply_dispatch(&s);
+		}
+
+		MemoryContextSwitchTo(oldctx);
+		MemoryContextReset(ApplyMessageContext);
+	}
+
+	CommitTransactionCommand();
+	EndTransactionBlock();
+	CommitTransactionCommand();
+
+	MemoryContextSwitchTo(oldcontext);
+	MemoryContextReset(ApplyContext);
+
+	SpinLockAcquire(&pst->mutex);
+	pst->finished = true;
+	// if (pst->workers_finished == pst->workers_total)
+	// {
+	//	 registrant = BackendPidGetProc(MyBgworkerEntry->bgw_notify_pid);
+	//	 if (registrant == NULL)
+	//	 {
+	//		 elog(DEBUG1, "registrant backend has exited prematurely");
+	//		 proc_exit(1);
+	//	 }
+	//	 SetLatch(&registrant->procLatch);
+	// }
+	SpinLockRelease(&pst->mutex);
+
+	elog(LOG, "[Apply BGW #%u] exiting", pst->n);
+
+	/* Signal main process that we are done. */
+	// ConditionVariableBroadcast(&cv);
+	SetLatch(&registrant->procLatch);
+
+	/*
+	 * We're done.  Explicitly detach the shared memory segment so that we
+	 * don't get a resource leak warning at commit time.  This will fire any
+	 * on_dsm_detach callbacks we've registered, as well.  Once that's done,
+	 * we can go ahead and exit.
+	 */
+	dsm_detach(seg);
+	proc_exit(0);
+}
+
+/*
+ * When we receive a SIGTERM, we set InterruptPending and ProcDiePending just
+ * like a normal backend.  The next CHECK_FOR_INTERRUPTS() will do the right
+ * thing.
+ */
+static void
+handle_sigterm(SIGNAL_ARGS)
+{
+	int save_errno = errno;
+
+	SetLatch(MyLatch);
+
+	if (!proc_exit_inprogress)
+	{
+		InterruptPending = true;
+		ProcDiePending = true;
+	}
+
+	errno = save_errno;
+}
+
+/*
+ * Set up a dynamic shared memory segment.
+ *
+ * We set up a control region that contains a ParallelState, plus one
+ * region for the message queue used to communicate with this worker
+ * (each worker gets its own DSM segment and queue).
+ */
+static void
+setup_dsm(WorkerState *wstate)
+{
+	shm_toc_estimator	 e;
+	int					 toc_key = 0;
+	Size				 segsize;
+	dsm_segment			*seg;
+	shm_toc				*toc;
+	ParallelState		*pst;
+	shm_mq				*mq;
+	int64				 queue_size = 160000000; /* ~160 MB for now */
+
+	/* Ensure a valid queue size. */
+	if (queue_size < 0 || ((uint64) queue_size) < shm_mq_minimum_size)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("queue size must be at least %zu bytes",
+						shm_mq_minimum_size)));
+	if (queue_size != ((Size) queue_size))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("queue size overflows size_t")));
+
+	/*
+	 * Estimate how much shared memory we need.
+	 *
+	 * Because the TOC machinery may choose to insert padding of oddly-sized
+	 * requests, we must estimate each chunk separately.
+	 *
+	 * We need one key to register the location of the header, and we need
+	 * nworkers keys to track the locations of the message queues.
+	 */
+	shm_toc_initialize_estimator(&e);
+	shm_toc_estimate_chunk(&e, sizeof(ParallelState));
+	shm_toc_estimate_chunk(&e, (Size) queue_size);
+
+	shm_toc_estimate_keys(&e, 1 + 1);
+	segsize = shm_toc_estimate(&e);
+
+	/* Create the shared memory segment and establish a table of contents. */
+	seg = dsm_create(shm_toc_estimate(&e), 0);
+	toc = shm_toc_create(PG_LOGICAL_APPLY_SHM_MAGIC, dsm_segment_address(seg),
+						 segsize);
+
+	/* Set up the header region. */
+	pst = shm_toc_allocate(toc, sizeof(ParallelState));
+	SpinLockInit(&pst->mutex);
+	pst->attached = false;
+	pst->ready = false;
+	pst->finished = false;
+	pst->database_id = MyDatabaseId;
+	pst->subid = MyLogicalRepWorker->subid;
+	pst->stream_xid = stream_xid;
+	pst->authenticated_user_id = GetAuthenticatedUserId();
+	pst->n = nworkers + 1;
+	// ConditionVariableInit(&pst->cv);
+
+	shm_toc_insert(toc, toc_key++, pst);
+
+	/* Set up the message queue for this worker. */
+	mq = shm_mq_create(shm_toc_allocate(toc, (Size) queue_size),
+						(Size) queue_size);
+	shm_toc_insert(toc, toc_key++, mq);
+	shm_mq_set_sender(mq, MyProc);
+
+	/* Attach the queues. */
+	wstate->mq_handle = shm_mq_attach(mq, seg, wstate->handle);
+
+	/* Return results to caller. */
+	wstate->dsm_seg = seg;
+	wstate->pstate = pst;
+}
+
+/*
+ * Register background workers.
+ */
+static void
+setup_background_worker(WorkerState *wstate)
+{
+	MemoryContext		oldcontext;
+	BackgroundWorker	worker;
+
+	elog(LOG, "setting up apply worker #%u", nworkers + 1);
+
+	/*
+	 * TOCHECK: We need the worker_state object and the background worker handles to
+	 * which it points to be allocated in TopMemoryContext rather than
+	 * ApplyMessageContext; otherwise, they'll be destroyed before the on_dsm_detach
+	 * hooks run.
+	 */
+	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+	setup_dsm(wstate);
+
+	/*
+	 * Arrange to kill all the workers if we abort before all workers are
+	 * finished hooking themselves up to the dynamic shared memory segment.
+	 *
+	 * If we die after all the workers have finished hooking themselves up to
+	 * the dynamic shared memory segment, we'll mark the two queues to which
+	 * we're directly connected as detached, and the worker(s) connected to
+	 * those queues will exit, marking any other queues to which they are
+	 * connected as detached.  This will cause any as-yet-unaware workers
+	 * connected to those queues to exit in their turn, and so on, until
+	 * everybody exits.
+	 *
+	 * But suppose the workers which are supposed to connect to the queues to
+	 * which we're directly attached exit due to some error before they
+	 * actually attach the queues.  The remaining workers will have no way of
+	 * knowing this.  From their perspective, they're still waiting for those
+	 * workers to start, when in fact they've already died.
+	 */
+	on_dsm_detach(wstate->dsm_seg, cleanup_background_worker,
+				  PointerGetDatum(wstate));
+
+	/* Configure a worker. */
+	MemSet(&worker, 0, sizeof(BackgroundWorker));
+
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+		BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_ConsistentState;
+	worker.bgw_restart_time = BGW_NEVER_RESTART;
+	worker.bgw_notify_pid = MyProcPid;
+	sprintf(worker.bgw_library_name, "postgres");
+	sprintf(worker.bgw_function_name, "LogicalApplyBgwMain");
+
+	worker.bgw_main_arg = UInt32GetDatum(dsm_segment_handle(wstate->dsm_seg));
+
+	/* Register the workers. */
+	snprintf(worker.bgw_name, BGW_MAXLEN,
+			"logical replication apply worker #%u for subscription %u",
+										nworkers + 1, MySubscription->oid);
+	if (!RegisterDynamicBackgroundWorker(&worker, &wstate->handle))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+					errmsg("could not register background process"),
+					errhint("You may need to increase max_worker_processes.")));
+
+	/* All done. */
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Wait for worker to become ready. */
+	wait_for_worker(wstate);
+
+	/*
+	 * Once we reach this point, all workers are ready.  We no longer need to
+	 * kill them if we die; they'll die on their own as the message queues
+	 * shut down.
+	 */
+	cancel_on_dsm_detach(wstate->dsm_seg, cleanup_background_worker,
+						 PointerGetDatum(wstate));
+
+	nworkers += 1;
+}
+
+static void
+cleanup_background_worker(dsm_segment *seg, Datum arg)
+{
+	WorkerState *wstate = (WorkerState *) DatumGetPointer(arg);
+
+	TerminateBackgroundWorker(wstate->handle);
+}
+
+static void
+wait_for_worker(WorkerState *wstate)
+{
+	bool result = false;
+
+	for (;;)
+	{
+		// ConditionVariable cv;
+		bool ready;
+
+		/* If the worker is ready, we have succeeded. */
+		SpinLockAcquire(&wstate->pstate->mutex);
+		ready = wstate->pstate->ready;
+		// cv = wstate->pstate->cv;
+		SpinLockRelease(&wstate->pstate->mutex);
+		if (ready)
+		{
+			result = true;
+			break;
+		}
+
+		/* If any workers (or the postmaster) have died, we have failed. */
+		if (!check_worker_status(wstate))
+		{
+			result = false;
+			break;
+		}
+
+		/* Wait for the workers to wake us up. */
+		// ConditionVariableSleep(&cv, WAIT_EVENT_LOGICAL_APPLY_WORKER_READY);
+
+		/* Wait to be signalled. */
+		WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+							WAIT_EVENT_LOGICAL_APPLY_WORKER_READY);
+
+		/* Reset the latch so we don't spin. */
+		ResetLatch(MyLatch);
+
+		/* An interrupt may have occurred while we were waiting. */
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	// ConditionVariableCancelSleep();
+
+	if (!result)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("one or more background workers failed to start")));
+}
+
+static bool
+check_worker_status(WorkerState *wstate)
+{
+	BgwHandleStatus status;
+	pid_t			pid;
+
+	status = GetBackgroundWorkerPid(wstate->handle, &pid);
+	if (status == BGWH_STOPPED || status == BGWH_POSTMASTER_DIED)
+		return false;
+
+	/* Otherwise, things still look OK. */
+	return true;
+}
+
+static void
+wait_for_worker_to_finish(WorkerState *wstate)
+{
+	elog(LOG, "waiting for apply worker #%u to finish processing xid %u",
+										wstate->pstate->n, wstate->xid);
+
+	for (;;)
+	{
+		// ConditionVariable cv;
+		bool finished;
+
+		/* If the worker is finished, we have succeeded. */
+		SpinLockAcquire(&wstate->pstate->mutex);
+		finished = wstate->pstate->finished;
+		// cv = wstate->pstate->cv;
+		SpinLockRelease(&wstate->pstate->mutex);
+		if (finished)
+		{
+			break;
+		}
+
+		/* Wait for the workers to wake us up. */
+		// ConditionVariableSleep(&cv, WAIT_EVENT_LOGICAL_APPLY_WORKER_READY);
+
+		/* Wait to be signalled. */
+		WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+							WAIT_EVENT_LOGICAL_APPLY_WORKER_READY);
+
+		/* Reset the latch so we don't spin. */
+		ResetLatch(MyLatch);
+
+		/* An interrupt may have occurred while we were waiting. */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3a89e23488..7c72db9e83 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -819,6 +819,7 @@ typedef enum
 	WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+	WAIT_EVENT_LOGICAL_APPLY_WORKER_READY,
 	WAIT_EVENT_LOGICAL_SYNC_DATA,
 	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
 	WAIT_EVENT_MQ_INTERNAL,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 802275311d..afb15c2736 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -122,12 +122,10 @@ extern TransactionId logicalrep_read_stream_stop(StringInfo in);
 
 extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn);
-extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+extern void logicalrep_read_stream_commit(StringInfo out,
 					   LogicalRepCommitData *commit_data);
 
 extern void logicalrep_write_stream_abort(StringInfo out,
 							  TransactionId xid, TransactionId subxid);
-extern void logicalrep_read_stream_abort(StringInfo in,
-							 TransactionId *xid, TransactionId *subxid);
 
 #endif							/* LOGICALREP_PROTO_H */
diff --git a/src/include/replication/logicalworker.h b/src/include/replication/logicalworker.h
index e9524aefd9..30ad40247d 100644
--- a/src/include/replication/logicalworker.h
+++ b/src/include/replication/logicalworker.h
@@ -13,6 +13,7 @@
 #define LOGICALWORKER_H
 
 extern void ApplyWorkerMain(Datum main_arg);
+extern void LogicalApplyBgwMain(Datum main_arg);
 
 extern bool IsLogicalWorker(void);
 
-- 
2.17.1

#62Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexey Kondratov (#61)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Aug 28, 2019 at 08:17:47PM +0300, Alexey Kondratov wrote:

Hi Tomas,

Interesting. Any idea where does the extra overhead in this particular
case come from? It's hard to deduce that from the single flame graph,
when I don't have anything to compare it with (i.e. the flame
graph for
the "normal" case).

I guess that bottleneck is in disk operations. You can check
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in summary. To compare,
please, see attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk IO is only ~7-8% in total. So we get very roughly the same
~x4-5 performance drop here. JFYI, I am using a machine with SSD
for tests.

Therefore, probably you may write changes on receiver in bigger chunks,
not each change separately.

Possibly. I/O is certainly a possible culprit, although we should be
using buffered I/O and there certainly are not any fsyncs here. So I'm
not sure why it would be cheaper to do the writes in batches.

BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?

I run this on a single machine, but walsender and worker are utilizing
almost 100% of CPU per process all the time, and at the apply side I/O
syscalls take about 1/3 of CPU time. Though I am still not sure, for me
this result somehow links the performance drop with problems at the
receiver side.

Writing in batches was just a hypothesis, and to validate it I have
performed a test with a large txn consisting of a smaller number of wide
rows. This test does not exhibit any significant performance drop, while
it was streamed too. So the hypothesis seems to be valid. Anyway, I do
not have other reasonable ideas besides that right now.

I've recently checked this patch again and tried to evaluate it in terms
of performance. As a result I've implemented a new POC version of the
applier (attached). Almost everything in the streaming logic stayed
intact, but the apply worker is significantly different.

As I wrote earlier, I still claim that spilling changes to disk at the
applier side adds additional overhead, but it is possible to get rid of
it. In my additional patch I do the following:

1) Maintain a pool of additional background workers (bgworkers) that are
connected with the main logical apply worker via shm_mq's. Each worker
is dedicated to the processing of a specific streamed transaction.

2) When we receive a streamed change for some transaction, we check
whether there is an existing dedicated bgworker in the HTAB (xid ->
bgworker); if not, we take one from the idle list or spawn a new one.

3) We pass all changes (between STREAM START/STOP) to that bgworker via
shm_mq_send without intermediate waiting. However, at STREAM STOP we
wait for the bgworker to apply the entire chunk of changes, since we
don't want transaction reordering.

4) When a transaction is committed/aborted, the worker is added to the
idle list and waits for a reassignment message.

5) I have reused the same apply_dispatch machinery in the bgworkers,
since most of the actions are practically identical.

Thus, we do not spill anything at the applier side, and transaction
changes are processed by the bgworkers just as normal backends do. At
the same time, change processing is strictly serial, which prevents
transaction reordering and possible conflicts/anomalies. Even though we
trade off performance in favor of stability, the result is rather
impressive. I have used a similar query for testing as before:

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
    SELECT round(random()*10), random(), random()*142
    FROM generate_series(1, 1000000) s(i);

with 1kk (1000000), 3kk and 5kk rows; logical_work_mem = 64MB and
synchronous_standby_names = 'FIRST 1 (large_sub)'. The table schema is
as follows:

CREATE TABLE large_test (
    id serial primary key,
    num1 bigint,
    num2 double precision,
    num3 double precision
);

Here are the results:

---------------------------------------------------------------
| N   | Time on master, sec | Total xact time, sec | Ratio    |
---------------------------------------------------------------
|                   On commit (master, v13)                   |
---------------------------------------------------------------
| 1kk | 6.5                 | 17.6                 | x2.74    |
| 3kk | 21                  | 55.4                 | x2.64    |
| 5kk | 38.3                | 91.5                 | x2.39    |
---------------------------------------------------------------
|                       Stream + spill                        |
---------------------------------------------------------------
| 1kk | 5.9                 | 18                   | x3       |
| 3kk | 19.5                | 52.4                 | x2.7     |
| 5kk | 33.3                | 86.7                 | x2.86    |
---------------------------------------------------------------
|                     Stream + BGW pool                       |
---------------------------------------------------------------
| 1kk | 6                   | 12                   | x2       |
| 3kk | 18.5                | 30.5                 | x1.65    |
| 5kk | 35.6                | 53.9                 | x1.51    |
---------------------------------------------------------------

It seems that the overhead added by the synchronous replica is 2-3 times
lower compared with Postgres master and streaming with spilling.
Therefore, the original patch eliminated the delay before large
transaction processing starts on the sender, while this additional patch
speeds up the applier side.

Although the overall speed-up is surely measurable, there is room for
improvement yet:

1) Currently bgworkers are only spawned on demand, without an initial
pool, and are never stopped. Maybe we should create a small pool on
replication start and offload some of the idle bgworkers if they exceed
some limit?

2) Probably we can somehow track whether an incoming change conflicts
with some of the xacts being processed, so we would have to wait for
specific bgworkers only in that case?

3) Since the communication between the main logical apply worker and
each bgworker in the pool is a 'single producer --- single consumer'
problem, it is probably possible to wait and set/check flags without
locks, using just atomics. A rough sketch of that follows below.
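
For illustration, a minimal sketch of that lock-free wait (not part of
the attached patch; the ProgressCounters struct and the latch wiring are
hypothetical, assuming the consumer sets the producer's latch after each
update):

#include "port/atomics.h"

typedef struct ProgressCounters
{
	pg_atomic_uint32	n_sent;		/* bumped by the main apply worker */
	pg_atomic_uint32	n_applied;	/* bumped by the dedicated bgworker */
} ProgressCounters;

/* producer side, after each successful shm_mq_send() */
pg_atomic_fetch_add_u32(&counters->n_sent, 1);

/* consumer side, after applying one change (then SetLatch the producer) */
pg_atomic_fetch_add_u32(&counters->n_applied, 1);

/* producer side at STREAM STOP: wait until the queue is fully drained */
while (pg_atomic_read_u32(&counters->n_applied) !=
	   pg_atomic_read_u32(&counters->n_sent))
{
	WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0, 0);
	ResetLatch(MyLatch);
	CHECK_FOR_INTERRUPTS();
}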

What do you think about this concept in general? Any concerns and
criticism are welcome!

Hi Alexey,

I'm unable to do any in-depth review of the patch over the next two weeks
or so, but I think the idea of having a pool of apply workers is sound and
can be quite beneficial for some workloads.

I don't think it matters very much whether the workers are started at the
beginning or allocated ad hoc, that's IMO a minor implementation detail.

There's one huge challenge that I however don't see mentioned in your
message or in the patch (after a cursory reading) - ensuring the same
commit order, and the risk of introducing deadlocks that would not exist
in single-process apply.

Surely, we want to end up with the same commit order as on the upstream,
otherwise we might easily get different data on the subscriber. So when we
pass a large transaction to a separate process, that process has to wait
for the other processes applying transactions that committed first; and
similarly, other processes have to wait for this one, depending on the
commit order. I might have missed something, but I don't see anything
like that in your patch.

Essentially, this means there needs to be some sort of wait between those
apply processes, enforcing the commit order.
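
For illustration only, a minimal sketch of such a wait (the shared state
and field names here are hypothetical, nothing from the patch): each
transaction gets a ticket in upstream commit order, and an apply process
commits only once its ticket comes up.

/* before committing the transaction this process has applied */
for (;;)
{
	uint64	next;

	SpinLockAcquire(&shared->mutex);
	next = shared->next_commit_ticket;
	SpinLockRelease(&shared->mutex);

	if (next == my_commit_ticket)
		break;				/* our turn to commit */

	/*
	 * Note that a plain latch wait like this is invisible to the
	 * deadlock detector; for that, the wait would have to go through
	 * the lock manager.
	 */
	WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0, 0);
	ResetLatch(MyLatch);
	CHECK_FOR_INTERRUPTS();
}

CommitTransactionCommand();

/* advance the ticket and wake the others up */
SpinLockAcquire(&shared->mutex);
shared->next_commit_ticket++;
SpinLockRelease(&shared->mutex);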

That however means we can easily introduce deadlocks into workloads where
the serial-apply would not have that issue - imagine multiple large
transactions, touching the same set of rows. We may ship them to different
bgworkers, and those processes may deadlock.

Of course, the deadlock detector will come around (assuming the wait is
done in a way visible to the detector) and will abort one of the
processes. But we don't know it'll abort the right one - it may easily
abort the apply process that needs to commit first, while everyone else
is waiting for it. Which stalls the apply forever.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#63Alexey Kondratov
a.kondratov@postgrespro.ru
In reply to: Tomas Vondra (#62)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 28.08.2019 22:06, Tomas Vondra wrote:

Interesting. Any idea where does the extra overhead in this particular
case come from? It's hard to deduce that from the single flame graph,
when I don't have anything to compare it with (i.e. the flame graph for
the "normal" case).

I guess that bottleneck is in disk operations. You can check
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in summary. To compare,
please, see attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk IO is only ~7-8% in total. So we get very roughly the same
~x4-5 performance drop here. JFYI, I am using a machine with SSD
for tests.

Therefore, probably you may write changes on receiver in bigger chunks,
not each change separately.

Possibly. I/O is certainly a possible culprit, although we should be
using buffered I/O and there certainly are not any fsyncs here. So I'm
not sure why it would be cheaper to do the writes in batches.

BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?

I run this on a single machine, but walsender and worker are utilizing
almost 100% of CPU per process all the time, and at the apply side I/O
syscalls take about 1/3 of CPU time. Though I am still not sure, for me
this result somehow links the performance drop with problems at the
receiver side.

Writing in batches was just a hypothesis, and to validate it I have
performed a test with a large txn consisting of a smaller number of wide
rows. This test does not exhibit any significant performance drop, while
it was streamed too. So the hypothesis seems to be valid. Anyway, I do
not have other reasonable ideas besides that right now.

It seems that the overhead added by the synchronous replica is 2-3 times
lower compared with Postgres master and streaming with spilling.
Therefore, the original patch eliminated the delay before large
transaction processing starts on the sender, while this additional patch
speeds up the applier side.

Although the overall speed-up is surely measurable, there is room for
improvement yet:

1) Currently bgworkers are only spawned on demand, without an initial
pool, and are never stopped. Maybe we should create a small pool on
replication start and offload some of the idle bgworkers if they exceed
some limit?

2) Probably we can somehow track whether an incoming change conflicts
with some of the xacts being processed, so we would have to wait for
specific bgworkers only in that case?

3) Since the communication between the main logical apply worker and
each bgworker in the pool is a 'single producer --- single consumer'
problem, it is probably possible to wait and set/check flags without
locks, using just atomics.

What do you think about this concept in general? Any concerns and
criticism are welcome!

Hi Tomas,

Thank you for a quick response.

I don't think it matters very much whether the workers are started at the
beginning or allocated ad hoc, that's IMO a minor implementation detail.

OK, I had the same vision about this point. Any minor differences here
will be negligible for a sufficiently large transaction.

There's one huge challenge that I however don't see mentioned in your
message or in the patch (after a cursory reading) - ensuring the same
commit order, and the risk of introducing deadlocks that would not exist
in single-process apply.

Probably I haven't explained this part well, sorry for that. In my patch
I don't use the workers pool for concurrent transaction apply, but
rather for fast context switching between long-lived streamed
transactions. In other words, we apply all changes arriving from the
sender in a completely serial manner. Written out step by step (a
condensed sketch follows the list), it looks like:

1) Read STREAM START message and figure out the target worker by xid.

2) Put all changes belonging to this xact to the selected worker one by
one via shm_mq_send.

3) Read STREAM STOP message and wait until our worker has applied all
changes in the queue.

4) Process all other chunks of streamed xacts in the same manner.

5) Process all non-streamed xacts immediately in the main apply worker loop.

6) If we read STREAMED COMMIT/ABORT, we again wait until the selected
worker either commits or aborts.
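
A condensed sketch of the routing in steps 1-3 (the helper name
find_or_spawn_worker and the message-type letters are illustrative, not
the actual protocol bytes; wait_for_worker_to_finish and the real code
are in the attached patch):

/* one iteration of the main apply loop; s holds the received message */
char	action = pq_getmsgbyte(s);

switch (action)
{
	case 'S':				/* STREAM START */
		stream_xid = pq_getmsgint(s, 4);
		cur_worker = find_or_spawn_worker(stream_xid);	/* HTAB lookup */
		break;

	case 'E':				/* STREAM STOP */
		/* block until the worker drains its queue, keeping apply serial */
		wait_for_worker_to_finish(cur_worker);
		break;

	default:				/* a regular change inside the stream */
		shm_mq_send(cur_worker->mq_handle, s->len, s->data, false);
		break;
}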

Thus, it automatically guarantees the same commit order on the replica
as on master. Yes, we lose some performance here, since we don't apply
transactions concurrently, but applying them concurrently would bring
all those problems you have described.

However, you helped me to figure out another point I had forgotten.
Although we ensure commit order automatically, the beginnings of
streamed xacts may reorder. It happens if some small xacts have been
committed on master since the streamed one started, because we do not
start streaming immediately, but only after the logical_work_mem limit
is hit. I have performed some tests with conflicting xacts and it seems
that it's not a problem, since the locking mechanism in Postgres
guarantees that if there would be some deadlocks, they would have
happened earlier on master. So if some records hit the WAL, it is safe
to apply them sequentially. Am I wrong?

Anyway, I'm going to double check the safety of this part later.

Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

#64Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexey Kondratov (#63)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Aug 29, 2019 at 05:37:45PM +0300, Alexey Kondratov wrote:

On 28.08.2019 22:06, Tomas Vondra wrote:

Interesting. Any idea where does the extra overhead in this particular
case come from? It's hard to deduce that from the single flame graph,
when I don't have anything to compare it with (i.e. the flame graph for
the "normal" case).

I guess that bottleneck is in disk operations. You can check
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in summary. To compare,
please, see attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk IO is only ~7-8% in total. So we get very roughly the same
~x4-5 performance drop here. JFYI, I am using a machine with
SSD for tests.

Therefore, probably you may write changes on receiver in bigger chunks,
not each change separately.

Possibly. I/O is certainly a possible culprit, although we should be
using buffered I/O and there certainly are not any fsyncs here. So I'm
not sure why it would be cheaper to do the writes in batches.

BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?

I run this on a single machine, but walsender and worker are utilizing
almost 100% of CPU per process all the time, and at the apply side I/O
syscalls take about 1/3 of CPU time. Though I am still not sure, for me
this result somehow links the performance drop with problems at the
receiver side.

Writing in batches was just a hypothesis, and to validate it I have
performed a test with a large txn consisting of a smaller number of wide
rows. This test does not exhibit any significant performance drop, while
it was streamed too. So the hypothesis seems to be valid. Anyway, I do
not have other reasonable ideas besides that right now.

It seems that the overhead added by the synchronous replica is 2-3 times
lower compared with Postgres master and streaming with spilling.
Therefore, the original patch eliminated the delay before large
transaction processing starts on the sender, while this additional patch
speeds up the applier side.

Although the overall speed-up is surely measurable, there is room for
improvement yet:

1) Currently bgworkers are only spawned on demand, without an initial
pool, and are never stopped. Maybe we should create a small pool on
replication start and offload some of the idle bgworkers if they exceed
some limit?

2) Probably we can somehow track whether an incoming change conflicts
with some of the xacts being processed, so we would have to wait for
specific bgworkers only in that case?

3) Since the communication between the main logical apply worker and
each bgworker in the pool is a 'single producer --- single consumer'
problem, it is probably possible to wait and set/check flags without
locks, using just atomics.

What do you think about this concept in general? Any concerns and
criticism are welcome!

Hi Tomas,

Thank you for a quick response.

I don't think it matters very much whether the workers are started at the
beginning or allocated ad hoc, that's IMO a minor implementation detail.

OK, I had the same vision about this point. Any minor differences here
will be negligible for a sufficiently large transaction.

There's one huge challenge that I however don't see mentioned in your
message or in the patch (after a cursory reading) - ensuring the same
commit order, and the risk of introducing deadlocks that would not exist
in single-process apply.

Probably I haven't explained this part well, sorry for that. In my patch
I don't use the workers pool for concurrent transaction apply, but
rather for fast context switching between long-lived streamed
transactions. In other words, we apply all changes arriving from the
sender in a completely serial manner. Written out step by step, it looks
like:

1) Read STREAM START message and figure out the target worker by xid.

2) Put all changes belonging to this xact to the selected worker one by
one via shm_mq_send.

3) Read STREAM STOP message and wait until our worker has applied all
changes in the queue.

4) Process all other chunks of streamed xacts in the same manner.

5) Process all non-streamed xacts immediately in the main apply worker loop.

6) If we read STREAMED COMMIT/ABORT, we again wait until the selected
worker either commits or aborts.

Thus, it automatically guarantees the same commit order on the replica
as on master. Yes, we lose some performance here, since we don't apply
transactions concurrently, but applying them concurrently would bring
all those problems you have described.

OK, so it's apply in multiple processes, but at any moment only a single
apply process is active.

However, you helped me to figure out another point I had forgotten.
Although we ensure commit order automatically, the beginnings of
streamed xacts may reorder. It happens if some small xacts have been
committed on master since the streamed one started, because we do not
start streaming immediately, but only after the logical_work_mem limit
is hit. I have performed some tests with conflicting xacts and it seems
that it's not a problem, since the locking mechanism in Postgres
guarantees that if there would be some deadlocks, they would have
happened earlier on master. So if some records hit the WAL, it is safe
to apply them sequentially. Am I wrong?

I think you're right that the way you interleave the changes ensures you
can't introduce new deadlocks between transactions in this stream. I don't
think reordering the blocks of streamed transactions matters, as long as
the commit order is ensured in this case.

Anyway, I'm going to double check the safety of this part later.

OK.

FWIW my understanding is that the speedup comes mostly from elimination of
the serialization to a file. That however requires savepoints to handle
aborts of subtransactions - I'm pretty sure it'd be trivial to create a
workload where this will be much slower (with many aborts of large
subtransactions).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#65Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Tomas Vondra (#64)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

FWIW my understanding is that the speedup comes mostly from elimination of
the serialization to a file. That however requires savepoints to handle
aborts of subtransactions - I'm pretty sure it'd be trivial to create a
workload where this will be much slower (with many aborts of large
subtransactions).

I think that instead of defining savepoints it is simpler and more
efficient to use

BeginInternalSubTransaction +
ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction

as it is done in PL/pgSQL (pl_exec.c).
Not sure if it can pr
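
A minimal sketch of that pl_exec.c-style pattern, for reference
(simplified - exec_stmt_block() additionally saves and restores the
resource owner around these calls):

MemoryContext	oldcontext = CurrentMemoryContext;

BeginInternalSubTransaction(NULL);
/* BeginInternalSubTransaction leaves us in a new context; switch back */
MemoryContextSwitchTo(oldcontext);

PG_TRY();
{
	/* ... apply the changes belonging to this subxact ... */

	ReleaseCurrentSubTransaction();				/* subxact commit */
	MemoryContextSwitchTo(oldcontext);
}
PG_CATCH();
{
	RollbackAndReleaseCurrentSubTransaction();	/* subxact abort */
	MemoryContextSwitchTo(oldcontext);
	PG_RE_THROW();
}
PG_END_TRY();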

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#66Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#1)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

In the interest of moving things forward, how far are we from making
0001 committable? If I understand correctly, the rest of this patchset
depends on https://commitfest.postgresql.org/24/944/ which seems to be
moving at a glacial pace (or, actually, slower, because glaciers do
move, which cannot be said of that other patch.)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#67Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#66)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:

In the interest of moving things forward, how far are we from making
0001 committable? If I understand correctly, the rest of this patchset
depends on https://commitfest.postgresql.org/24/944/ which seems to be
moving at a glacial pace (or, actually, slower, because glaciers do
move, which cannot be said of that other patch.)

I think 0001 is mostly there. I think there's one bug in this patch
version, but I need to check and I'll post an updated version shortly if
needed.

FWIW maybe we should stop comparing things to glaciers. 50 years from now
people won't know what a glacier is, and it'll be just like the floppy
icon on the save button.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#68Alexey Kondratov
a.kondratov@postgrespro.ru
In reply to: Konstantin Knizhnik (#65)
2 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

FWIW my understanding is that the speedup comes mostly from elimination of
the serialization to a file. That however requires savepoints to handle
aborts of subtransactions - I'm pretty sure it'd be trivial to create a
workload where this will be much slower (with many aborts of large
subtransactions).

Yes, and it was my main motivation to eliminate that extra serialization
to file. I've experimented a bit with large transactions + savepoints +
aborts and ended up with the following query (the same schema as before
with 600k rows):

BEGIN;
SAVEPOINT s1;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
SAVEPOINT s2;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
SAVEPOINT s3;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
ROLLBACK TO SAVEPOINT s3;
ROLLBACK TO SAVEPOINT s2;
ROLLBACK TO SAVEPOINT s1;
END;

It looks like the worst-case scenario, as we do a lot of work and then
abort all subxacts one by one. As expected, it takes much longer (up to
x30) to process using a background worker instead of spilling to file.
Surely, it is much easier to truncate a file than to apply all changes
and then abort. However, I guess that this kind of load pattern is not
the most typical for real-life applications.

Also, this test helped me to find a bug in my current savepoints routine,
so a new patch is attached.

On 30.08.2019 18:59, Konstantin Knizhnik wrote:

I think that instead of defining savepoints it is simpler and more
efficient to use

BeginInternalSubTransaction +
ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction

as it is done in PL/pgSQL (pl_exec.c).
Not sure if it can pr

Both BeginInternalSubTransaction and DefineSavepoint use
PushTransaction() internally for a normal subtransaction start. So they
seem to be identical from a performance perspective, which is also
stated in the comment:

/*
 * BeginInternalSubTransaction
 *		This is the same as DefineSavepoint except it allows TBLOCK_STARTED,
 *		TBLOCK_IMPLICIT_INPROGRESS, TBLOCK_END, and TBLOCK_PREPARE states,
 *		and therefore it can safely be used in functions that might be called
 *		when not inside a BEGIN block or when running deferred triggers at
 *		COMMIT/PREPARE time.  Also, it automatically does
 *		CommitTransactionCommand/StartTransactionCommand instead of expecting
 *		the caller to do it.
 */

Please, correct me if I'm wrong.

Anyway, I've profiled my apply worker (flamegraph is attached) and it
spends the vast majority of its time (>90%) applying changes. So the
problem is not in the savepoints themselves, but in the fact that we
first apply all changes and then abort all the work. Not sure that it is
possible to do anything about this case.

Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

Attachments:

worker_aborts_perf.zip (application/zip)
���)1��N�)Q�����7���O8CY�MN���4 �T����D��t����?EZ��=���vC��dq�q=�d�u��<����t^`G*����Q�D��J�\�������M	���a�&�N]��v����/I�3e�����j��A*q�GM���w�ey�����T�����e�R�����(�tFY�5G��F��t�A}.��D��.�������d��?���G�A��4�f��#����m�dZ��
Oc�u��u����"	��\����$�kL����V�u�5.?�Sw��'���<����Q�#�5Q/���@W6�lT�nmN�w!���"��E"�83����l��y���xt�Z��h��v=������I!��2��z-���e��/�W��l�'�2�,aw�q0B!���L/���{.�9����#Bg�W.�~�qE�^�@`K A2�?����[|4���m{d�u�`���=�D������$��u�����l�x/���������4�yY>��������kg�J�\���������D�;R��V+�3O�����?ow�3R���-c��N�R(���������1�{���@^�C���.cbP.�#��Y���X��d+�U<_������t&�9�Z�X�,��9�F�)���P��� �K<K2)	�V*��s�I�_���'��Q�%Vn�6��_�������!Q�Y�	�{f�R�	f)&����\�$i�l�Ag(W�#���Z����������%q�o������(�7%)���@���v����@s�'%-$CZ�;�d�����T��� ���9y�s��k')v9r��`\��k���O��(��J!9�k�q�����J�R�U��-Y���9L��
}���~���d����2!���I�8��r]g�(G�^*��@ax�y�kx�q_4�������:����l�I���`�9\���q�J�cQ����L�;t�]VU/��]e
�q8V�Ur����>�g��eP�������HA%����~���~bD�r�;���4��W��XvlF(;XI������E��W�����S�����~r_���]$���n��ff`��j��:�*�2��`�x��j�z�}�$0v�`r�*�@R������\.�d�����������e���g����l���/�m��~��_���tDa��K������R�������k�)ig^ePQp2��H,T�0��mm0�Lt�vU�K�X<����Z��� �.�4�^MO�q]�m�m�u�42^�-��9�H>8ql��������~W������M���P�9'�=cTQ���"�XT�B��h�o��	5=���!�H3�*���`e��To��f[�6%�f�l�8#DkJY(Q0\�M22w�k����{��S�N��Y ���)0�P��!�"g�T� �}<]~5�e��D�R������q�+�(��2J����'!���f~f��<y&������T^]�@vy�q�a%�pn����)@WV�Ies�?���\OO�E��N��<D!S�������v@���EW�{1d.i��$|vy
�|���	�Z���p��+�q������Iw�����i%��"�
������G�,�5N/ V?n�Mwh�uYMr��e��w���bO�,?�����o�x+
��J$����e��l���jG���$���(\��[xK���1�0^��@�`67�N2���I5���
��L
�<�����%"��L�R��������r�>�?ui�JJ��A�t*�)�)8nMF��u��|.s_Y~J�����j�2�X,M6=!���
�0�7�~+x�9�AX:h�,�&z����Zq���<nz��Y��"sa��(&W������Pk�����l�(�����0�7��%~��U�-���������z��)'�Pl��Z��k�j$�2j�_����\�x	E�G����#�Cw��d�>�����$��q���2xp�	$������g�Y���4]�N��Qi�g�g���VW����]�m��m���*C���QP���H�N��&���	�6�����n
m�R�d�@���%�y��]��"�0�j�/�=zvJjz):x���n�4��fW�V#,O���0��[�e;G��R�1��8������a�]�M���i���$),G6�By���F_
,�<�
�"��;�I�w/����k����������>	\t����
��$uTR���PG=�C�����sX�0�}?�|���L]������5����3H����8���RN��2e� �?�����=�Mb����Ci����?d���������GM��3�Rc���J�+S*�q�DG�\�4��� �^�j���0�j��RvM]������T��cR<��(�R��L�(�{4�k
.9��������X��X�k���� �lF�`���V<�D$�]/x���6������D�l��i��^�S�w�8U0� .����B���<��^��\�Tc������4=�<��_��CC�h�(w
��wZ�X6b�`��=�����$��;(q��x��c�k�Vu�)q����\~�)6@�y��2�x!�n�>�r���)��~����W� ��7�%W�c������g$�S+�F���8R����:���Z��r,a"�[�l�ne~���k�+��c���k�\�2��F'kQ��{pg�[�9f�&�1	��#�������N��e�Y������r��E �P$�}��B�?��it?����:B����
t��Hf���s����U�sR��{	C�x���p&4A2y@;��\���"��b��%�YQ��WE�r�tF;��w=M�-B������stFB�]��f���4iI���F��}%r����$rL+�#��{�����rt���%���d\����gp��A��h�$yi]?�s}e�Q�[X��2�zrn������$��I����eH���U���$|1.����h�b��&z���x��[
����]�B�(�qt�Ko(��(�D�;At���3�:�#�������\��j����&iB��p&M�P�d��sq;�|�}��?�X��.�uT�k&�6��C	\\�Q�.��l���o�]�o���L��=�dTKE=���l���J�~�����vu��T7�f_^�1X�tqW)�C���l8�����=����E��)��n�kB�B�����v@��%�q^J��BO���.�E�A�0h�A�u�O#KHt�,a�"9�B�������v4u��i�7"��(uSa��+�;�G���������\�%�6)�\E�Y�^�
A4e+�r9���l0���c��y>�Yt��������*�=!��n��ly������<�l�8��jx�c��@{!����l,'|_�s��N����gs2fk�E�g�f��'���z���J�IWx
�6p��B�?� ��$9������d�dB�KFi���F�	�p�*$T7��+�YcB���r:������;�u�9{��Wp����u�U���k��i�����MnvlDAX�5�0�xl[7�m�<K�=p7^O�M�#��~!�&�)
J���
*��1���:�������v������7�&�>jE{HF����?�8�
���j�"z�:�z���|~H_�
^XQP��ttS�]����m��>�m~u��7i:�
����
nc�����M19���[:(�	�6���C��jl���7_X8MkL�AQ�$[���G"��+um ��o����5�C�R���]xW��@1����^���'P<�MX�Q��������-��g��u�W��Q�+#2�T�����(��h�P�ytpP�+�'�ax`l>#��@@�S�Fe45�*��P��<���0L�_I����B��s��zYm	z��-�e3�v���"l��v�u����Q�BihP��|$H-1a\�r������*��Z�����]�Y8Y6EW$������L3h�n���x$��3`�&)��G�2N6�Q��0�tp���X��_`Qh��Q��i������^���u����}u��mR�i���p�u��m����L�C�C��>�M2��p_����@]�bW�y�������_us|u��+���f������b�m.���k�g�g����z:��������+7���.
�
2����q��a���v�|�*������o#�r{�l!Je�	e�Ui��Z6�<U�Z�&�.pHLI}<3�%�~��\sm�41$����T���O���w�����.7��A{��^c�.8��]Wt�d��g@W]�m�]�r��@?F�Ydwi�Je�uv����g<���|4���T�~�3�����m%w����������=�R2`Dg�K���&����X[�����1���*V}�	w���m�n������}�����/7��Ff�n�����z/���[~�vz�������J�O�){�/ta	VMP:EL���5�����IO&���&Ofts%�d��"4�'aV���H��A��L�-.�������;]���K�����(j�]g�om�J�\�t��:=�@/��M����Mu������������>�V��P�gN�*�-����
�t�12(R@0�"�M�r����$D���������C2���)������^5s�8V���U��\�^->lO�	 ����hoG����`pFs�'������Kq���	/�t�&��E�BzjJ+�.�j��u����
>�H��v��Dhj�p��s��V��]�^���|����1~�\���X9������~�\->'=#��JVht�DM�R;~d%��Po+����0!��!S#��q�L����q����%/O��q�-��h�i�}x�<� ����{�Y� �+6=�.�#Gr�u[T�Rm��|K�
J�F/��H��@2fe�A>�v<6#�O�P��h*����*��Kw8�E�1�����I�����mg���m�;>���S�+��9����u�!d�y�
�r�3���]�f�����X���^8�j�^d'n����+[�l��y�?����`��.QF
�*w���\n{������T"�{���i�.�B���+-r�I�j��f���j�_�	J�F����L5��I�F46�w�m��K�T���]������*��
����4�%�L/x�DS�5P��<_A�r���b�L�����R���.��{���Q�8IM�����q�8�!�������=y��o
�P��Re��v��KI�N�z[�~��[����$�S���C����WF�2������Zww���h���`�6Ao� ���Z��vR>'�!���d.��e����5O��$�4(�mB��,��Wl��������#ZcB�q��c:W3q�n���x�c�_�����q���1�H����`~�1�{�1�����]q��Bc����?O�4��c,�dq�V�
�fL�L6	��s	�I|?-< ����\`7���38�����bC; ��Fdp�ve^~�^�b�_�����Kw!W2)OS��Sv�ub,%w�Y!�@����0�g��7Aic�!�$��0�F��+�	��?���,��v���Q	�������T\�XF��7�N��O:U�2����d@��k% �o���N�*(�S!;�G.�*ON��Q��40�=�_���� ��<s�������Su���y���[��J������ M���"�1z��K'���k���������^��Z4�C���z���V�-a���82�Lq1%��x�Jg�X�B�jo�������tWj[>��}}���j��f����WC�����uu��S?�������_��������1��	���E�p
$�g����4�w�;��������f����Bj��2�+O���T���Z��MB]����'s�MA
VPoD��@�(BPA�$�:���X��[N�d��(�n�E�O�`X�r���e�Oiw����=������]�.i��t@ygf��*%b��<�dqd�_NWd��a��uWztx�N�A����������'�E�tJ��t��a�d�:5�P�N������<�,6dx8NZ�D�p�|��)���z_7m���)�����.A7��1"�*<��+�����+�-�@������:h�X�,�#�g�E��X�O����J������I�HrS�����]���5���'�W���NWi�y���AE�H��|��v�m�(
��|"76@���q�{�*�X6�.��B����v�8�P�UP��\�������m���!����ec6��P����m����iC��%C��@���H��ZI�K(�r�tr�9
zs
?���|������\�14���������\O�����#����Aa��r�(�*��AHx�qb�I���gkWD��p@��������<�6A~C��l3J:��p����L6�gUy��
��G]T4��qsJ�������R��|�������_W�kX������0���koP����qIW6����3�G���e��-�,8*��HW���?���������n��~q��t�7`���{1����?���#@%L-5�y<�<NY��vsB�a+�|��+)�W����zWwi(I��_U�K	���6��'���z�
�1&�'>!E#e�x�1"�|�R7X��(D Y�Re�����V�6?<��:�_z���
��G���q,A�����&����^�O-����}*�`��P���6�8_mwY6���������D�k���[ri	���-������#�
��W���xL��������Y�k�A��)J��p�v�.��.�n�{.��9\��k��l�W3	$��FJ�
�d���;��4�,��O���]��������������Kh{��&
6�����a{xL�W��40��
������q���}.�	�Z���n���^|&�p�BJ?���l��Q�.��r�U�5?l�}uJ�3�:`��3�q�+������`������=��@�&��^��$D�hA,�O�bi6�|��2_�slz�{H�j�C�s��/�B��c���I���C��3h��������K.��^'A�_���2m#��8+r����e0��d5����|F8��f����b��������p�����)&�>��u���	J�2���	����=����������	/��A���tl���n�;���""�"qG�e�
q}����2H�������SW�zl����_�&�+�[���qr1.��~]=����O�	3*�F{�*���:�=�1���\mc^�������W��y�Y���l��}�>��B�i4���i�39����0�uT5��!y�.
]�����<�0�����������������A�W*�q�O�M��8n�0�Z����'�I�2:Hc���~���e����d"�i�����9Z:��i��x_5��!X��Q���0^_G?�Lj����3�_���)0A��d��re�.���0KY9Xj}$I����^Q[��A����O~~WQ>eJ�1c%���{+CJY��'����O�s��3uR-���'p��V�{'AB/(�9��?+��n{����������vI^��~��2cS��;��r�l�P�����VXk�������|N	�H.Y��b@�*$�;�	R��A������T�������30LR��/�1J���QG&3���?��E�v�������a�WF	����%�eO������h.�����'����S"M@o���R��S�'�z�y/f[we[:�����
6�����V���gpKP]�w�i.�0uJ��W����N���b�y?�����x��H�7����'c
6)�T���������5�{9Q~���>�N0��~����@ ���
p��,!j�S���������w�V�3}�,�$���mo
��V�V���X�0������OB|��z�WI:b�� �e�j]��]�R����<~L �7��UY�����Ys��q�+���q��vM}*e�~��x9�B����^@)����L"��z��J	;3����2����n(���p��38��6&���R�30J�o)P1�R��\���UW���o�^�z������tz�����G;�{H��Q�r�G��1���F^y>�S�x�7e���t^z��K
oX�-��C
*��J����]�#�����������5tE��wa���(5�G���8�U��y!���3�,����\fFX��\�>����\���T��O �h�y�zJ�u���	�Z�dP��z���6�G�[���������9B
��P�8���ye���y�����{����R^�Ky�Y�!��#�r7]a�)�\�H�����.�\f5��jc���8<��+!r�~��#Xa�"/����&P�8��E�`	{���.��/��V�H
���g�@Bb�^�J�{.���mP��f�>���"�.-z����k�hhZ����5��o���!�I/(
���r�����S9�NF��.�+���p��eF��=��k��~������[���1���p��]P"�,&WB�R��n{�.��j�r_���{�B�r�������^��b�t�,�u�������������>l��_��|�'9|���c�
�nI�5�t�rIR�'L8����tFIj/��u	,�k$4���0Py�m��gu}l�������*����]�D��yS0nV4��m.������[@
G,7����vQ�����j�m|�v��j@� R��1.GE�aA���beU.�h��O������O
O��	)�V�������5��^s����� �:��7;�����O%�����q%��}]��JM��s���M��Aq��.,�e6��V���u{���EH��<��%����f8Y�}W���DW��^��6M�]�+ec���*��B���\�{/�`#��e�p/(E��`Kg�n��^��N��1}����j�4��v�:qVO���HUq�!&z�/���<�v�$�I�>o��|Lby�r��)��w�E>�.�W�E��c���N��rE��9�f���.ij��"��N�M���y[q�KN�?6.�?]zJ�"I3(8�+x�b��@R�b���m��J���>��Cs���'7�������f"�|���[ S��$[>n�=��-����uK�O�:6o�t��u�����?�����	$wK��g��������],��;#���	�=&�NuA�I�Q�(6�e���Q�u(�3����\�I�i/�DW���|���
j�!�������:u��6{���1>�{h�����3������6^w&����A�o,���ME�ec�0[~�s�^�>_�@|������h
c7o�0�b��D�I�����T���|����[HG���)����������M���Ow�_�#H9'-1�ds��O���sL4
���t%������m���qJ�N7��t�R w��J�R��7�Z�#�Ar%	$
�fmL0X~���,n�
������6!����'�������q�=%)�k��������x����F����xn��?>���������C�I�<X1E�,�A���7�z���o��u��l]�m-X���� Ye�xEq(O�R��F��>��p�qE}	���r����<o=/���U���Q\��u�$U4�|	?��3�h���\�Z"��
>�+������d����d��z~���A�������O)������m�������9��M�tu�V2�f�:�NN'�0�
H+��������{n32;�*�rE��(��P��8�T�������d.���
^	���w����@�
������|n���aE�b$�pjr��0#L��/,dB^8�b _R��������)��c�d�M	�]�O������.�#�h�Z��G�i�`����	�X���=�$��sB~�����j�,��a�p�����v�=)$n�=�g��|<~<�\ 'Y���N�D[(���l8���+&s��=��J����������P#�:r����]�7�]�L, ���,wE%�VXi04�&��T�I���l���L+���b>���r���A?��zG�d`�o+�i��@�\��4�o�����L��6�����G�������4��p�����u}z�MK1Br�Bad�F��N�8������d�Bn��o�������b|@o��@n� �
�����g���k���J��9���+�k��l���Z��4�Cj��{���4^IT(������XX�/�H�}������ZrX� ��,$e�� �J�\�i�oH�����
��n����
ox�n�|Ap�d|�^�lR����������6�`qDyO���@� �����/?�������;�D�������|J[PX��b�>������N$����6X(0Rf<.�!`��c�i�2��d�Z}�\�4�#�$T��M��.<����En�\����}�F�r}�K��L���;�
����l�k�;v����������I�cw�T ���;g����Rf7���L5^m6w]���M�I��	����1�@�x�8�$lHi.�|�P�	���|��%�Sp��Xm�J�\
'��@e����=�e�OIYI��X���+�%���*�K{�����������9���[��a�����������L�po'�Cy��"r���]
�~	��)�r�*~T}�8Z����8�uE��\>|�q���/[�m�	���Y���E�����p^��8_���pvQ~��]���>=i�	�C��S��93cr}Ez:����N��h�U�+7����K���r��C?������P-��I�)B�8M��h<�qB:)4C��� �f�1%���6�0r�w�8���)NZW��Ge���
	c��!_PY0TU	�~��X�#�-��q���n���4��v ��l$Jv��3�GJ'�����-����'�����^>J���%��?���j��`R������4��|k�������{�������v��.�x��H�mp��"��^����d�~�N��K@mwl��q,���������`�>��	<IX�Q���A3\��$4h���^�o�>�?uu�+8.\"�u�EoB�#��O2��5�[aP|u]~�����)��"���6��Q��ch��(��w��?����d�+�v2���V�fpHmI��SC��c������	�W�|���P��3�b4�>EcD4�����+���h.�G��'�]�I}2O�������r$M��v�h�kjf��qIC�L�����z��Z�	�m������(�F"E+$UH��5�m����	�gzF���`���Z��u�b6��7_��k`��6�5�|K��Ho�0�.Mk�JLM�
��/���2h/�5�4������%�t��G32����]�pZ
p�����b�C�D6�>��	�ar@"d�T��5�G����p��,Y$���J58Y.��kD���4�K9:�����U`�Q�����F�����]k����������H�3�Me��\Lq�f.}{����C��4����k�wiSLe�7�U�� =������r��q��E��!��rp�����y6Y���!o'��)W���o��W��U/�#�d,��e4����f�����*�~8>^�s�(��
R{9��AF������h4&���!����f�j��Ms���O&���=���w9�:����%���G4�{ N�I,e��n���:�h�(�w�b�\����,{��>��m@�V���}6u��@����U������n9����uw�q�����]�US���s��p	(807�*e0<���;���~b�����IZ(6�q�b����u4�rF����3���������������a��8@i�t�;s��`���2�K�����u����g9���NUPwfqc7�I�s�3����3��/�Q�\ {L�����������r����u�n�w��,SyX�=���T�S��������C��l������� p�o�������59�t?�{zf@��	C����.��pW��=�c.?��w�uwu�M1��������C�
��_$��r�	cu�K�d� �=
G����%��v=Q�4AT�">�c	g�pV�y���&k�8�|�]�c���J}:��mS��;�30��`��h�IT�Tb���u��!��p�,���Z���_�S�`A��{�e6��u:�Xz�w�7�k3�A*-����'q/��A����m��K�lS ����j<J���H��p_�g�c����}��UW7�j�;�%<���@=$�d
��#�[�w����9��iM�|\U�xVE���$����J\?i�z�s��	C8��'�&���i��>�5�����}m��u��Y�D��c	_�B��,MU./����v��:�����������Z���`�i��<���Y]�\T����^,.�i�I�������X;���vT�s�AR��r�T��e(�X.�����@�H�R����Z�F3�� �q�si��Y	h�W��[����DEsM/H�����	�$���V��	���^G����]��SzJ%X�9E��u���8#�JgI��@�i.Y�M`�*HNCft:�DH,}�Cw���y�^�-Cy���.��FyEv \���
�,"�Y�����\�A�H;o~�;������|�7�a��e�;���|-d�i�.\����8����\�ge�����vMw����49T��q��3b�"F*9��}���b���^���u�c���i�v)���a�����8������c�aaM�TS�N�O��B�1�%��	�E�����WfR���Y���@�u{�_�R��w����;��9
�������3;���Y��F`�f�2AdF�
<���AJ��"8�Y���m���T�����V�2Og�`3�t�;8�>�M��$�kw(8]3t>�I�tmC"�2.�N���]�����z�OP"L"�z>�J6���N�(�~���:b���p��	�?���-(��d��=�K��@u~����IR���aE���:8OnddSS���o����G�-�\�hZ��d������)�#T��n#to��[���Y6����-O=�_�$����rl������X|�T��^��������tU��?%i�}��o�)��.����h�A������h��6]Z
���������=�u� �A��+]�c��H7��A0]m�+�5(���k�q���2��>b�����i0�R����8�V���%��{j�h�G?���)0�C������Of�8~��s�GP�V���3&���P��#*���Z�:x������-��	h��c��^9K�7��l6B����z�������w8Z1�':#/�&*#���E����}��������z�	@��m�z����'���
�����T6rF��������k�X}D\���b!)�����_g�F�����3+^�C��V��?����c����=���o���]����&p2g��%�u�����.��Waz�w/�EqF��H$S��d6��Y�z�y[��T�m���OID��#z��������uK������/��o.��3 �&}P,��[������D6�E�C'�i�z�HZ_m�1�|.��Q����D�
���H�����G^X��3)�!���������i����W��(��B�i�4O*o2]�F�lj7�)_s�#���w������4���C����w��8\�v��UW3�{��L�i\`p���|������H�4^)qXi`������|������(EQ`W�6����+��o�'M�k���_�����3��bx<G�'�#��G��`^I.�g�m�R�hn�C<H'x�dFT�[:���c3�a'��F��N�����)�A�{	�<!(�C�=��MH��_y�R���
9���.�_�9�^|��C��w����W?�\*QV.!�>?4��N�y��O�(����)������r�t���swwx8���f�Se�7ab�81]|�.j-����g�����h���]�<�&%)9(&�0zr�4�V��\�p�zP���}Pa����K�{db4��l���K�5u/���\j�H6'��2���%n�i��q������w���[]��_�t��{E��<�������9I��2�I}W=���$T�
�b�������%*?W�`��'(������H���	@]g�_�I:��bt�������L>������Y�{�W����Ep"�U�T��/(J����J��S��k�M�<
N�8���Q@c��'����A�}������/�]���^^S%�z2
�c���4�&R��������9���[�M��3	�$-�S�rU,��l7��(i���4�	1I!��d=T�>�^�k$3p����w�N
��
��K����3'B���o&�_�uw�~�w���l����i�3$A0��Q��K;"����{}n����p����n��_U^rb��[D�!�\io��F�}�v?��6����p��bI��>Y�E�C��W�sz�\����%r@A
�d(
T��@�H�:U�M��h��i�m�l*��j���7�� �������oi�%�4H���xA-AR)�~�����J(P,@������B}�1����}5I����Q�
�x�.�b�#&�U4h�D��~(��A��"p���y���K�E����XL��j7a�3XLx�����;!!��2,u8r�c��Ed�0&� r����y��5��R]���g�u�����������
����T����k�A��fn���89)2�\2��}��SS�W������A�'o oF��#�~��:]|4Ah�������'[
6�\�~���u�,r�>H����(���
�\����K�,������ �2��;@��(,hz�c�x6���s���j${��D��q��

 �p�M3��������EQ*����G�;����]t����EI��,K���Z�����*���A��CxY�1"{�1���2�tY����$�W]W�O�e�@�V����N&8����@z)U��~|"��C��1��h��B��e>��*����.�(X�'��e�gV2S�
�{_T���S��M(l�3�I�U��*��U����\z���g��9g��Y�	0m��NHI=�uX
]���s\�'�*��_��� �.���ZR��6�����VZy����0j`�k��v5����M�YVc�CW�������x
��#�����Z!U�(Q^#6�� @����C�dsnv�:�39��_.����!��d~7N�3�!Rhl���yo���?����u��OwI��0p7�o��B���J��_q�����{*]e�q��E����?����A)���=g��{OH"�=�����7�=�U][�6Bk��'UI
C�!�H�&0�����d��|C��L��6��Bb������2��
�?>��5�=H;�W�_�Ok���� �p�b1����_����2���p$�'|��|�U����@$4��K��]RB��F��~�<{~�n��p�	.)*xFz�b$W��rAg��������n���NY}���F(MWw"7��Y�At��?Vm��j_��j�7���iq��Po��3�������?���������V���U�n����Z6��c!��� ~���0,����+oXm��C[����Q �Y�������1m5 b�s.W��!O����Q�������9��C��:(������P h�zx�=�%������*�G�kSo��K'?o�"�n�$5�6DT��&���lf����U��*�����TZ��2���l���%	�8g�L�;k7�>����OC�l2������Q�L�r�&$�������xt��b![!iY`y���MP�F2��k�I6{nG�����Fc���������f���(���~]�`l��)���c���Q����^b7�0�vm]Lu����[t�������
��jf1t������`��\�;�����<��H2(v�27���B2,�����z�y�w*���������u9t:[Y2f���u&��F��R�GHD��a����O�����cw�<��T���J��228�������>��(\��8f��iy,���!�>���@�	�����\O����f.,���Sb2R$��v����i{L�|�d�����bN(�(�W��+{	b���b��f���l�Z��r��JC����Y|7)�s�s��Rdd��r���Ig�&���G����d\����/xS'�:�
���mP1��J��/3�<����{|�Eu��
'0�(�:`u����4R�
2��_��B�~��!
L�������3�	d5��0��2������a����p�����vY���0���������������Ire��8C��,C����k|���i�����x$$��������<���������Q(�
I���|3A�nD��/�U��gw���4Q����x�9����#E�����3b��}�������1:�h�:`���8�"^d�
�u��^���e�������b�<��j|B�~�I��Pb�d:������0N�f
0���v]>�OI����<�v��G>���I�
����;�Z5���w�7iu�F\�~G�����s�M����S3lv$O�i����x4����G����e�I ]QW�w����H@W|{�Cg�GI<�je�l>���}��5�}J0U��K���z�������������W���J;jr,C+t�oB��Iw&�>�J������g��U\�&�����P�|����rg�+*5�~�������]�����i�4��w��
�v��C@�e��Hx�.����S(����J��v��Fu�����I^p�N{�.�������r���,pDZqsinF4���� �h�Gu����s���W�xS��of���"b��������M��+�==���tl\���+�aM�,�����)e�%E��*����g���)������b 4����,$����5"��^�5Y#��,���F
��_��Omy���@����N}J])����Xu�,>����%��'P�B�(��[�g�6H��y����`.�K���y���[�3�8
���O��K�8���gx+�8����a�d���B8R3g�:6��s/*�`�P�`�+>Bd�N>��� ��H)5��}���i��c���T6���X	A�'y'v4�,�2+�7Is�;~,��[w)�D����D��j�IC���G����T~���?���2ZBZ�2!���{>��R�����a9�Gg��;}+�t��~��{Y��`E�w�T��H�D�C�+o��;nO��P����k�V�K�;���.%��4��^���c��=l���;��v�J��r�k����I����7�u�\�gfT�k�V:�P��X?����������'q����
i�<�j�\�Q�lEU.���U�]��|�j�[�(Iwj��#H��8}\D��EV�7��t�>l��W� ��~F�D&,��%W�S��c6��:�6����v]�^�j����@�_��I���e�/1�i���4���.���oE�����7�=���'@�S(/]�a�QB��\!��gSo����c
TKJ�U�G��8T���C
d�%���]��a�|N����/u ������FB��{?=�y�����N :_3M2�������P[|��_~�������r�%)��j0���gq��-8�r�UCF��>�V����}�x���mb���c{�mKS��[\,)��5�7�{|������TV���r�t���/V 
���XJ���F%g2Ws�U���r�&�6�@��<����]�(����Y��	w�t�y������7�C���n���'�r�����JR��n���2�b�F���	�^J=�h��o��+
)�|�L��/����0��4�bt��
"��9�z��q�v%Q���n{���r����G(�p�
�A2�K��a\��.����5��*���	Ed}�>�ye1�|{w`��]��jwNZ�Y�%����_�kq��A{u������= D�����|�aS&��� �����e{����:W�v�vmJ<%S~�8!8�n7�}�������V�����*�3���3�~��)_-������}�Pw���tK���K~+��8��,��\Q��[���jl�2
���R)�e+!r�d5�s��P�?�kW�_E)�\�7;��`^pUpT���X�l��I�����|�v������f�4���h`�-�\����\����`����6�:����>lj����p����T����3�$�q�D=������\�������c�����@"���8�����#q���\�$r�_�����\���j���K+D���I�,b���9)hWZd��oA������*\����+�Q�4�S�S����f��BN�G�y9��JJ������<Ejd�4p\�������c�����
/=8A Da���2��u����0W��O����
��O����c[P3�e����cW�����S��l� �RF��P(���1��*l���t���@{Z��F����W��+\����T=x;5�v��U��@W�-�;�vB_���o'������o�A�	����:��I���~1�$���jM�T������mz+���8�&���	���/:u��3Z�O�a+�r��o��2�����+%a�z��4�+
3$��)�!q$�~�v�����hu"�q��bqG������l������G���A{��h;u�5gFs��m��x@�H��P�B���v�#��+���7_7uo�N'Y������a?�j�

��R���&r�q_�X�~���fS����<)@��'��)�8���������%���=�>�k�C[����K	^�u�7�Y���w��=O�����e;���r?��;7I�%���~�>���1X{
c
Ms�y\yL�}	��lf����Fv<�!�
`�?]�/���2~4�{�����2�"FVS�[���<�B����U�i_�<�"y�/�����>���6�Vr����jz�P>���\E�<�8R�7��z���\?�������|��i�*A�:Y:�K�������2h���|�6mw�-S���h?����d�q"1�U����w�|��w�m��;�/�g	.��x��������P(,*�?�����������U�1el�;qb
�~�W.��l���o��������������Nr�RJ�q�� ��J��t��?^��#������r��b���ng&��b2d�u�[�����:�-�q�
���m�R+�r�tHR��f0��dG_M
��t���T6�N<�ej��!�H��q�<�<X��`����~�3I�Q���D8�Vi���� u���|����\���&)w&qM#3��"�4�,�&X��X���!'@�������Z=��T;�N�:i����LN�`I,��J����e��{u7���~M`�30���5�u���we>��<��k0�||nz�v�Z���|�UO;�����M�U%�B��d��z�%�_Mf��Vu�!��_o�^6y1%	����(u����T���C�A�~[�*�_�X��z_:W�M}X������
=c��3�H*,EyM���d#N;<�S�P����p%h�[IT.)�+��/<�$��6��G�HZ�C_�#�2�q��K�A��	)��KY���!����5�<5�C]�T�&�RZGh��0Qz�"\Q�39�����'A>�����;;�H���
����@�ZPd�F�m���0#D��7!���;jA���Z!����B��>�j@����@�1L�O���:�.'�����c��4]�h��N^��CIl�������O��<�n��C0�d��2�
P,n�L�e33x0i.Ps���L����t<����\�,��u__���Bu,���mpu <2�|�Ax� {���-_�w5�R���?���o�e�vr���&i����Q#c���HSzQ�9N��T�P����
>*.����
�j\Tt��!gx���9���t6��%���J�3�U��^���(��C�m?����o�8���p���j����R�@��Z����-�����1i�k���Mg<�'������x�6�`��`���5OI������.�)&�h�A�R&�c��;�Q�
��4��Z����M���P��Su����4e��s9%�D��}<�y�w��g��u���������n�x(���m��P��
�S0 �Z�������j�c���]
���Q���h��K�j8x��.��\��
�����=U��L���.��a]���L�r8��^#��=,pQ�u����������e;��F9<��aq|V��$����������a�A��O�`hb���j2H�-�y{��r����(�X���Z=��C��:m�����?_�{��7C�I#5�������Ut6ef`�1��Kw2E`�1��L]���PJJ<�,�L��n9 F�����+��7���o��������t�I�0���x�	i\Hd,Q������sz��j\����>`�~����o{O/EN���������C�m�u3V���, |p�	�9�Y���?c��>7M}�&�4�����k52&�G�RU��S��=l��o0�{��	g
�k��I����X4�KzQ�����aCRI]�3^�U��d6�4�c�>����'t��^�hcz��n����+��:z�]��C���
H��X%8�8GF{$AF*���)�G �6p cQ�������d�w��������o;��,��d23E\u�QSL�|�)W;��C�����#%���D.M����y����#u���"DNQBHR2
���%m���kdp���zz������nS�����t�dH39C�w���`���3P)��*i4d��j:r���@�J;H�f����|�N|��MUc�
F�����{���om�/��g+�1�L�/�,n"�5>zX�^����@�aw�:����!�H���c
\
�Z���^�2�l3���9������H���'9���3�.���#��i������������nb�]�����&��)�
��l�DW����:�����d�|�u���"O��.l��N�'����7:$%����m&�����h�:���E����E�$]"��J���"�����E(���m��v�X���E�%�\��c��4!�����M�>h��?%:��=!}���)�f�
�%t2 ���n������a/
Hm���P�������9�����B���i{x���������Dv�7#�������\��2Q7�`�}*n�}W"iz�!�/�#������a��?���o�n���/�����&	��n��6�Zf��q���u�J�!6�+�)0c����nG�r�l_>0���2���p���S�UW��R+�
�����b"���d�1�+n,�4�����Z�0�����z�a��Q��@���-�����K��bh]��:����?k2YR�R�Y������,q�EF��5�	�'�����f�+}���da7u38KfN��]k$M
�!|W�|��XOt6)m��<�����re�`7��d����x����Ttg
Q �<���e��~\�;����M�IJ�[����%W	���%������@�_us|u�3L2�2~4s����{@�@���"2�9�3u�W)�;�N$�ap�(�+��������n@�u��j&�G�z�/og�3%�$��A���� ��'%�2zR���n�x���B�� Zi�h�����������V���KD��SB�98L�13=�B�:*���aW�w5b��Oa�O���$w���5H���!���#���!������o�>O��r��Kq�M�d���T�����C�$�$#�*�g��v������U4hMu�+��*,RF32t�����70�H�~��<���O W��i��)�A�����z:_O��J��7#�p�:nS	2yW�3.���:����T����O�oF
A\U�����{���G�??�m�to���q�O�&��z ��"i�b��u��(����\b�C�a#	���|�.3�+�=��CB���.`Q�p��%6�K]	��x��X���5��v\���t�f�F"���@//3��?����p�\5����W�
��SDa~�b�q@�@��{Q�Z���-���(1k�y*���o����#1kH���"s`�DRK���t��0Wt%�������zE�wU�m�u����-a�.n��h����gw��M�7�e�>���$��]��E
&���-�`��� n�����.��)$a~�%J��0:��pU��5��Nm�hiFy�Me��D]!	?H��g_��^��@����;P�Z/m���d�"R�t��O�������EFm�;��75y��&�z���m/FW����2A���.�ix0e��\~+��+��<1��LD�b����`H�����3}v���+I�������q%��Lw2)j��������;�WX������4��,0�����py7��2�A���)��B��(9Z�s�\E*�e����%�K��f������C�%'�_�����R��;'.�*���yj��$k�_�d�>�*N����E�������u������7e/���Q�c�$�!�H�<��J�\�W�0tX{}�Tj�<����.(3x-����(��P�r�tNP�V
��� `B<��*�cg�8G��� w��=#�:�[��O��:��K�[��GP���5/��E�A1�<+�`P� `���O��I��H�[���sj��]� �Y��*����
��:�b��J�x)�(?��Ng�	�M|�3jg( ��pX����Z~��/>���-Hy���}�AF�Fzr)����.?����������;~N�AS:���T�@��e�^�l�F�d���mW7��������
g��o���
�dr���`�{Ed���K��1)���<M���F��0iV:m���#8�<���� �c�x%�D
�@_�y.����`J�8(LQG3W:���U>�,@��0K��2;Y1�fh��3������j	
L;Z���
��Zx.���'�������+�2�������]���<�_����sWvO���G�E���I��@�~�����r�=t�
��K�1"��5�?N9E���'�/������hK���h��N�@�F�n���g@���yn4q�,��U�`"��7�8[���!�F����n\Rp������/��E_#�:����"E�>N
��*c�q����;>n����OeS��������]!�|�4��q|�����9���W�*�~��M����s�@�@�JM�W���=���dR��]1�4	R��b���Y|&��{.�����Pb& �d��P����^
g�b�3E�:~�
K��� yku!Q�#+��d�H�%���!�p����G���&���ot^]��| 
��q�!��9�J�T6��7��*��LJ'�7��C��/`<>
fU
�����:f����=�P�'=D���������dp_�����N�/�����Q�G��(^���y
�������=R��]��xxp?����Ky0�5����Y�3BW���,��/�}�|,����.7��K�~`�T����[�n����-�;��{[�{a�3����FQ��z!p�%1^�-��7_��E���xz�8�����g���F��q����3 ]Rb
9&�L���C:��w���"N�=A,b�����q�.
�j\�4��3@~hzt�{Q��i��$L�����
���c��������T���������M�~����~���C<�5��3�����������������wPd&�I�)�)���X��	�s�D�(v��I�`g�J������G����r@��(�b������A"���9H�Y�����2�����3g��&���"h@Q���33���?,~6I��+��D�h+
�*��Kn��1}�3�*u�:�!�u����w1j����������x/����OwIt)C�5���=w�3��7���5��Y����+c����b�*R�A�$�8��cU�m��y���cd�:��T
k3cV��2-�	�}�$��nH�����B"%�)Hw6R�q{�Z��>g�T�4�B"�\��"����GOh�D �4J���-@e���%����oj��	r�"���}�������B�	���/'x\X���A���l��r��x�w����W?�\*QV�I���CS��$$�����F�H;.�2*��T�m
L��dQf�|#���y6*���P��S�L���`j���qhK��?qi&�3��-�&PQ�[D@�3���)WP�l���q/j��~Hq�������"w�d=,8��~M�]W�O�e
�L��N�BQ��{y���
Lu.s����;���%���#[�#��N���J�^Q����{qe0��m�^M=M�^q4�����8�opu�$�y�M[�=�_����0�EjE�v=�^*���� n���V
���,KZ4y�x��j������J���;�D��,
�T�)�%o��cW�����S�n�f{xL�����5VYaN,�R�V��R.�n�N��l�e�w�H3���'4�N`�k���Y�{Z��������W��B�������z�amI`;����o^P�Ju����[W����E����N�q}�+Jy<=�a�@
�<��F�l�!��C-�����\����s��Lu��F/���h���:.��������La����:�h����W�,*������<�d(�_�����I&
���@���Pu\��j�
��,����+�T���A�3��]�'��S����:��V�.���6�
�>�k��;�
���4��5�����0j�a���N�����I��rL��Lh/9#���R��v_�w��2�����9��$NC�_��$KW��7s�.�/�������J KI?�=���Ea9R�(�WBF�e�����vMw���m��7�8u��}���?j�Y�=[]�A�����Bn��{Y��������-a����C�d"m�UN��eyr���?��������lH3���n&���g������[bXY&�
C�n,��})r����w���/	�`e�)����������t6�x�3]C��iR����!q���q������5�m\i�?�Q����P�dw,5������	�
�]f�T@���yPH4��L�!�Z�8|�l��u����]���������(t��j^��3�~b����
���`�fI����9�D���X�<A��{�6*�!K�'�<@�.<���TV'W�73|�� ;���x�s��70��
L��[0��:�zK�sy>���P�P���`���Ur�C9L:wu��c�0T�L���B�k�q)���l8�W��$������n����R���g�x���O�|���
��b���
��u��K#������- E�Py�v��\wS�� M��G��w���m��X���y�����D"�!<Cn�������!<�������uu��������i�#��3��xC]\B�	�*8���lM6&e{����]�<E���0o`������E08bH�]�QvE{��������c��H�L�g��$�`�l6�(�/�2��j6@����|�.�]��F��LWL�X���d��yW������"������^6�����6GVG �;Yd�X�<�SI*�x���-4CvDJZ�K��v�m<�b�'���`��c������B��)��>�q��MKw�5�n]u���gPdf�O�r�gE���<c�L=��������������)��!����<!�7e����� ���O���ZR45��l��GTy���F���;a�����Mc�F�N���u�@6T�6��~���9��Lm��qUtb	@���������X&O��Z����k�-G��@���\F����,(����pm
R�D,9��'��0������S�[tMA��$��rp������������������D�TAG�XF��pUp$��Gu������S�����K 2@:T2�vLJ���u�3�hD�m����e~���,(3�
�`=���\>�/���I�t����bA��X.�������S������N/��g ������ �+sy>���d��A�y2V!.�<��2�K
���B�X6P�o�+�������L�!+y��*����n�����ZW�v�8����i+��K�N������ND�RTgEI�!#������Y�>U��KVI:��zo��xO�����vr4G��n�����N�����|n���`�A�������"������H�"��"
h:q`�:��I].�
(�?����6��g���@�R�:����`��U�U��rsf��t�c��|E�'�?T����8���X�)��~�4��Z]�H��\����Y��F"�?v��>Me�����+�i��s�dV2������>�P�Gr�h����(@��z�� ���G��n?<�������t�\���������I�n�0]0��'%.��fS�W�uy{Y���.`MP�,�,8��T�b�%��rv��%pbs.5�Pw��ai��	��� ���jw������{w�;r�?��%��
A�/��u��
CX������_���������'�w�f����Gj����>��)���0ou�AWx����]	.K	�B�y@�+�,t�%p(q�:W���sxRX�����v�	CbF`����MJ���Q��,��3��,���wL���&N�CZ�b�@�{���I��=Y�������r��C�A�s�Z�z�\J���P\��HUb�Az���_����A��X���dx}9oA,.�����8`�4nSI)�~U����9laRt������L�����������P�QM�e+ns9���o����MY��V��y�����c�6����/;���@J�
��tD�������*%�p�l��^��W���T�q��� ���7T��B�(��7��,h��Faq3>k�-�����/������r:�����d�B�Xl��_�9� Jq�,�����rOw��=�bR�p>�]��YT��n%���O��/�����s���nfn��R��
��F�
!h��!b\�m��'��$����M�]��	�4q)�S?���`val���~(?o���,?]��	�"�������2���*��fs8���~N$�xk�H�����rb}ov$�`�u<-H��\R�d�����:����ZL2����=��-��R�����F�����������Po���O
*0M�
19�+����h����'L6���$X�6x���
���&'���7���RKU*$���$�����x�(��R� �Q0	Cu���0��a�k��D���u2��8��H�!����W,�!�6A�\�J,�Z� �?
*r�S"�	mF���(8�X�
r'�������� w�m:,�:��x8�D%7:����
�LE4������s�t�99�68�`* �Hg�x�!�����\*���I� �z����3�8u���s��}��|�������2t$
	!�i�9�#�f�p�?� ��������My��������Ty}a3*�g���UC��)�"�2�,�L��yV�Q$P�Y)�2n$l�z��Tf�6�M�zWWi^t�x(�$�����(���lD�����M�b��Wma���t�j6��#�����A!=����v�9�H�w��~���6B�9����GI�S`����Q������������}���)� v:���s`MF%,.�;���W}��~_T�_��h�G"R�n��g>yRu�O�I��k����Z��k�|��q���]y�_��ke��aU��u%�p�{�(�����:�&�����CQ
�/}ggAPP[K{�l�@O���M�PAhft'
��Sw!���`3�����]?t�fb
 T����W�9��b+"s������n����q�+;e-D���QQQ�O�� ������c�6H����<k�R����L.O�W:��l����N
��H�=p�����������}�m���i\F��b��:�=��4|3����P.�j�+�5���������.)�A��q�����T������`�\u4;t�eu9��y�I"�aN7-:��S(�^CH�ec2(9��I��_��_E�q"���X6�^{��a������(��=X�$���L��8+(P'�X1hi~g�y�Z�ARw�/e�����i��L�O���a�)����f������h/7�"^6�bF?Sr���:	��*�����Z�T�-xQ�?��m�{{���SRl��v�l����(��QO�������+b|�AO�y	W-i���re�)��_��u��5a��=l�E�������f7?Z��Ia�z\G�.����U���E*=���h��L������vO|�{� �)�Ig3�������H�dg}e����%�V�>]���<(�C�<�tU��Hd,%��AZ����D`]	�{���$�	ND23�sY�I(@Q�3� ���/^��'�
�TK���I����E�\�B���*"��N���X��
�D�`���p������z�A�X���	H���?�&q��4�'�h���&�q�&s��7��c���^���T��i�"�;���9����������$�2����5KE'X�a�UF>!���q�{]�v%�p@r���p��P7�8_NmJ���	�g�m:/y��CK��%��������4e?L�J��~�=��f[9�Q�2=����t�7���l ��5g������������he�KK���`��R)>%�T��A�e��-��y����f��N�3sM��K��i?����T�~I��P$p������%jy�AnI�r���0�hL2��L��H|������s=�)�_EBC6��+���\)�Ki���v�����A��adj�����W��k(����f�E��{�y��j����*�c����'5��[��~+]5�o�;�E��������t6���:��0����tO��9��]-��v�y}X��m���2 �1�S��Rqxb����3�����?�IM�xl'ia4�k��d�����wo�g��$s:%B�1nN�I)V�Y	w �!T���l]�s��?��G"����F�����l���@	���QFhA���tOt%/��/����-�aO�N���������UJ� ym��e*��r�np�<�X��5M�}��O���3KC��F����Fi�<@-�n�(1HW,?c}������HiYA�A6O�<�5����le:��y_J]�^z���1\�������"�
�f��������:ln6����*�J�tv�>��T�������!i���L�R.��\v���������)�4I�m2�G�����k�\Y�r���B��>%6�l/�;?�6e�l�0,���m���.�R��~�6U��vL��Z��b���� ��a2�q���E�s���L��e��Aq�)V�<�Y�
�����g��.CN���E����,:[P���f�����U�\�*��'�:����X_o�A��������u��S�t�y*���<C���n����������y�4��,��e����fgF��U����H����v��i�N�B"�4
���"C�
"��z����{���^��J����Z�}H�0��J�\�t@%~�&DH�9�5��d��&�Z�i���}<��m�%�$�
��33 W�3��%��'��_x4��2RX�E��tF�8��uMf6��F�b��G��$��PP�5`X�RP�W,�;��n.�=+�gfl�g{)]�c��\q�������.z�u�S���_���$@�p��c���b���G������G��:�������H��'���[����X����l�=]�-�I*�
t�\4q��'ki����D(G�X�W�oeep��0�$�>�&����L`��0���_ �@����A�xb(�v��R7"�z���PY~�k:����t�|�N�,�7��ij����m>�j����y[��9*	�����,0�������?�i�7���3�Dr
Zy�����T��vss8]���l��3Pl'�3�����=��O�hazY+��6�1!�*�A8a,�L��N�����_�lH�^8�����rJ��:������C�m�Z�#]��������$��24U�?����g� ]+�
��w�/�wI7�DQ����2�!5�S���m���&m!p��-��wB#d�M��r8�^=�l�i�K�E�l����_�v�g����K��:�.��a�!��	a���"'u��6M��k��62�2�+����i�
�C3G�:s���x�l�U������%��h<��i%�`��x������>[O����N���4�i3���/��N��"�3��[��	%������=���V2�?�~�9���k����$`�i6L7��*o�U�+��>l@�O6z�����4��������L������!�@d�����M��a�N������gP����Ms*?pS�w��1QRS�~$��5E�����J=�|~'���.��K�.{
���p����
l�?��
x�:��t.���>U������jws�&M���f������HD�;���r*����C
o�|�}�K_���B����#�2�����jR����W_T�	�:����H�H�f
�K�hJ�=����jkH�05J�7�
���+�������D6*7���muv�c��%Q�]��h�gE`J�����e�q����T�l�V���zC�q��0Ww�F��h���r��'j���}G���1����	;�LG��M��`���������������-�O�U�(���F����������T�Q���(�����R�h�,dAa#I��4�"}\�Rg%M6d����\�tkUe$��l�Dy��r�������*��8��e�`(�n�����j���� l=ioT�Q���hk9���������arw{�Y5W���f{h�s�e������{\W|������M>{��'��W�d�`�(�CL���dp���d2���?�\�"`��	�YRz�G.<���`@X�>�(�b)�����e�\���7�N��YB���8��x���6����Gjd��������:4U��4������!�'
4�`��� ����Se���A����M2�r-�����T!�d�g�L�����/�V��eO�aO�*u��R %[���^��|&���A���g\M]������U�6�n=W��I�(Bj/�2����BGtk1�P��1�L,��W�/�C�+;G���M�(��0���HD��������� ]�9;�eB�hm�"i�w-fF�z���^^l��{T����|�?�M���d*h0#��!%�q�\��5�������Wo��J��][����w�j_��h�b���Y��UH��6�g�\Dw�{P��0�������(@�����;������'O�� �D��0m<��D'��4I�P���/���
�u{h��M�\y��)y*
�tw�c5��d2�����77�L��]��l�i��'Q��V�TH���f%m.���u��p9|�W�I�6����;W�TH	��l6*okX���_�����d+�;���B���K��$Z;h�/�h�9�&�(�%8Xj�	N�X�hN��>3���L���u���f]f�`������N!���$��&���@�����2�~Z<����vo(n�K(��@{������6�jE�H�:���G�ypL�Ha�I|���Z�m���2}g~�;�eF�M�i�`L���#�@�cW����>���%��������cf
�d!�BF�/E>#���'�c��f�5�B�$ Yh��w/0��L��3�%H���D�A?W�*��/R
t�m��������D.W6i�r���r�c9��u	w��D�H����+�G�],;����Su�V�6m�>l�����������+�{�p[��y�
�l����.yS��+7b �,�7��2*AQ�%�t�����-�� h��������	�f�r�����
AVy��Nb[5��vO�]�M�{r1N���W.$�sw�b�A�i���CY���i���@��h��i�SS��4���S�=�%�kR����k��v7��������>����i���c�dG,h��B]A��4Bx@L@�	���V������7���R2KR��,�UO7F�T�����&9B����v8��u���8����SJ��\x4_�)�~���md������du
&�u���T��|~$D������Hx��6�8��K����~�~
7 ��r*]����tV�_2#��������X6��zs�!+K�lK��b��y�����O�e��]A��.�O_����m�������,�h��k���)��0Df#
�9�w�)��j����C��r���I�&<����q��=�������T"����l�<i��R�(>v��SK�lt��~8=��f����� �n':@jf)�`�� NTK�����I����p�3#`RXVp<���������4�T���Y��jDS�Y6��� �:<�Hn���KH[\ra���T��?��cW5u�����n�[��p�����m���?�/�s}h'�mw���s�xN���q�#f�Dx��������*M;��P!^/5�P9��~2�����PZ�X7�'�J\���	�-��G��������U��<�f������l�N!���J#�5X~iV�]�g3E���Go�>�����]'	�Ud�,#�e��6,��>�L&H�Q*E;�3/z�A�y��_�zwQBx�s�'�V��3I�J�lF���z�y�g�o#G���&n&
�����J�����M,rw�]J���;�!�/�vgf[}���t���g����N����T���|��pV�J{A'��t�k[0�e���}��������8w>3��M��.>k�d�����O��2�|<���It���d�)��\��IZ1��d���@2��f���i��\����,��3��S��(2���-��ev���zT���U�=�z\��J>:��,,p2��	j��E;|�?�)r��$T��	���B����\N��z��{i����������P�jz��u/$�<,N���I�s\X
�]gb��A�����7�c�~��-	��J�Bn��
C�B�=6;�|e�+�k4*�!�T2v�L�#%:F�.����e�n��g&>qZ�A/�L2q@�(�����q_>h���:�j�6%F���^uD?w�K��s����)BQRve!s��0U6[����l0tp�D�����c�
&�m$���lI��������������Tm����-�h������cg��|����\�S�����/�A�����Z�1w�np�Iv`�6A�.���=E�k�LG�Q��@2���f����v����))o����3�l�C�DN� n��C����Z�%*;��M�Fh�D�]{CD.���~>������c�,�I�)}o��a�:�'�����k;�����~r�O��mR��U��zi��B
3�+�O�2XV�����yx�Y�&h��gY���$����+5��)�%�������=:s��vsG� �v����f�)g�?]A������]�U:��M> �K���g��h���p�
����a0A4C�%%cC��|hf\�5y�.��wHrO�
��H.���\����)�uG�Lu��[I�8;d"���k����	�k��Xp�
��a�Y	���7��P��
�.@���9@2Oa���i~���S_�#����38�W�h-@j���u5l�����y��9�V�4ZV�8*�#�(*V��2���bg~�f�7����Vt�F;#��t���FWk�cR�Y��L|��
Q`eE��~4�t&H�w�u����U�+p�4I���:���@��\J�.�g���=��i-|E)�E����2Ga�A��t�|U�+z��KJ|K>&�������:H�W4�����M	�h�s{����p����9��lx�'��4�\�]-dQ������0G�6WA���m���j�ne1Oc��)/]j�<���.��,O���C7w��_��NRE�J�Q���T��@�����?T��������7�����bu	�k�m.�����_ve�Y	?�nw��V I���t�[!�\)	�l�/q����������FmF!�UJV&��g�g(�����e H>��:d���O������K�[p\�a�V����n[@����������	�*��i+�i���^�4=l)�`� <kr���F��?�[6��D����z����7�V�p}9h��7 ��e�\����C�i�7�rRHs2K�������XHs�O&(�����o����O���z��='��	��3��
�q�h�����������M��r8l�)�s$gS�'���@��Frp\���AZ�_��:�S�5��5�����T����v�g(�t����T�AS�u��?T�����$?����N�K���gy��$\����N���l��Nd"�8��-m�������qj[���^�5�}ew�7�s�Xn�0�=���L'{����H������������5�����v�*����I�v�}�n��m��G4i��-�������[���h��8��W��	3wFB����Q�������=�~=I�4������`a��1�\�k���v�����}��>U�����~��o4�m	KN�tf
���KAe������:,�.I]>2�O���&��m��"~�������t�P��kj����yDF[W�������GR�A�8C�9�%�*(9U.�g���v����-$f.��b��Y�l��`t|�h	jpr0
���Ge59���w����}�|�X���j�C���{o^�#�����S�)���e%�1�o���F�����e�a�+��)�*4{M�BI,j�K�?�4��o����*A#
W��N=�7����U��n�J/�����q�J[�0!�N����l���N(�C�P�8r*B/�C�c�D�v�
Bh�CeY0$�
�"�����h�����l]6M-XRnd0�D�?��#*������� <R����E��+��@#<H6�!8�sU�K,\������"7�@�4����5)�0x�E���0�xJ�M�l�:�f�����a�Z��0�;���|<y6��C����?�5����j�Dn
��s\��7� c�0	�"�['����������z�����<��$����|����R���h�����@~��%��)�4�(UV"s�u�@_Tf��|������'y��a �U-��i���i���Db�C��U����15&� ]�Eb���+�s�M�9��($
�2����{%�)V&V��R��~�tF{�$�%W{�g���/G������FF�%_��"�>������h����wf����[#�������~�v���$�0�7��^q������ p�Rs����h�ix�<6�R^=xQ�F��F3���(���m~x���9��
f,���_��#���������;jJ��~O!S�Ak�^s ��G�����#��������~��(	p	x�K�8�E�N�<J��
�e'��7�/��)�h�/�y����#������������{pVH�<��r<�����*.��!�'O�}��Pv�3��4	v?�d,�Ea�4RN���L�����>v2�]������U="I��j���Qy�;`WO^d7���s��^Wm�;�wW��UX�Co�6K�U�����DT��X�Ess���)r�����|]2�<�I�����v+�����F�����@�\�����B�[A��;M8��~1j$B�B*���M���y�8���%�����7������N�IU���s�q
$NW���FK+�a�P�|:��N$��7?SRd��
$�|3�����n���7u�o�]�����q������,25+�
k��}��`�p�U���	$2c������m�_E�n3���;�7��Xq�>/�F�lv�\�����}4�ng�%�\U�SB���w;��_�����:l��W���a�
'NnC����1E�6j����&����:��K�w�ls�[�~s��9r&���r(\�C	��l�w���������L��+	����g�PH����2<��:�!�����M�W���{TV��f�s��H�j�M.�e����6�����4�<(��z���x<:b�+*qB%�R�KMY�!w��y���xZ1�Ss�<q�Ai!%�5��e��<�^,"'i��������5����'�m$�s����n?T�)���A8+���L��	��&�����l.��]���Uf�,S)Q�:��hBTq���+�
X��}��^�p��t�z'A�z���z�8'�����$M��'�2��	-��#5f���j����8���d�C-�e���[�Q���������w������������0x���!E,���	~^R�^�:=���S�:��T�u*�~�X10����n��	���`�s��^��>�I�H���lj
�� [���[N���!�t�������4\7�p$X�������>�8��~���\�7dR�s���
)Y \u.��{N�I���A�!Y��Ca%�)�6��frY���]Y�-�d�)�>�,���3F�\���\q�y}<���1	b�
�{q�
������8hh�\H=�v����>��|r_���(���m��Y�;�ho��4�
��Xdsn\g@�Ax����v�{H')^��	d������)t�OH~^Fg�����c��>wq������]H''T����xE�qn��U�K��q���deL.�����^�(�*�p��3��u(N�LP���Y�N�Z��3����&s3��:��Q+������*u9?��.!���g��E��E��aT�
�tt���e�}�I�[�X+9���|�� ����������$�bd�^e0�1H�t�5��w�l�����^��B%�:��(r�46#���
��if����\c�HWM*��8��\g���d���:����$%��5w���f�����s������o��5UZ	�4����{3�A���X�-���[�'T�d`A'K�R�N�vY��\��]YT����Z�S�u��C��,P����i6��.�����������o-(��'�1�.�{!qu:}2'_�����,��v�7{}�%����RnC���N�*���Dq���'���8-�
�-�b��T(;=Gr�@Z���~��s�U���;o��;��yWxjU����lj�o!�Dk����B���P�06�@�?oO5JIu;A��\��-�����2,�6�������k)���x���xy�� ��t�`��>.�d7��x��d��{&]��?�$��fi�U����G�6s��'A{����������.��9���LtSa����v#��L����?W��>��C�TV���D����q����Q����=��8�m�&g16v� 4�:�`��Y|�ZO0�2Cn�����V{�u����m���zWWIO#|)�"�3��K������J3����D��>���J�n`j�_D�������[e�����zQ�WQ���:r�`�qv��e"*���F�Z�'�a������9���x��g69��|����8�22��Db�����%�@/8�A7a����I��I�L�"�����A=b������G5Dr���A�[A��*����M����#��R�D���$$#���������"'T��qf����vvH��lrq��|.���O����6�g������Qi� ��4��^�#�v��n?<��������>��O��MC@,�!yT��|���^6�My��}J 5oej/��hi���X�v���9u���F3Tw^���x.�$�o����s{�v�e��`�����%�.<���4�Z�M�H����@.
�q���/�rY��}���BZ�9h$�;B��pf*T��0���n����Qc��i�� ������>� �}��D���IV#q����f���l�</��f�\]J,VL�<�Q-?-�N��z}�(U.��������>������s|_w�Q[K���\��ov"�e1c��^Q��%w���5I��v���]�j&>�Z���TT g�,A��]~m,�`�$	��q���r�A��|P�[��y�����('�
��g���Z|	hm���x�a��&���\'e���14a��N��(�I�}���s!8��ti����]j�
8sDY{�.w[�|>��A��Iz,5��s�(-�J������M�>������2��A�5�����h.1�F��0�����I)��q�{=5V[6����#;��A�f)��e������.����}���+P��/m}���N8�<���?D�&��i%����`�_#������I�:����o�����,�@`t����q�����T���TY�?~�Mz;���-���,L���lE�����n]�^��$��O������+�#�~(�|�����j�����AR��A�@P���
,�^O��w���Q�����?c�����;�� /��5����n����s�Bj=&{H^PK�Gq����}�{ ��o�(��V�u	���a���g2��J��^��^�l�O��i����4&VH|��������9�E�����M���r��@���Vi��������P����^��WMS�;���i�4�F���6lq�a���W^~,��S>�l����|l�(O]>�R������]>Ps����'�
Jmq��`��:���8�����+������^�����'�J�?�S:t��������S�V��RQB���j����S��V��G��K"P���B����Pe�r�/o�K;�'.|H(:���Gzft8�-`�$:����V��N�WJ��T�Y(e��	�������$+��XN��5d�y7d�k�Ib�-�;��`�W���|�z��$#x����
u��	^e��ZU:�4��{�
i��
c�a����5�~�������7oJ%���%���������!]A�R�u�~A�r	���Zi�K`��a&�Vd��A_��"���8�	v����*��%=������y1qJ��)�uFM\K3�}����'�\�������E�{��=:����oK��7e��e�2�D�v�%m���x�h[n;�a���p�Ge�8���������S�ZY�1kY��������;�s.��QT�Ht�'H�$������g��yy#aX�m>cCu��rT�.�%`&��k�^��
�|Bf%���	0Sf,'m=$�2��#D.��WF��T\'o�m���%�8�&.�ts���Y0������y��t�GaA�i������������o�M{�I�@�t^����?G;����>����YW�]	R	�.-�"+E���dR�{6���X�*���	 Vg��)w�a3��p������fR��F\X���!FT��u:����K���$�')x\�:"�����W>S�jlj��JF��+}�A�~�$%k�������I�8fQ���3z����������<i��\8=�.��d��r�7�u�T����4�d8���{r�8����u�X:�rIr�����������������>)1���^T���D�ff�o�kx�Vo�{?b��6I0������h,��#
�J�Q6C��k�m����m�|]p��� �Y$
��w-��e�ve��l:�jL��s@a��]Id�|.?:GAun0C{�W$�z�o������������"���f����L8�����E�r�U�eZ|O������]����U� -���C�taq��$����>�J/�x>�Oh�����]D/:@7$AV�T��T�3��d??N�3*H���-��{F9u�W��G?!�.�,��|�|�i����r�r����n����@�~w��K��j��k���zK�R��]������fJ�@j����c"1�0����8l�
m�Z�h��������N����.�����4�.���p�{>~5��K[���s�e_})�����j�d��2�[>3��Eg���>�]V$�K��A��L�*�9���VI�P���p5����|i!�f�7���[G[�z����$�����_��u'#��W�6��d�`��C������'�����wR���v�\6}�.\B2H����|V��Oc��z�DQ,alXY6l����e��^�>�]�������k�����;���H����ZK�h�R���i6
�p�~�_pS��YP�>*uD
���+y��������kQICD�j�($v�$��g�__��94���n:&��� w��-|N���Ppm8���o���+����=;w�(�jI�d�g�R���]07?��m��S�pl�$H����M�hp�	@��5�,��N~���7Lz���������-���l��������~�����J��F�]����<�i	����u�����\?�����j�4�q�������.Wv*�����f2�>�K��������+8��d�����,O|6��y\5����d�a��'����1�����GY��� ��8�~������NEI�<S7IYHT�����a�U������WQ*�|�R���F�u���m�|`H�sO���~b�����M�MB�L
��`�OB���������]\�_M\�.��K �����O�j	]������l^��N,*��%�<�VK��i�������G-r�AFZ�~Z�Ry�~���XN�I��4ej���D�� _I�~��/�;�5���>�tOf�&s(l�v�NV����*K����,�t�� �'r� ��O7��<A&����E��l�.?��,��o��9	b�gh�m�f ���,��X�A�w��I����=�&+���i'����F��������zHht��j2���a����>��%5�]x2Q�K���s�f��C�����R��r2����]�6Hr����)2�f M=����U5%)��5^���Z��~W7��y]����>�t�U���2�]}�8���9��v�u�o��$�� <�&����|��}��u�
9�F��,���,%��C�Q�89��I r	���3��Z"V\�B jv��'��{ ��f�� V"��G�!�0����)����(�{��u�w��"�5��������d����l����|jX�5u�-�
yB
Pn��������;�����!�� ��X6x���C�Z�
������V��R/����:�W��$���^�f��G�2��B5��;���z���AcUjS�'A�3j��O��'������1����C���v��+G1u�vp�1��L��b��������tf0O~�Gw�h��\���@FY��!���e��U�\Y�K��N����I4E�������7�br��8���|���U� �
(�����/Y/3e�^nw������ ��R�{�����X��BM&���E��E'"L�b�"��������G��~�$hL�����$�R���{��I:�J���cG�9l[�e��������pwL	�q�J��Eu��<H�C��� =���j�sY�v�vM��[����.d�Q7�[�#����M�����
�M��Ke�M�(���\u�r��<^.�|�Ti�`y�V>���P,9	A�m�L����D"_����4A_��+�j68���P\����H!q��C��K"z!��0�`a4l�f��Xq�b���5���_PK��]�g��
PK�0O��]�g��
 ��worker_aborts_perf.svgUT
o�]s�]o�]ux��PKd�
v2-0011-BGWorkers-pool-for-streamed-transactions-apply-wi.patch (text/x-patch)
From 7ffeeb69fd626dbcc35ca13d97a8407573ea6d4a Mon Sep 17 00:00:00 2001
From: Alexey Kondratov <kondratov.aleksey@gmail.com>
Date: Wed, 28 Aug 2019 15:26:50 +0300
Subject: [PATCH v2 11/11] BGWorkers pool for streamed transactions apply
 without spilling on disk

---
 src/backend/postmaster/bgworker.c        |    3 +
 src/backend/postmaster/pgstat.c          |    3 +
 src/backend/replication/logical/proto.c  |   17 +-
 src/backend/replication/logical/worker.c | 1783 +++++++++++-----------
 src/include/pgstat.h                     |    1 +
 src/include/replication/logicalproto.h   |    4 +-
 src/include/replication/logicalworker.h  |    1 +
 7 files changed, 936 insertions(+), 876 deletions(-)
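
At a high level, the patch replaces the spill-to-disk apply path with a pool of background apply workers: each in-progress streamed transaction gets routed to a dedicated worker over a shm_mq in a per-worker DSM segment. The protocol visible in the hunks below is simple: ordinary logical replication messages are forwarded verbatim; single-byte control messages steer the worker ('R' reassigns an idle worker to a new xid, 'E' marks the end of a streamed chunk, 'F' marks the transaction finished); a zero-length message asks the worker to exit. A minimal sketch of the receive loop such a worker could run, assuming an attached shm_mq_handle mqh (the helper names are illustrative, not taken from the patch):

    for (;;)
    {
        Size            nbytes;
        void           *data;
        shm_mq_result   res;

        /* Blocking read from the queue the main apply worker writes into. */
        res = shm_mq_receive(mqh, &nbytes, &data, false);
        if (res != SHM_MQ_SUCCESS)
            break;                  /* queue detached, leader went away */

        if (nbytes == 0)
            break;                  /* zero-length message: graceful stop */

        if (nbytes == 1)
        {
            char action = *(char *) data;

            if (action == 'E')
                mark_chunk_applied();   /* hypothetical: sets pstate->ready */
            else if (action == 'F')
                mark_finished();        /* hypothetical: sets pstate->finished */
            else if (action == 'R')
                reset_for_new_xid();    /* hypothetical: pick up pstate->stream_xid */
            continue;
        }

        /* Anything longer is an ordinary logical replication message. */
        apply_message(data, nbytes);    /* hypothetical apply_dispatch wrapper */
    }

The leader-side counterparts of all of these sends appear in the worker.c hunks further down.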

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index f5db5a8c4a..6860df07ca 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -129,6 +129,9 @@ static const struct
 	},
 	{
 		"ApplyWorkerMain", ApplyWorkerMain
+	},
+	{
+		"LogicalApplyBgwMain", LogicalApplyBgwMain
 	}
 };
 
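Adding "LogicalApplyBgwMain" to the InternalBGWorkers table is what lets the pool members be started as dynamic background workers by function name. A sketch of how setup_background_worker() (declared in worker.c below; its body is not included in this excerpt) plausibly launches one; only the entry-point name comes from the patch, the field values here are assumptions:

    BackgroundWorker bgw;
    BackgroundWorkerHandle *handle;

    memset(&bgw, 0, sizeof(bgw));
    bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
    bgw.bgw_start_time = BgWorkerStart_ConsistentState;
    bgw.bgw_restart_time = BGW_NEVER_RESTART;
    snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");    /* internal worker */
    snprintf(bgw.bgw_function_name, BGW_MAXLEN, "LogicalApplyBgwMain");
    snprintf(bgw.bgw_name, BGW_MAXLEN, "logical apply bgworker");
    /* hand over the DSM segment holding the shm_mq and ParallelState */
    bgw.bgw_main_arg = UInt32GetDatum(dsm_segment_handle(wstate->dsm_seg));
    bgw.bgw_notify_pid = MyProcPid;

    if (!RegisterDynamicBackgroundWorker(&bgw, &handle))
        ereport(ERROR,
                (errmsg("out of background worker slots")));
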
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e5a4d147a7..b32994784f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3637,6 +3637,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
 			event_name = "Hash/GrowBuckets/Reinserting";
 			break;
+		case WAIT_EVENT_LOGICAL_APPLY_WORKER_READY:
+			event_name = "LogicalApplyWorkerReady";
+			break;
 		case WAIT_EVENT_LOGICAL_SYNC_DATA:
 			event_name = "LogicalSyncData";
 			break;
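
The new wait event presumably backs wait_for_worker() and wait_for_worker_to_finish() in worker.c, whose bodies are not included in this excerpt. A plausible shape for that wait, polling the shared flag under the spinlock and sleeping on the latch in between (a sketch, not the patch's actual code):

    for (;;)
    {
        bool ready;

        SpinLockAcquire(&wstate->pstate->mutex);
        ready = wstate->pstate->ready;
        SpinLockRelease(&wstate->pstate->mutex);

        if (ready)
            break;

        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         10L,       /* poll every 10ms */
                         WAIT_EVENT_LOGICAL_APPLY_WORKER_READY);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
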
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 4bec9fe8b5..954ce7343a 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -789,14 +789,11 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendint64(out, txn->commit_time);
 }
 
-TransactionId
+void
 logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
 {
-	TransactionId	xid;
 	uint8			flags;
 
-	xid = pq_getmsgint(in, 4);
-
 	/* read flags (unused for now) */
 	flags = pq_getmsgbyte(in);
 
@@ -807,8 +804,6 @@ logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	commit_data->commit_lsn = pq_getmsgint64(in);
 	commit_data->end_lsn = pq_getmsgint64(in);
 	commit_data->committime = pq_getmsgint64(in);
-
-	return xid;
 }
 
 void
@@ -823,13 +818,3 @@ logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 	pq_sendint32(out, xid);
 	pq_sendint32(out, subxid);
 }
-
-void
-logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
-							 TransactionId *subxid)
-{
-	Assert(xid && subxid);
-
-	*xid = pq_getmsgint(in, 4);
-	*subxid = pq_getmsgint(in, 4);
-}
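
The net effect of the proto.c changes: the xid no longer comes back from logicalrep_read_stream_commit(), and logicalrep_read_stream_abort() is gone entirely, because the main apply worker now has to peel the xid off the message itself in order to pick the right queue before anything else gets decoded. That inline read shows up later in the worker.c hunks as:

    /* main apply worker: route on the xid before decoding the rest */
    xid = pq_getmsgint(s, 4);       /* toplevel transaction */
    subxid = pq_getmsgint(s, 4);    /* aborted subtransaction */
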
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ca632b7dc4..ab43b12985 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -92,11 +92,16 @@
 #include "rewrite/rewriteHandler.h"
 
 #include "storage/bufmgr.h"
+// #include "storage/condition_variable.h"
+#include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/shm_mq.h"
+#include "storage/shm_toc.h"
+#include "storage/spin.h"
 
 #include "tcop/tcopprot.h"
 
@@ -115,6 +120,54 @@
 #include "utils/syscache.h"
 
 #define NAPTIME_PER_CYCLE 1000	/* max sleep time between cycles (1s) */
+#define PG_LOGICAL_APPLY_SHM_MAGIC 0x79fb2447 // TODO Consider change
+
+typedef struct ParallelState
+{
+	slock_t	mutex;
+	// ConditionVariable cv;
+	bool	attached;
+	bool	ready;
+	bool	finished;
+	Oid		database_id;
+	Oid		authenticated_user_id;
+	Oid		subid;
+	TransactionId	stream_xid;
+	uint32	n;
+} ParallelState;
+
+typedef struct WorkerState
+{
+	TransactionId			 xid;
+	BackgroundWorkerHandle	*handle;
+	shm_mq_handle			*mq_handle;
+	dsm_segment				*dsm_seg;
+	ParallelState volatile	*pstate;
+} WorkerState;
+
+/* Apply workers hash table (initialized on first use) */
+static HTAB *ApplyWorkersHash = NULL;
+static WorkerState **ApplyWorkersIdleList = NULL;
+static uint32 pool_size = 10; /* MaxConnections default? */
+static uint32 nworkers = 0;
+static uint32 nfreeworkers = 0;
+
+/* Fields valid only for apply background workers */
+bool isLogicalApplyWorker = false;
+volatile ParallelState *MyParallelState = NULL;
+
+/* Worker setup and interactions */
+static void setup_dsm(WorkerState *wstate);
+static void setup_background_worker(WorkerState *wstate);
+static void cleanup_background_worker(dsm_segment *seg, Datum arg);
+static void handle_sigterm(SIGNAL_ARGS);
+
+static bool check_worker_status(WorkerState *wstate);
+static void wait_for_worker(WorkerState *wstate);
+static void wait_for_worker_to_finish(WorkerState *wstate);
+
+static WorkerState * find_or_start_worker(TransactionId xid, bool start);
+static void stop_worker(WorkerState *wstate);
 
 typedef struct FlushPosition
 {
@@ -143,47 +196,13 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
-/* fields valid only when processing streamed transaction */
+/* Fields valid only when processing streamed transaction */
 bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
-static int	stream_fd = -1;
-
-typedef struct SubXactInfo
-{
-	TransactionId xid;			/* XID of the subxact */
-	off_t		offset;			/* offset in the file */
-}			SubXactInfo;
-
-static uint32 nsubxacts = 0;
-static uint32 nsubxacts_max = 0;
-static SubXactInfo * subxacts = NULL;
-static TransactionId subxact_last = InvalidTransactionId;
-
-static void subxact_filename(char *path, Oid subid, TransactionId xid);
-static void changes_filename(char *path, Oid subid, TransactionId xid);
-
-/*
- * Information about subtransactions of a given toplevel transaction.
- */
-static void subxact_info_write(Oid subid, TransactionId xid);
-static void subxact_info_read(Oid subid, TransactionId xid);
-static void subxact_info_add(TransactionId xid);
-
-/*
- * Serialize and deserialize changes for a toplevel transaction.
- */
-static void stream_cleanup_files(Oid subid, TransactionId xid);
-static void stream_open_file(Oid subid, TransactionId xid, bool first);
-static void stream_write_change(char action, StringInfo s);
-static void stream_close_file(void);
-
-/*
- * Array of serialized XIDs.
- */
-static int	nxids = 0;
-static int	maxnxids = 0;
-static TransactionId	*xids = NULL;
+static TransactionId current_xid = InvalidTransactionId;
+static TransactionId prev_xid = InvalidTransactionId;
+static uint32 nchanges = 0;
 
 static bool handle_streamed_transaction(const char action, StringInfo s);
 
@@ -199,6 +218,16 @@ static volatile sig_atomic_t got_SIGHUP = false;
 /* prototype needed because of stream_commit */
 static void apply_dispatch(StringInfo s);
 
+// /* Debug only */
+// static void
+// iter_sleep(int seconds)
+// {
+// 	for (int i = 0; i < seconds; i++)
+// 	{
+// 		pg_usleep(1 * 1000L * 1000L);
+// 	}
+// }
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -250,6 +279,107 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Look up worker inside ApplyWorkersHash for requested xid.
+ * Throw error if not found or start a new one if start=true is passed.
+ */
+static WorkerState *
+find_or_start_worker(TransactionId xid, bool start)
+{
+	bool found;
+	WorkerState *entry = NULL;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* First time through, initialize apply workers hashtable */
+	if (ApplyWorkersHash == NULL)
+	{
+		HASHCTL		ctl;
+
+		MemSet(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(TransactionId);
+		ctl.entrysize = sizeof(WorkerState);
+		ctl.hcxt = ApplyContext; /* Allocate ApplyWorkersHash in the ApplyContext */
+		ApplyWorkersHash = hash_create("logical apply workers hash", 8,
+									 &ctl,
+									 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	Assert(ApplyWorkersHash != NULL);
+
+	/*
+	 * Find entry for requested transaction.
+	 */
+	entry = hash_search(ApplyWorkersHash, &xid, HASH_FIND, &found);
+
+	if (!found && start)
+	{
+		/* If there is at least one worker in the idle list, then take one. */
+		if (nfreeworkers > 0)
+		{
+			char action = 'R';
+
+			Assert(ApplyWorkersIdleList != NULL);
+
+			entry = ApplyWorkersIdleList[nfreeworkers - 1];
+			if (!hash_update_hash_key(ApplyWorkersHash,
+									  (void *) entry,
+									  (void *) &xid))
+				elog(ERROR, "could not reassign apply worker #%u entry from xid %u to xid %u",
+													entry->pstate->n, entry->xid, xid);
+
+			entry->xid = xid;
+			entry->pstate->finished = false;
+			entry->pstate->stream_xid = xid;
+			shm_mq_send(entry->mq_handle, 1, &action, false);
+
+			ApplyWorkersIdleList[--nfreeworkers] = NULL;
+		}
+		else
+		{
+			/* No entry in hash and no idle workers. Create a new one. */
+			entry = hash_search(ApplyWorkersHash, &xid, HASH_ENTER, &found);
+			entry->xid = xid;
+			setup_background_worker(entry);
+
+			if (nworkers == pool_size)
+			{
+				ApplyWorkersIdleList = repalloc(ApplyWorkersIdleList, (pool_size + 10) * sizeof(WorkerState *));
+				pool_size += 10;
+			}
+		}
+	}
+	else if (!found && !start)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				errmsg("could not find logical apply worker for xid %u", xid)));
+	else
+		elog(DEBUG5, "there is an existing logical apply worker for xid %u", xid);
+
+	Assert(entry != NULL);
+
+	return entry;
+}
+
+/*
+ * Gracefully teardown apply worker.
+ */
+static void
+stop_worker(WorkerState *wstate)
+{
+	/*
+	 * Sending zero-length data to worker in order to stop it.
+	 */
+	shm_mq_send(wstate->mq_handle, 0, NULL, false);
+
+	elog(LOG, "detaching DSM of apply worker #%u for xid %u",
+									wstate->pstate->n, wstate->xid);
+	dsm_detach(wstate->dsm_seg);
+
+	/* Delete worker entry */
+	(void) hash_search(ApplyWorkersHash, &wstate->xid, HASH_REMOVE, NULL);
+}
+
 /*
  * Handle streamed transactions.
  *
@@ -262,12 +392,12 @@ static bool
 handle_streamed_transaction(const char action, StringInfo s)
 {
 	TransactionId xid;
+	WorkerState *entry;
 
 	/* not in streaming mode */
-	if (!in_streamed_transaction)
+	if (!in_streamed_transaction || isLogicalApplyWorker)
 		return false;
 
-	Assert(stream_fd != -1);
 	Assert(TransactionIdIsValid(stream_xid));
 
 	/*
@@ -278,11 +408,16 @@ handle_streamed_transaction(const char action, StringInfo s)
 
 	Assert(TransactionIdIsValid(xid));
 
-	/* Add the new subxact to the array (unless already there). */
-	subxact_info_add(xid);
+	/*
+	 * Find worker for requested xid.
+	 */
+	entry = find_or_start_worker(stream_xid, false);
 
-	/* write the change to the current file */
-	stream_write_change(action, s);
+	// elog(LOG, "sending message of length=%d and raw=%s, action=%s", s->len, s->data, (char *) &action);
+	shm_mq_send(entry->mq_handle, s->len, s->data, false);
+	nchanges += 1;
+
+	// iter_sleep(3600);
 
 	return true;
 }
@@ -643,7 +778,8 @@ apply_handle_origin(StringInfo s)
 static void
 apply_handle_stream_start(StringInfo s)
 {
-	bool		first_segment;
+	bool		 first_segment;
+	WorkerState *entry;
 
 	Assert(!in_streamed_transaction);
 
@@ -652,17 +788,16 @@ apply_handle_stream_start(StringInfo s)
 
 	/* extract XID of the top-level transaction */
 	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+	nchanges = 0;
 
-	/* open the spool file for this transaction */
-	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+	/* Find worker for requested xid */
+	entry = find_or_start_worker(stream_xid, true);
 
-	/*
-	 * if this is not the first segment, open existing file
-	 *
-	 * XXX Note that the cleanup is performed by stream_open_file.
-	 */
-	if (!first_segment)
-		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+	SpinLockAcquire(&entry->pstate->mutex);
+	entry->pstate->ready = false;
+	SpinLockRelease(&entry->pstate->mutex);
+
+	elog(LOG, "starting streaming of xid %u", stream_xid);
 
 	pgstat_report_activity(STATE_RUNNING, NULL);
 }
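
Stream start/stop thereby becomes a chunk-level handshake: the leader clears pstate->ready before streaming a chunk, and at stream stop (next hunk) sends 'E' and blocks in wait_for_worker() until the flag comes back up. The worker's half of that handshake would be roughly the following (a sketch; the worker-side code is not part of this excerpt):

    /* after applying everything up to the 'E' control byte */
    SpinLockAcquire(&MyParallelState->mutex);
    MyParallelState->ready = true;
    SpinLockRelease(&MyParallelState->mutex);
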
@@ -673,16 +808,19 @@ apply_handle_stream_start(StringInfo s)
 static void
 apply_handle_stream_stop(StringInfo s)
 {
+	WorkerState *entry;
+	char action = 'E';
+
 	Assert(in_streamed_transaction);
 
-	/*
-	 * Close the file with serialized changes, and serialize information about
-	 * subxacts for the toplevel transaction.
-	 */
-	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
-	stream_close_file();
+	/* Find worker for requested xid */
+	entry = find_or_start_worker(stream_xid, false);
+
+	shm_mq_send(entry->mq_handle, 1, &action, false);
+	wait_for_worker(entry);
 
 	in_streamed_transaction = false;
+	elog(LOG, "stopped streaming of xid %u, %u changes streamed", stream_xid, nchanges);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
@@ -695,96 +833,67 @@ apply_handle_stream_abort(StringInfo s)
 {
 	TransactionId xid;
 	TransactionId subxid;
+	WorkerState *entry;
 
 	Assert(!in_streamed_transaction);
 
-	logicalrep_read_stream_abort(s, &xid, &subxid);
-
-	/*
-	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
-	 * just delete the files with serialized info.
-	 */
-	if (xid == subxid)
+	if (isLogicalApplyWorker)
 	{
-		char		path[MAXPGPATH];
+		subxid = pq_getmsgint(s, 4);
 
-		/*
-		 * XXX Maybe this should be an error instead? Can we receive abort for
-		 * a toplevel transaction we haven't received?
-		 */
+		ereport(LOG,
+				(errcode_for_file_access(),
+				errmsg("[Apply BGW #%u] aborting current transaction xid=%u, subxid=%u",
+				MyParallelState->n, GetCurrentTransactionIdIfAny(), GetCurrentSubTransactionId())));
 
-		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		if (subxid == stream_xid)
+			AbortCurrentTransaction();
+		else
+		{
+			char *spname = (char *) palloc(64 * sizeof(char));
+			sprintf(spname, "savepoint_for_xid_%u", subxid);
 
-		if (unlink(path) < 0)
-			ereport(ERROR,
+			ereport(LOG,
 					(errcode_for_file_access(),
-					 errmsg("could not remove file \"%s\": %m", path)));
+					errmsg("[Apply BGW #%u] rolling back to savepoint %s", MyParallelState->n, spname)));
 
-		subxact_filename(path, MyLogicalRepWorker->subid, xid);
-
-		if (unlink(path) < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not remove file \"%s\": %m", path)));
+			RollbackToSavepoint(spname);
+			CommitTransactionCommand();
+			// RollbackAndReleaseCurrentSubTransaction();
 
-		return;
+			pfree(spname);
+		}
 	}
 	else
 	{
-		/*
-		 * OK, so it's a subxact. We need to read the subxact file for the
-		 * toplevel transaction, determine the offset tracked for the subxact,
-		 * and truncate the file with changes. We also remove the subxacts
-		 * with higher offsets (or rather higher XIDs).
-		 *
-		 * We intentionally scan the array from the tail, because we're likely
-		 * aborting a change for the most recent subtransactions.
-		 *
-		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
-		 * would allow us to use binary search here.
-		 *
-		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
-		 * order, i.e. from the inner-most subxact (when nested)? In which
-		 * case we could simply check the last element.
-		 */
+		xid = pq_getmsgint(s, 4);
+		subxid = pq_getmsgint(s, 4);
 
-		int64		i;
-		int64		subidx;
-		bool		found = false;
-		char		path[MAXPGPATH];
+		/* Find worker for requested xid */
+		entry = find_or_start_worker(stream_xid, false);
 
-		subidx = -1;
-		subxact_info_read(MyLogicalRepWorker->subid, xid);
+		elog(LOG, "processing abort request of streamed transaction xid %u, subxid %u",
+			xid, subxid);
+		shm_mq_send(entry->mq_handle, s->len, s->data, false);
 
-		/* FIXME optimize the search by bsearch on sorted data */
-		for (i = nsubxacts; i > 0; i--)
+		if (subxid == stream_xid)
 		{
-			if (subxacts[i - 1].xid == subxid)
-			{
-				subidx = (i - 1);
-				found = true;
-				break;
-			}
-		}
-
-		/* We should not receive aborts for unknown subtransactions. */
-		Assert(found);
+			char action = 'F';
+			shm_mq_send(entry->mq_handle, 1, &action, false);
+			// shm_mq_send(entry->mq_handle, 0, NULL, false);
 
-		/* OK, truncate the file at the right offset. */
-		Assert((subidx >= 0) && (subidx < nsubxacts));
+			wait_for_worker_to_finish(entry);
 
-		changes_filename(path, MyLogicalRepWorker->subid, xid);
+			elog(LOG, "adding finished apply worker #%u for xid %u to the idle list",
+												entry->pstate->n, entry->xid);
+			ApplyWorkersIdleList[nfreeworkers++] = entry;
 
-		if (truncate(path, subxacts[subidx].offset))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not truncate file \"%s\": %m", path)));
+			// elog(LOG, "detaching DSM of apply worker for xid=%u\n", entry->xid);
+			// dsm_detach(entry->dsm_seg);
 
-		/* discard the subxacts added later */
-		nsubxacts = subidx;
-
-		/* write the updated subxact list */
-		subxact_info_write(MyLogicalRepWorker->subid, xid);
+			// /* Delete worker entry */
+			// (void) hash_search(ApplyWorkersHash, &xid, HASH_REMOVE, NULL);
+		}
 	}
 }
 
@@ -794,159 +903,56 @@ apply_handle_stream_abort(StringInfo s)
 static void
 apply_handle_stream_commit(StringInfo s)
 {
-	int			fd;
 	TransactionId xid;
-	StringInfoData s2;
-	int			nchanges;
-
-	char		path[MAXPGPATH];
-	char	   *buffer = NULL;
+	WorkerState *entry;
 	LogicalRepCommitData commit_data;
 
-	MemoryContext oldcxt;
-
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	/* open the spool file for the committed transaction */
-	changes_filename(path, MyLogicalRepWorker->subid, xid);
-
-	elog(DEBUG1, "replaying changes from file '%s'", path);
-
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
+	if (isLogicalApplyWorker)
 	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-	}
-
-	/* XXX Should this be allocated in another memory context? */
+		// logicalrep_read_stream_commit(s, &commit_data);
 
-	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
-
-	buffer = palloc(8192);
-	initStringInfo(&s2);
-
-	MemoryContextSwitchTo(oldcxt);
-
-	ensure_transaction();
-
-	/*
-	 * Make sure the handle apply_dispatch methods are aware we're in a remote
-	 * transaction.
-	 */
-	in_remote_transaction = true;
-	pgstat_report_activity(STATE_RUNNING, NULL);
-
-	/*
-	 * Read the entries one by one and pass them through the same logic as in
-	 * apply_dispatch.
-	 */
-	nchanges = 0;
-	while (true)
+		CommitTransactionCommand();
+	}
+	else
 	{
-		int			nbytes;
-		int			len;
-
-		/* read length of the on-disk record */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		nbytes = read(fd, &len, sizeof(len));
-		pgstat_report_wait_end();
-
-		/* have we reached end of the file? */
-		if (nbytes == 0)
-			break;
-
-		/* do we have a correct length? */
-		if (nbytes != sizeof(len))
-		{
-			int			save_errno = errno;
-
-			CloseTransientFile(fd);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file: %m")));
-			return;
-		}
-
-		Assert(len > 0);
+		char action = 'F';
 
-		/* make sure we have sufficiently large buffer */
-		buffer = repalloc(buffer, len);
-
-		/* and finally read the data into the buffer */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		if (read(fd, buffer, len) != len)
-		{
-			int			save_errno = errno;
-
-			CloseTransientFile(fd);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file: %m")));
-			return;
-		}
-		pgstat_report_wait_end();
+		Assert(!in_streamed_transaction);
 
-		/* copy the buffer to the stringinfo and call apply_dispatch */
-		resetStringInfo(&s2);
-		appendBinaryStringInfo(&s2, buffer, len);
+		xid = pq_getmsgint(s, 4);
+		logicalrep_read_stream_commit(s, &commit_data);
 
-		/* Ensure we are reading the data into our memory context. */
-		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+		elog(DEBUG1, "received commit for streamed transaction %u", xid);
 
-		apply_dispatch(&s2);
+		/* Find worker for requested xid */
+		entry = find_or_start_worker(xid, false);
 
-		MemoryContextReset(ApplyMessageContext);
+		/* Send commit message */
+		shm_mq_send(entry->mq_handle, s->len, s->data, false);
 
-		MemoryContextSwitchTo(oldcxt);
+		/* Notify worker, that we are done with this xact */
+		shm_mq_send(entry->mq_handle, 1, &action, false);
 
-		nchanges++;
+		wait_for_worker_to_finish(entry);
 
-		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
-				 nchanges, path);
+		elog(LOG, "adding finished apply worker #%u for xid %u to the idle list",
+											entry->pstate->n, entry->xid);
+		ApplyWorkersIdleList[nfreeworkers++] = entry;
 
 		/*
-		 * send feedback to upstream
-		 *
-		 * XXX Probably should send a valid LSN. But which one?
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
 		 */
-		send_feedback(InvalidXLogRecPtr, false, false);
-	}
-
-	CloseTransientFile(fd);
-
-	/*
-	 * Update origin state so we can restart streaming from correct
-	 * position in case of crash.
-	 */
-	replorigin_session_origin_lsn = commit_data.end_lsn;
-	replorigin_session_origin_timestamp = commit_data.committime;
-
-	CommitTransactionCommand();
-	pgstat_report_stat(false);
-
-	store_flush_position(commit_data.end_lsn);
-
-	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
-		 nchanges, path);
+		replorigin_session_origin_lsn = commit_data.end_lsn;
+		replorigin_session_origin_timestamp = commit_data.committime;
 
-	in_remote_transaction = false;
-	pgstat_report_activity(STATE_IDLE, NULL);
+		pgstat_report_stat(false);
 
-	/* unlink the files with serialized changes and subxact info */
-	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+		store_flush_position(commit_data.end_lsn);
 
-	pfree(buffer);
-	pfree(s2.data);
+		in_remote_transaction = false;
+		pgstat_report_activity(STATE_IDLE, NULL);
+	}
 }
 
 /*
@@ -965,6 +971,8 @@ apply_handle_relation(StringInfo s)
 	if (handle_streamed_transaction('R', s))
 		return;
 
+	// iter_sleep(3600);
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -1407,6 +1415,38 @@ apply_dispatch(StringInfo s)
 {
 	char		action = pq_getmsgbyte(s);
 
+	if (isLogicalApplyWorker)
+	{
+		/*
+		 * Inside logical apply worker we can figure out that new subtransaction
+		 * Inside a logical apply worker we can figure out that a new
+		 * subtransaction was started when a change arrives with a different
+		 * xid. In that case we can define a named savepoint, so that we are
+		 * able to commit/rollback it separately later.
+		 *
+		 * A special case is when the first change comes from a subtransaction; then
+		 */
+		current_xid = pq_getmsgint(s, 4);
+
+		if (current_xid != stream_xid
+			&& ((TransactionIdIsValid(prev_xid) && current_xid != prev_xid)
+				|| !TransactionIdIsValid(prev_xid)))
+		{
+			char *spname = (char *) palloc(64 * sizeof(char));
+			sprintf(spname, "savepoint_for_xid_%u", current_xid);
+
+			elog(LOG, "[Apply BGW #%u] defining savepoint %s", MyParallelState->n, spname);
+
+			DefineSavepoint(spname);
+			CommitTransactionCommand();
+			// BeginInternalSubTransaction(NULL);
+		}
+
+		prev_xid = current_xid;
+	}
+	// else
+	// 	elog(LOG, "Logical worker: applying dispatch for action=%s", (char *) &action);
+
 	switch (action)
 	{
 			/* BEGIN */
@@ -1435,6 +1475,7 @@ apply_dispatch(StringInfo s)
 			break;
 			/* RELATION */
 		case 'R':
+			// elog(LOG, "%s worker: applying dispatch for action=R", isLogicalApplyWorker ? "Apply" : "Logical");
 			apply_handle_relation(s);
 			break;
 			/* TYPE */
@@ -1565,12 +1606,18 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 static void
 worker_onexit(int code, Datum arg)
 {
-	int	i;
+	HASH_SEQ_STATUS status;
+	WorkerState *entry;
 
-	elog(LOG, "cleanup files for %d transactions", nxids);
-
-	for (i = nxids-1; i >= 0; i--)
-		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+	if (ApplyWorkersHash != NULL)
+	{
+		hash_seq_init(&status, ApplyWorkersHash);
+		while ((entry = (WorkerState *) hash_seq_search(&status)) != NULL)
+		{
+			stop_worker(entry);
+		}
+		hash_seq_term(&status);
+	}
 }
 
 /*
@@ -1593,6 +1640,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
+	ApplyWorkersIdleList = palloc(sizeof(WorkerState *) * pool_size);
+
 	for (;;)
 	{
 		pgsocket	fd = PGINVALID_SOCKET;
@@ -1904,8 +1953,9 @@ maybe_reread_subscription(void)
 	Subscription *newsub;
 	bool		started_tx = false;
 
+	// TODO Probably we have to handle subscription reread in apply workers too.
 	/* When cache state is valid there is nothing to do here. */
-	if (MySubscriptionValid)
+	if (MySubscriptionValid || isLogicalApplyWorker)
 		return;
 
 	/* This function might be called inside or outside of transaction. */
@@ -2039,608 +2089,50 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
-/*
- * subxact_info_write
- *	  Store information about subxacts for a toplevel transaction.
- *
- * For each subxact we store offset of it's first change in the main file.
- * The file is always over-written as a whole, and we also include CRC32C
- * checksum of the information.
- *
- * XXX We should only store subxacts that were not aborted yet.
- *
- * XXX Maybe we should only include the checksum when the cluster is
- * initialized with checksums?
- *
- * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
- */
+/* SIGHUP: set flag to reload configuration at next convenient time */
 static void
-subxact_info_write(Oid subid, TransactionId xid)
+logicalrep_worker_sighup(SIGNAL_ARGS)
 {
-	int			fd;
-	char		path[MAXPGPATH];
-	uint32		checksum;
-	Size		len;
-
-	Assert(TransactionIdIsValid(xid));
-
-	subxact_filename(path, subid, xid);
-
-	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	len = sizeof(SubXactInfo) * nsubxacts;
-
-	/* compute the checksum */
-	INIT_CRC32C(checksum);
-	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
-	COMP_CRC32C(checksum, (char *) subxacts, len);
-	FIN_CRC32C(checksum);
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
-
-	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
-	{
-		int			save_errno = errno;
+	int			save_errno = errno;
 
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
-		return;
-	}
+	got_SIGHUP = true;
 
-	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
-	{
-		int			save_errno = errno;
+	/* Waken anything waiting on the process latch */
+	SetLatch(MyLatch);
 
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
-		return;
-	}
+	errno = save_errno;
+}
 
-	if ((len > 0) && (write(fd, subxacts, len) != len))
-	{
-		int			save_errno = errno;
+/* Logical Replication Apply worker entry point */
+void
+ApplyWorkerMain(Datum main_arg)
+{
+	int			worker_slot = DatumGetInt32(main_arg);
+	MemoryContext oldctx;
+	char		originname[NAMEDATALEN];
+	XLogRecPtr	origin_startpos;
+	char	   *myslotname;
+	WalRcvStreamOptions options;
 
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
-		return;
-	}
+	/* Attach to slot */
+	logicalrep_worker_attach(worker_slot);
 
-	pgstat_report_wait_end();
+	/* Setup signal handling */
+	pqsignal(SIGHUP, logicalrep_worker_sighup);
+	pqsignal(SIGTERM, die);
+	BackgroundWorkerUnblockSignals();
 
 	/*
-	 * We don't need to fsync or anything, as we'll recreate the files after a
-	 * crash from scratch. So just close the file.
+	 * We don't currently need any ResourceOwner in a walreceiver process, but
+	 * if we did, we could call CreateAuxProcessResourceOwner here.
 	 */
-	CloseTransientFile(fd);
 
-	/*
-	 * But we free the memory allocated for subxact info. There might be one
-	 * exceptional transaction with many subxacts, and we don't want to keep
-	 * the memory allocated forewer.
-	 *
-	 */
-	if (subxacts)
-		pfree(subxacts);
+	/* Initialise stats to a sanish value */
+	MyLogicalRepWorker->last_send_time = MyLogicalRepWorker->last_recv_time =
+		MyLogicalRepWorker->reply_time = GetCurrentTimestamp();
 
-	subxacts = NULL;
-	subxact_last = InvalidTransactionId;
-	nsubxacts = 0;
-	nsubxacts_max = 0;
-}
-
-/*
- * subxact_info_read
- *	  Restore information about subxacts of a streamed transaction.
- *
- * Read information about subxacts into the global variables, and while
- * reading the information verify the checksum.
- *
- * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
- *
- * XXX Do we need to allocate it in TopMemoryContext?
- */
-static void
-subxact_info_read(Oid subid, TransactionId xid)
-{
-	int			fd;
-	char		path[MAXPGPATH];
-	uint32		checksum;
-	uint32		checksum_new;
-	Size		len;
-	MemoryContext oldctx;
-
-	Assert(TransactionIdIsValid(xid));
-	Assert(!subxacts);
-	Assert(nsubxacts == 0);
-	Assert(nsubxacts_max == 0);
-
-	subxact_filename(path, subid, xid);
-
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
-
-	/* read the checksum */
-	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	/* read number of subxact items */
-	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	pgstat_report_wait_end();
-
-	len = sizeof(SubXactInfo) * nsubxacts;
-
-	/* we keep the maximum as a power of 2 */
-	nsubxacts_max = 1 << my_log2(nsubxacts);
-
-	/* subxacts are long-lived */
-	oldctx = MemoryContextSwitchTo(TopMemoryContext);
-	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
-	MemoryContextSwitchTo(oldctx);
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
-
-	if ((len > 0) && ((read(fd, subxacts, len)) != len))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	pgstat_report_wait_end();
-
-	/* recompute the checksum */
-	INIT_CRC32C(checksum_new);
-	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
-	COMP_CRC32C(checksum_new, (char *) subxacts, len);
-	FIN_CRC32C(checksum_new);
-
-	if (checksum_new != checksum)
-		ereport(ERROR,
-				(errmsg("checksum failure when reading subxacts")));
-
-	CloseTransientFile(fd);
-}
-
-/*
- * subxact_info_add
- *	  Add information about a subxact (offset in the main file).
- *
- * XXX Do we need to allocate it in TopMemoryContext?
- */
-static void
-subxact_info_add(TransactionId xid)
-{
-	int64		i;
-
-	/*
-	 * If the XID matches the toplevel transaction, we don't want to add it.
-	 */
-	if (stream_xid == xid)
-		return;
-
-	/*
-	 * In most cases we're checking the same subxact as we've already seen in
-	 * the last call, so make ure just ignore it (this change comes later).
-	 */
-	if (subxact_last == xid)
-		return;
-
-	/* OK, remember we're processing this XID. */
-	subxact_last = xid;
-
-	/*
-	 * Check if the transaction is already present in the array of subxact. We
-	 * intentionally scan the array from the tail, because we're likely adding
-	 * a change for the most recent subtransactions.
-	 *
-	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
-	 * would allow us to use binary search here.
-	 */
-	for (i = nsubxacts; i > 0; i--)
-	{
-		/* found, so we're done */
-		if (subxacts[i - 1].xid == xid)
-			return;
-	}
-
-	/* This is a new subxact, so we need to add it to the array. */
-
-	if (nsubxacts == 0)
-	{
-		MemoryContext oldctx;
-
-		nsubxacts_max = 128;
-		oldctx = MemoryContextSwitchTo(TopMemoryContext);
-		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
-		MemoryContextSwitchTo(oldctx);
-	}
-	else if (nsubxacts == nsubxacts_max)
-	{
-		nsubxacts_max *= 2;
-		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
-	}
-
-	subxacts[nsubxacts].xid = xid;
-	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
-
-	nsubxacts++;
-}
-
-/* format filename for file containing the info about subxacts */
-static void
-subxact_filename(char *path, Oid subid, TransactionId xid)
-{
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 *
-	 * Don't check for error from mkdir; it could fail if the directory
-	 * already exists (maybe someone else just did the same thing).  If
-	 * it doesn't work then we'll bomb out when opening the file
-	 */
-	mkdir(tempdirpath, S_IRWXU);
-
-	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
-			 tempdirpath, subid, xid);
-}
-
-/* format filename for file containing serialized changes */
-static void
-changes_filename(char *path, Oid subid, TransactionId xid)
-{
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 *
-	 * Don't check for error from mkdir; it could fail if the directory
-	 * already exists (maybe someone else just did the same thing).  If
-	 * it doesn't work then we'll bomb out when opening the file
-	 */
-	mkdir(tempdirpath, S_IRWXU);
-
-	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
-			 tempdirpath, subid, xid);
-}
-
-/*
- * stream_cleanup_files
- *	  Cleanup files for a subscription / toplevel transaction.
- *
- * Remove files with serialized changes and subxact info for a particular
- * toplevel transaction. Each subscription has a separate set of files.
- *
- * Note: The files may not exists, so handle ENOENT as non-error.
- *
- * TODO: Add missing_ok flag to specify in which cases it's OK not to
- * find the files, and when it's an error.
- */
-static void
-stream_cleanup_files(Oid subid, TransactionId xid)
-{
-	int			i;
-	char		path[MAXPGPATH];
-	bool		found = false;
-
-	subxact_filename(path, subid, xid);
-
-	if ((unlink(path) < 0) && (errno != ENOENT))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
-
-	changes_filename(path, subid, xid);
-
-	if ((unlink(path) < 0) && (errno != ENOENT))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
-
-	/*
-	 * Cleanup the XID from the array - find the XID in the array and
-	 * remove it by shifting all the remaining elements. The array is
-	 * bound to be fairly small (maximum number of in-progress xacts,
-	 * so max_connections + max_prepared_transactions) so simply loop
-	 * through the array and find index of the XID. Then move the rest
-	 * of the array by one element to the left.
-	 *
-	 * Notice we also call this from stream_open_file for first segment
-	 * of each transaction, to deal with possible left-overs after a
-	 * crash, so it's entirely possible not to find the XID in the
-	 * array here. In that case we don't remove anything.
-	 *
-	 * XXX Perhaps it'd be better to handle this automatically after a
-	 * restart, instead of doing it over and over for each transaction.
-	 */
-	for (i = 0; i < nxids; i++)
-	{
-		if (xids[i] == xid)
-		{
-			found = true;
-			break;
-		}
-	}
-
-	if (!found)
-		return;
-
-	/*
-	 * Move the last entry from the array to the place. We don't keep
-	 * the streamed transactions sorted or anything - we only expect 
-	 * a few of them in progress (max_connections + max_prepared_xacts)
-	 * so linear search is just fine.
-	 */
-	xids[i] = xids[nxids-1];
-	nxids--;
-}
-
-/*
- * stream_open_file
- *	  Open file we'll use to serialize changes for a toplevel transaction.
- *
- * Open a file for streamed changes from a toplevel transaction identified
- * by stream_xid (global variable). If it's the first chunk of streamed
- * changes for this transaction, perform cleanup by removing existing
- * files after a possible previous crash.
- *
- * This can only be called at the beginning of a "streaming" block, i.e.
- * between stream_start/stream_stop messages from the upstream.
- */
-static void
-stream_open_file(Oid subid, TransactionId xid, bool first_segment)
-{
-	char		path[MAXPGPATH];
-	int			flags;
-
-	Assert(in_streamed_transaction);
-	Assert(OidIsValid(subid));
-	Assert(TransactionIdIsValid(xid));
-	Assert(stream_fd == -1);
-
-	/*
-	 * If this is the first segment for this transaction, try removing
-	 * existing files (if there are any, possibly after a crash).
-	 */
-	if (first_segment)
-	{
-		MemoryContext	oldcxt;
-
-		/* XXX make sure there are no previous files for this transaction */
-		stream_cleanup_files(subid, xid);
-
-		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
-
-		/*
-		 * We need to remember the XIDs we spilled to files, so that we can
-		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
-		 *
-		 * The number of XIDs we may need to track is fairly small, because
-		 * we can only stream toplevel xacts (so limited by max_connections
-		 * and max_prepared_transactions), and we only stream the large ones.
-		 * So we simply keep the XIDs in an unsorted array. If the number of
-		 * xacts gets large for some reason (e.g. very high max_connections),
-		 * a more elaborate approach might be better - e.g. sorted array, to
-		 * speed-up the lookups.
-		 */
-		if (nxids == maxnxids)	/* array of XIDs is full */
-		{
-			if (!xids)
-			{
-				maxnxids = 64;
-				xids = palloc(maxnxids * sizeof(TransactionId));
-			}
-			else
-			{
-				maxnxids = 2 * maxnxids;
-				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
-			}
-		}
-
-		xids[nxids++] = xid;
-
-		MemoryContextSwitchTo(oldcxt);
-	}
-
-	changes_filename(path, subid, xid);
-
-	elog(DEBUG1, "opening file '%s' for streamed changes", path);
-
-	/*
-	 * If this is the first streamed segment, the file must not exist, so
-	 * make sure we're the ones creating it. Otherwise just open the file
-	 * for writing, in append mode.
-	 */
-	if (first_segment)
-		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
-	else
-		flags = (O_WRONLY | O_APPEND | PG_BINARY);
-
-	stream_fd = OpenTransientFile(path, flags);
-
-	if (stream_fd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-}
-
-/*
- * stream_close_file
- *	  Close the currently open file with streamed changes.
- *
- * This can only be called at the beginning of a "streaming" block, i.e.
- * between stream_start/stream_stop messages from the upstream.
- */
-static void
-stream_close_file(void)
-{
-	Assert(in_streamed_transaction);
-	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
-
-	CloseTransientFile(stream_fd);
-
-	stream_xid = InvalidTransactionId;
-	stream_fd = -1;
-}
-
-/*
- * stream_write_change
- *	  Serialize a change to a file for the current toplevel transaction.
- *
- * The change is serialied in a simple format, with length (not including
- * the length), action code (identifying the message type) and message
- * contents (without the subxact TransactionId value).
- *
- * XXX The subxact file includes CRC32C of the contents. Maybe we should
- * include something like that here too, but doing so will not be as
- * straighforward, because we write the file in chunks.
- */
-static void
-stream_write_change(char action, StringInfo s)
-{
-	int			len;
-
-	Assert(in_streamed_transaction);
-	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
-
-	/* total on-disk size, including the action type character */
-	len = (s->len - s->cursor) + sizeof(char);
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
-
-	/* first write the size */
-	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
-
-	/* then the action */
-	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
-
-	/* and finally the remaining part of the buffer (after the XID) */
-	len = (s->len - s->cursor);
-
-	if (write(stream_fd, &s->data[s->cursor], len) != len)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
-
-	pgstat_report_wait_end();
-}
-
-/* SIGHUP: set flag to reload configuration at next convenient time */
-static void
-logicalrep_worker_sighup(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	got_SIGHUP = true;
-
-	/* Waken anything waiting on the process latch */
-	SetLatch(MyLatch);
-
-	errno = save_errno;
-}
-
-/* Logical Replication Apply worker entry point */
-void
-ApplyWorkerMain(Datum main_arg)
-{
-	int			worker_slot = DatumGetInt32(main_arg);
-	MemoryContext oldctx;
-	char		originname[NAMEDATALEN];
-	XLogRecPtr	origin_startpos;
-	char	   *myslotname;
-	WalRcvStreamOptions options;
-
-	/* Attach to slot */
-	logicalrep_worker_attach(worker_slot);
-
-	/* Setup signal handling */
-	pqsignal(SIGHUP, logicalrep_worker_sighup);
-	pqsignal(SIGTERM, die);
-	BackgroundWorkerUnblockSignals();
-
-	/*
-	 * We don't currently need any ResourceOwner in a walreceiver process, but
-	 * if we did, we could call CreateAuxProcessResourceOwner here.
-	 */
-
-	/* Initialise stats to a sanish value */
-	MyLogicalRepWorker->last_send_time = MyLogicalRepWorker->last_recv_time =
-		MyLogicalRepWorker->reply_time = GetCurrentTimestamp();
-
-	/* Load the libpq-specific functions */
-	load_file("libpqwalreceiver", false);
+	/* Load the libpq-specific functions */
+	load_file("libpqwalreceiver", false);
 
 	/* Run as replica session replication role. */
 	SetConfigOption("session_replication_role", "replica",
@@ -2798,3 +2290,580 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Apply Background Worker main loop.
+ */
+void
+LogicalApplyBgwMain(Datum main_arg)
+{
+	volatile ParallelState *pst;
+
+	dsm_segment			*seg;
+	shm_toc				*toc;
+	PGPROC				*registrant;
+	shm_mq				*mq;
+	shm_mq_handle		*mqh;
+	shm_mq_result		 shmq_res;
+	// ConditionVariable	 cv;
+	LogicalRepWorker	 lrw;
+	MemoryContext		 oldcontext;
+
+	MemoryContextSwitchTo(TopMemoryContext);
+
+	/* Load the subscription into persistent memory context. */
+	ApplyContext = AllocSetContextCreate(TopMemoryContext,
+										 "ApplyContext",
+										 ALLOCSET_DEFAULT_SIZES);
+
+	oldcontext = MemoryContextSwitchTo(ApplyContext);
+
+	/*
+	 * Init the ApplyMessageContext which we clean up after each replication
+	 * protocol message.
+	 */
+	ApplyMessageContext = AllocSetContextCreate(ApplyContext,
+												"ApplyMessageContext",
+												ALLOCSET_DEFAULT_SIZES);
+
+	isLogicalApplyWorker = true;
+
+	/*
+	 * Establish signal handlers.
+	 *
+	 * We want CHECK_FOR_INTERRUPTS() to kill off this worker process just as
+	 * it would a normal user backend.  To make that happen, we establish a
+	 * signal handler that is a stripped-down version of die().
+	 */
+	pqsignal(SIGTERM, handle_sigterm);
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Connect to the dynamic shared memory segment.
+	 *
+	 * The backend that registered this worker passed us the ID of a shared
+	 * memory segment to which we must attach for further instructions.  In
+	 * order to attach to dynamic shared memory, we need a resource owner.
+	 * Once we've mapped the segment in our address space, attach to the table
+	 * of contents so we can locate the various data structures we'll need to
+	 * find within the segment.
+	 */
+	CurrentResourceOwner = ResourceOwnerCreate(NULL, "Logical apply worker");
+	seg = dsm_attach(DatumGetInt32(main_arg));
+	if (seg == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to map dynamic shared memory segment")));
+	toc = shm_toc_attach(PG_LOGICAL_APPLY_SHM_MAGIC, dsm_segment_address(seg));
+	if (toc == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("bad magic number in dynamic shared memory segment")));
+
+	/*
+	 * Acquire a worker number.
+	 *
+	 * By convention, the process registering this background worker should
+	 * have stored the control structure at key 0.  We look up that key to
+	 * find it.  Our worker number gives our identity: there may be just one
+	 * worker involved in this parallel operation, or there may be many.
+	 */
+	pst = shm_toc_lookup(toc, 0, false);
+	MyParallelState = pst;
+
+	SpinLockAcquire(&pst->mutex);
+	pst->attached = true;
+	SpinLockRelease(&pst->mutex);
+
+	/*
+	 * Attach to the message queue.
+	 */
+	mq = shm_toc_lookup(toc, 1, false);
+	shm_mq_set_receiver(mq, MyProc);
+	mqh = shm_mq_attach(mq, seg, NULL);
+
+	/* Restore database connection. */
+	BackgroundWorkerInitializeConnectionByOid(pst->database_id,
+											  pst->authenticated_user_id, 0);
+
+	/*
+	 * Set the client encoding to the database encoding, since that is what
+	 * the leader will expect.
+	 */
+	SetClientEncoding(GetDatabaseEncoding());
+
+	lrw.subid = pst->subid;
+	MyLogicalRepWorker = &lrw;
+
+	stream_xid = pst->stream_xid;
+
+	StartTransactionCommand();
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	// PushActiveSnapshot(GetTransactionSnapshot());
+
+	MySubscription = GetSubscription(MyLogicalRepWorker->subid, true);
+
+	/*
+	 * Indicate that we're fully initialized and ready to begin the main part
+	 * of the parallel operation.
+	 *
+	 * Once we signal that we're ready, the user backend is entitled to assume
+	 * that our on_dsm_detach callbacks will fire before we disconnect from
+	 * the shared memory segment and exit.  Generally, that means we must have
+	 * attached to all relevant dynamic shared memory data structures by now.
+	 */
+	SpinLockAcquire(&pst->mutex);
+	pst->ready = true;
+	// cv = pst->cv;
+	// if (pst->workers_ready == pst->workers_total)
+	// {
+	//	 registrant = BackendPidGetProc(MyBgworkerEntry->bgw_notify_pid);
+	//	 if (registrant == NULL)
+	//	 {
+	//		 elog(DEBUG1, "registrant backend has exited prematurely");
+	//		 proc_exit(1);
+	//	 }
+	//	 SetLatch(&registrant->procLatch);
+	// }
+	SpinLockRelease(&pst->mutex);
+	elog(LOG, "[Apply BGW #%u] started", pst->n);
+
+	registrant = BackendPidGetProc(MyBgworkerEntry->bgw_notify_pid);
+	SetLatch(&registrant->procLatch);
+
+	for (;;)
+	{
+		void *data;
+		Size  len;
+		StringInfoData s;
+		MemoryContext	oldctx;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx = MemoryContextSwitchTo(ApplyMessageContext);
+
+		shmq_res = shm_mq_receive(mqh, &len, &data, false);
+
+		if (shmq_res != SHM_MQ_SUCCESS)
+			break;
+
+		if (len == 0)
+		{
+			elog(LOG, "[Apply BGW #%u] got zero-length message, stopping", pst->n);
+			break;
+		}
+		else
+		{
+			s.cursor = 0;
+			s.maxlen = -1;
+			s.data = (char *) data;
+			s.len = len;
+
+			/*
+			 * We use first byte of message for additional communication between
+			 * We use the first byte of the message for additional communication
+			 * between the main logical replication worker and the apply
+			 * BGWorkers, so if it differs from 'w', then process it first.
+			switch (pq_getmsgbyte(&s))
+			{
+				/* Stream stop */
+				case 'E':
+				{
+					in_remote_transaction = false;
+
+					SpinLockAcquire(&pst->mutex);
+					pst->ready = true;
+					SpinLockRelease(&pst->mutex);
+					SetLatch(&registrant->procLatch);
+
+					elog(LOG, "[Apply BGW #%u] ended processing streaming chunk, waiting on shm_mq_receive", pst->n);
+
+					continue;
+				}
+				/* Reassign to the new transaction */
+				case 'R':
+				{
+					elog(LOG, "[Apply BGW #%u] switching from processing xid %u to xid %u",
+											pst->n, stream_xid, pst->stream_xid);
+					stream_xid = pst->stream_xid;
+
+					StartTransactionCommand();
+					BeginTransactionBlock();
+					CommitTransactionCommand();
+					StartTransactionCommand();
+
+					MySubscription = GetSubscription(MyLogicalRepWorker->subid, true);
+
+					continue;
+				}
+				/* Finished processing xact */
+				case 'F':
+				{
+					elog(LOG, "[Apply BGW #%u] finished processing xact %u", pst->n, stream_xid);
+
+					MemoryContextSwitchTo(ApplyContext);
+
+					CommitTransactionCommand();
+					EndTransactionBlock();
+					CommitTransactionCommand();
+
+					SpinLockAcquire(&pst->mutex);
+					pst->finished = true;
+					SpinLockRelease(&pst->mutex);
+
+					continue;
+				}
+				default:
+					break;
+			}
+
+			pq_getmsgint64(&s); // Read LSN info
+			pq_getmsgint64(&s); // TODO Do we need to process it here again somehow?
+			pq_getmsgint64(&s);
+
+			/*
+			 * Make sure the handle apply_dispatch methods are aware we're in a remote
+			 * transaction.
+			 */
+			in_remote_transaction = true;
+			pgstat_report_activity(STATE_RUNNING, NULL);
+
+			elog(DEBUG5, "[Apply BGW #%u] applying dispatch for action=%s",
+									pst->n, (char *) &s.data[s.cursor]);
+			apply_dispatch(&s);
+		}
+
+		MemoryContextSwitchTo(oldctx);
+		MemoryContextReset(ApplyMessageContext);
+	}
+
+	CommitTransactionCommand();
+	EndTransactionBlock();
+	CommitTransactionCommand();
+
+	MemoryContextSwitchTo(oldcontext);
+	MemoryContextReset(ApplyContext);
+
+	SpinLockAcquire(&pst->mutex);
+	pst->finished = true;
+	// if (pst->workers_finished == pst->workers_total)
+	// {
+	//	 registrant = BackendPidGetProc(MyBgworkerEntry->bgw_notify_pid);
+	//	 if (registrant == NULL)
+	//	 {
+	//		 elog(DEBUG1, "registrant backend has exited prematurely");
+	//		 proc_exit(1);
+	//	 }
+	//	 SetLatch(&registrant->procLatch);
+	// }
+	SpinLockRelease(&pst->mutex);
+
+	elog(LOG, "[Apply BGW #%u] exiting", pst->n);
+
+	/* Signal main process that we are done. */
+	// ConditionVariableBroadcast(&cv);
+	SetLatch(&registrant->procLatch);
+
+	/*
+	 * We're done.  Explicitly detach the shared memory segment so that we
+	 * don't get a resource leak warning at commit time.  This will fire any
+	 * on_dsm_detach callbacks we've registered, as well.  Once that's done,
+	 * we can go ahead and exit.
+	 */
+	dsm_detach(seg);
+	proc_exit(0);
+}
+
+/*
+ * When we receive a SIGTERM, we set InterruptPending and ProcDiePending just
+ * like a normal backend.  The next CHECK_FOR_INTERRUPTS() will do the right
+ * thing.
+ */
+static void
+handle_sigterm(SIGNAL_ARGS)
+{
+	int save_errno = errno;
+
+	SetLatch(MyLatch);
+
+	if (!proc_exit_inprogress)
+	{
+		InterruptPending = true;
+		ProcDiePending = true;
+	}
+
+	errno = save_errno;
+}
+
+/*
+ * Set up a dynamic shared memory segment.
+ *
+ * We set up a control region that contains a ParallelState, plus one
+ * region holding the message queue used to feed the worker. Each worker
+ * gets its own segment with its own queue.
+ */
+static void
+setup_dsm(WorkerState *wstate)
+{
+	shm_toc_estimator	 e;
+	int					 toc_key = 0;
+	Size				 segsize;
+	dsm_segment			*seg;
+	shm_toc				*toc;
+	ParallelState		*pst;
+	shm_mq				*mq;
+	int64				 queue_size = 160000000; /* ~160 MB for now */
+
+	/* Ensure a valid queue size. */
+	if (queue_size < 0 || ((uint64) queue_size) < shm_mq_minimum_size)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("queue size must be at least %zu bytes",
+						shm_mq_minimum_size)));
+	if (queue_size != ((Size) queue_size))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("queue size overflows size_t")));
+
+	/*
+	 * Estimate how much shared memory we need.
+	 *
+	 * Because the TOC machinery may choose to insert padding of oddly-sized
+	 * requests, we must estimate each chunk separately.
+	 *
+	 * We need one key to register the location of the header, and one more
+	 * key to track the location of the message queue.
+	 */
+	shm_toc_initialize_estimator(&e);
+	shm_toc_estimate_chunk(&e, sizeof(ParallelState));
+	shm_toc_estimate_chunk(&e, (Size) queue_size);
+
+	shm_toc_estimate_keys(&e, 1 + 1);
+	segsize = shm_toc_estimate(&e);
+
+	/* Create the shared memory segment and establish a table of contents. */
+	seg = dsm_create(shm_toc_estimate(&e), 0);
+	toc = shm_toc_create(PG_LOGICAL_APPLY_SHM_MAGIC, dsm_segment_address(seg),
+						 segsize);
+
+	/* Set up the header region. */
+	pst = shm_toc_allocate(toc, sizeof(ParallelState));
+	SpinLockInit(&pst->mutex);
+	pst->attached = false;
+	pst->ready = false;
+	pst->finished = false;
+	pst->database_id = MyDatabaseId;
+	pst->subid = MyLogicalRepWorker->subid;
+	pst->stream_xid = stream_xid;
+	pst->authenticated_user_id = GetAuthenticatedUserId();
+	pst->n = nworkers + 1;
+	// ConditionVariableInit(&pst->cv);
+
+	shm_toc_insert(toc, toc_key++, pst);
+
+	/* Set up the single message queue used to feed this worker. */
+	mq = shm_mq_create(shm_toc_allocate(toc, (Size) queue_size),
+						(Size) queue_size);
+	shm_toc_insert(toc, toc_key++, mq);
+	shm_mq_set_sender(mq, MyProc);
+
+	/* Attach the queues. */
+	wstate->mq_handle = shm_mq_attach(mq, seg, wstate->handle);
+
+	/* Return results to caller. */
+	wstate->dsm_seg = seg;
+	wstate->pstate = pst;
+}
+
+/*
+ * Register background workers.
+ */
+static void
+setup_background_worker(WorkerState *wstate)
+{
+	MemoryContext		oldcontext;
+	BackgroundWorker	worker;
+
+	elog(LOG, "setting up apply worker #%u", nworkers + 1);
+
+	/*
+	 * TOCHECK: We need the worker_state object and the background worker handles to
+	 * which it points to be allocated in TopMemoryContext rather than
+	 * ApplyMessageContext; otherwise, they'll be destroyed before the on_dsm_detach
+	 * hooks run.
+	 */
+	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+	setup_dsm(wstate);
+
+	/*
+	 * Arrange to kill all the workers if we abort before all workers are
+	 * finished hooking themselves up to the dynamic shared memory segment.
+	 *
+	 * If we die after all the workers have finished hooking themselves up to
+	 * the dynamic shared memory segment, we'll mark the two queues to which
+	 * we're directly connected as detached, and the worker(s) connected to
+	 * those queues will exit, marking any other queues to which they are
+	 * connected as detached.  This will cause any as-yet-unaware workers
+	 * connected to those queues to exit in their turn, and so on, until
+	 * everybody exits.
+	 *
+	 * But suppose the workers which are supposed to connect to the queues to
+	 * which we're directly attached exit due to some error before they
+	 * actually attach the queues.  The remaining workers will have no way of
+	 * knowing this.  From their perspective, they're still waiting for those
+	 * workers to start, when in fact they've already died.
+	 */
+	on_dsm_detach(wstate->dsm_seg, cleanup_background_worker,
+				  PointerGetDatum(wstate));
+
+	/* Configure a worker. */
+	MemSet(&worker, 0, sizeof(BackgroundWorker));
+
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+		BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_ConsistentState;
+	worker.bgw_restart_time = BGW_NEVER_RESTART;
+	worker.bgw_notify_pid = MyProcPid;
+	sprintf(worker.bgw_library_name, "postgres");
+	sprintf(worker.bgw_function_name, "LogicalApplyBgwMain");
+
+	worker.bgw_main_arg = UInt32GetDatum(dsm_segment_handle(wstate->dsm_seg));
+
+	/* Register the workers. */
+	snprintf(worker.bgw_name, BGW_MAXLEN,
+			"logical replication apply worker #%u for subscription %u",
+										nworkers + 1, MySubscription->oid);
+	if (!RegisterDynamicBackgroundWorker(&worker, &wstate->handle))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+					errmsg("could not register background process"),
+					errhint("You may need to increase max_worker_processes.")));
+
+	/* All done. */
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Wait for worker to become ready. */
+	wait_for_worker(wstate);
+
+	/*
+	 * Once we reach this point, all workers are ready.  We no longer need to
+	 * kill them if we die; they'll die on their own as the message queues
+	 * shut down.
+	 */
+	cancel_on_dsm_detach(wstate->dsm_seg, cleanup_background_worker,
+						 PointerGetDatum(wstate));
+
+	nworkers += 1;
+}
+
+static void
+cleanup_background_worker(dsm_segment *seg, Datum arg)
+{
+	WorkerState *wstate = (WorkerState *) DatumGetPointer(arg);
+
+	TerminateBackgroundWorker(wstate->handle);
+}
+
+static void
+wait_for_worker(WorkerState *wstate)
+{
+	bool result = false;
+
+	for (;;)
+	{
+		// ConditionVariable cv;
+		bool ready;
+
+		/* If the worker is ready, we have succeeded. */
+		SpinLockAcquire(&wstate->pstate->mutex);
+		ready = wstate->pstate->ready;
+		// cv = wstate->pstate->cv;
+		SpinLockRelease(&wstate->pstate->mutex);
+		if (ready)
+		{
+			result = true;
+			break;
+		}
+
+		/* If any workers (or the postmaster) have died, we have failed. */
+		if (!check_worker_status(wstate))
+		{
+			result = false;
+			break;
+		}
+
+		/* Wait for the workers to wake us up. */
+		// ConditionVariableSleep(&cv, WAIT_EVENT_LOGICAL_APPLY_WORKER_READY);
+
+		/* Wait to be signalled. */
+		WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+							WAIT_EVENT_LOGICAL_APPLY_WORKER_READY);
+
+		/* Reset the latch so we don't spin. */
+		ResetLatch(MyLatch);
+
+		/* An interrupt may have occurred while we were waiting. */
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	// ConditionVariableCancelSleep();
+
+	if (!result)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("one or more background workers failed to start")));
+}
+
+static bool
+check_worker_status(WorkerState *wstate)
+{
+	BgwHandleStatus status;
+	pid_t			pid;
+
+	status = GetBackgroundWorkerPid(wstate->handle, &pid);
+	if (status == BGWH_STOPPED || status == BGWH_POSTMASTER_DIED)
+		return false;
+
+	/* Otherwise, things still look OK. */
+	return true;
+}
+
+static void
+wait_for_worker_to_finish(WorkerState *wstate)
+{
+	elog(LOG, "waiting for apply worker #%u to finish processing xid %u",
+										wstate->pstate->n, wstate->xid);
+
+	for (;;)
+	{
+		// ConditionVariable cv;
+		bool finished;
+
+		/* If the worker is finished, we have succeeded. */
+		SpinLockAcquire(&wstate->pstate->mutex);
+		finished = wstate->pstate->finished;
+		// cv = wstate->pstate->cv;
+		SpinLockRelease(&wstate->pstate->mutex);
+		if (finished)
+		{
+			break;
+		}
+
+		/* Wait for the workers to wake us up. */
+		// ConditionVariableSleep(&cv, WAIT_EVENT_LOGICAL_APPLY_WORKER_READY);
+
+		/* Wait to be signalled. */
+		WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+							WAIT_EVENT_LOGICAL_APPLY_WORKER_READY);
+
+		/* Reset the latch so we don't spin. */
+		ResetLatch(MyLatch);
+
+		/* An interrupt may have occurred while we were waiting. */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3a89e23488..7c72db9e83 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -819,6 +819,7 @@ typedef enum
 	WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+	WAIT_EVENT_LOGICAL_APPLY_WORKER_READY,
 	WAIT_EVENT_LOGICAL_SYNC_DATA,
 	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
 	WAIT_EVENT_MQ_INTERNAL,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 802275311d..afb15c2736 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -122,12 +122,10 @@ extern TransactionId logicalrep_read_stream_stop(StringInfo in);
 
 extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn);
-extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+extern void logicalrep_read_stream_commit(StringInfo out,
 					   LogicalRepCommitData *commit_data);
 
 extern void logicalrep_write_stream_abort(StringInfo out,
 							  TransactionId xid, TransactionId subxid);
-extern void logicalrep_read_stream_abort(StringInfo in,
-							 TransactionId *xid, TransactionId *subxid);
 
 #endif							/* LOGICALREP_PROTO_H */
diff --git a/src/include/replication/logicalworker.h b/src/include/replication/logicalworker.h
index e9524aefd9..30ad40247d 100644
--- a/src/include/replication/logicalworker.h
+++ b/src/include/replication/logicalworker.h
@@ -13,6 +13,7 @@
 #define LOGICALWORKER_H
 
 extern void ApplyWorkerMain(Datum main_arg);
+extern void LogicalApplyBgwMain(Datum main_arg);
 
 extern bool IsLogicalWorker(void);
 
-- 
2.17.1

#69Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Alexey Kondratov (#68)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 16.09.2019 19:54, Alexey Kondratov wrote:

On 30.08.2019 18:59, Konstantin Knizhnik wrote:

I think that instead of defining savepoints it is simpler and more
efficient to use

BeginInternalSubTransaction +
ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction

as it is done in PL/pgSQL (pl_exec.c).
Not sure if it can pr

Both BeginInternalSubTransaction and DefineSavepoint use
PushTransaction() internally for a normal subtransaction start. So
they seem to be identical from a performance perspective, which is
also stated in the comment section:

Yes, they are definitely using the same mechanism and most likely
provide similar performance.
But BeginInternalSubTransaction does not require generating a
savepoint name, which seems redundant in this case.
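
For comparison, the two patterns boil down to roughly the following
sketch (error handling and the surrounding transaction-state calls are
omitted; the savepoint name format follows the posted worker patch):

#include "postgres.h"

#include "access/xact.h"

/* PL/pgSQL-style: anonymous internal subtransaction, no name needed */
static void
apply_subxact_internal(bool commit_subxact)
{
    BeginInternalSubTransaction(NULL);

    /* ... apply the subtransaction's changes here ... */

    if (commit_subxact)
        ReleaseCurrentSubTransaction();
    else
        RollbackAndReleaseCurrentSubTransaction();
}

/* Savepoint-based variant, roughly as the posted worker patch does it */
static void
apply_subxact_savepoint(TransactionId subxid)
{
    char        spname[64];

    snprintf(spname, sizeof(spname), "savepoint_for_xid_%u", subxid);
    DefineSavepoint(spname);

    /* ... apply the subtransaction's changes; then, on abort: */
    RollbackToSavepoint(spname);
}

Either way the subtransaction start goes through PushTransaction(), so
the difference is mostly the savepoint-name bookkeeping.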

Anyway, I've profiled my apply worker (flamegraph is attached) and it
spends the vast majority of its time (>90%) applying changes. So the
problem is not in the savepoints themselves, but in the fact that we
first apply all changes and then abort all the work. Not sure that it
is possible to do anything about this case.

Looks like the only way to increase apply speed is to do it in parallel:
make it possible to concurrently execute non-conflicting transactions.

#70Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Konstantin Knizhnik (#69)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Sep 16, 2019 at 10:29:18PM +0300, Konstantin Knizhnik wrote:

On 16.09.2019 19:54, Alexey Kondratov wrote:

On 30.08.2019 18:59, Konstantin Knizhnik wrote:

I think that instead of defining savepoints it is simpler and more
efficient to use

BeginInternalSubTransaction +
ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction

as it is done in PL/pgSQL (pl_exec.c).
Not sure if it can pr

Both BeginInternalSubTransaction and DefineSavepoint use
PushTransaction() internally for a normal subtransaction start. So
they seem to be identical from a performance perspective, which
is also stated in the comment section:

Yes, they are definitely using the same mechanism and most likely
provide similar performance.
But BeginInternalSubTransaction does not require generating a
savepoint name, which seems redundant in this case.

Anyway, I've profiled my apply worker (flamegraph is attached) and it
spends the vast majority of its time (>90%) applying changes. So the
problem is not in the savepoints themselves, but in the fact that we
first apply all changes and then abort all the work. Not sure that it
is possible to do anything about this case.

Looks like the only way to increase apply speed is to do it in
parallel: make it possible to concurrently execute non-conflicting
transactions.

True, although it seems like a massive can of worms to me. I'm not aware
of a way to identify non-conflicting transactions in advance, so it would
have to be implemented as optimistic apply, with detection of and
recovery from conflicts.

I'm not against doing that, and I'm willing to spend some time on reviews
etc., but it seems like a completely separate effort.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#71Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#66)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

In the interest of moving things forward, how far are we from making
0001 committable? If I understand correctly, the rest of this patchset
depends on https://commitfest.postgresql.org/24/944/ which seems to be
moving at a glacial pace (or, actually, slower, because glaciers do
move, which cannot be said of that other patch.)

I am not sure if it is completely correct that the other part of the
patch is dependent on that CF entry. I have studied both threads
(not every detail) and it seems to me it is dependent on one of the
patches from that series which handles concurrent aborts. It is patch
0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
from what Nikhil has posted on that thread [1]. Am I wrong?

So IIUC, the problem of concurrent aborts is that if we allow catalog
scans for in-progress transactions, then we might get wrong answers in
cases where somebody has performed Alter-Abort-Alter, which is clearly
explained with an example in email [2]. To solve that problem Nikhil
seems to have written a patch [1] which detects these concurrent
aborts during a system table scan and then aborts the decoding of such
a transaction.

Now, the problem is that the patch was written with 2PC
transactions in mind and might not deal with all cases for in-progress
transactions, especially when sub-transactions are involved, as alluded
to by Arseny Sher [3]. So, the problem seems to be for cases when some
sub-transaction aborts, but the main transaction still continues and
we try to decode it. Nikhil's patch won't be able to deal with that
because I think it just checks the top-level xid, whereas for this we need
to check all subxids, which I think is possible now that Tomas seems to
have written WAL records for each xid assignment. It might or might not be
the best solution to check the status of all subxids, but I think
first we need to agree that the problem is just for concurrent aborts
and that we can solve it by using some part of the technology being
developed as part of the patch "Logical decoding of two-phase
transactions" (https://commitfest.postgresql.org/24/944/) rather than
the entire patchset.
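
To make the subxid check concrete, the detection could look roughly
like the sketch below. This is purely illustrative (it is not taken
from any posted patch); it assumes the reorder buffer's subtxns list is
kept up to date, which the per-xid assignment WAL records should allow:

#include "postgres.h"

#include "access/transam.h"
#include "replication/reorderbuffer.h"

/*
 * Sketch: treat a streamed transaction as concurrently aborted if its
 * toplevel xid or any known subxact xid has aborted.
 */
static bool
txn_concurrently_aborted(ReorderBufferTXN *txn)
{
    dlist_iter  iter;

    if (TransactionIdDidAbort(txn->xid))
        return true;

    dlist_foreach(iter, &txn->subtxns)
    {
        ReorderBufferTXN *subtxn;

        subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
        if (TransactionIdDidAbort(subtxn->xid))
            return true;
    }

    return false;
}

If any xid in that set has aborted, decoding of the transaction would
be stopped, the same way Nikhil's patch does it for the toplevel xid.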

I hope I am not saying something very obvious here and it helps in
moving this patch forward.

Thoughts?

[1]: /messages/by-id/CAMGcDxcBmN6jNeQkgWddfhX8HbSjQpW=Uo70iBY3P_EPdp+LTQ@mail.gmail.com
[2]: /messages/by-id/EEBD82AA-61EE-46F4-845E-05B94168E8F2@postgrespro.ru
[3]: /messages/by-id/87a7py4iwl.fsf@ars-thinkpad

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#72Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#67)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Sep 3, 2019 at 4:16 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:

In the interest of moving things forward, how far are we from making
0001 committable? If I understand correctly, the rest of this patchset
depends on https://commitfest.postgresql.org/24/944/ which seems to be
moving at a glacial pace (or, actually, slower, because glaciers do
move, which cannot be said of that other patch.)

I think 0001 is mostly there. I think there's one bug in this patch
version, but I need to check and I'll post an updated version shortly if
needed.

Did you get a chance to work on 0001?  I have a few comments on that patch:
1.
+ *   To limit the amount of memory used by decoded changes, we track memory
+ *   used at the reorder buffer level (i.e. total amount of memory), and for
+ *   each toplevel transaction. When the total amount of used memory exceeds
+ *   the limit, the toplevel transaction consuming the most memory is either
+ *   serialized or streamed.

Do we need to mention 'streamed' as part of this patch? It seems to
me that this is an independent patch which can be committed without
patches that stream the changes. So, we can remove it from here and
other places where it is used.
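
For context, the accounting described in the quoted comment amounts to
maintaining two counters, roughly as in this sketch (the "size" fields
are the ones the patch adds; ReorderBufferChangeSize() is an assumed
helper computing the memory used by one change):

#include "postgres.h"

#include "replication/reorderbuffer.h"

/* Sketch: adjust both counters whenever a change is added or removed */
static void
update_memory_accounting(ReorderBuffer *rb, ReorderBufferTXN *txn,
                         ReorderBufferChange *change, bool addition)
{
    Size        sz = ReorderBufferChangeSize(change);

    if (addition)
    {
        txn->size += sz;        /* per-toplevel-transaction counter */
        rb->size += sz;         /* whole reorder buffer */
    }
    else
    {
        Assert(txn->size >= sz && rb->size >= sz);
        txn->size -= sz;
        rb->size -= sz;
    }
}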

2.
+ *   deserializing and applying very few changes). We probably to give more
+ *   memory to the oldest subtransactions.

/We probably to/
It seems some word is missing after probably.

3.
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ *
+ * XXX With many subtransactions this might be quite slow, because we'll have
+ * to walk through all of them. There are some options how we could improve
+ * that: (a) maintain some secondary structure with transactions sorted by
+ * amount of changes, (b) not looking for the entirely largest transaction,
+ * but e.g. for transaction using at least some fraction of the memory limit,
+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
+ * of the memory limit (e.g. 50%).
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)

What is the guarantee that after evicting the largest transaction, we
won't immediately hit the memory limit again? Say all of the
transactions are of roughly similar size, which I don't think is an
uncommon case. Instead, the strategy mentioned in point (c) or
something like it seems more promising. In that strategy, there is some
risk that it might lead to many smaller disk writes, which we might
want to control via some threshold (like we should not flush more than
N xacts). In this, we also need to ensure that the total memory freed
is greater than the current change.
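
To illustrate, strategy (c) could look something like the sketch below
(the cap is a made-up constant; ReorderBufferLargestTXN() and
ReorderBufferSerializeTXN() are the routines from the patch):

#include "postgres.h"

#include "replication/reorderbuffer.h"

/* Illustrative cap on evictions per round, to bound small disk writes */
#define MAX_EVICTIONS_PER_ROUND 8

/*
 * Sketch of strategy (c): spill the currently largest transactions one
 * by one until "goal" bytes have been freed, or the cap is reached.
 */
static void
ReorderBufferEvictUntil(ReorderBuffer *rb, Size goal)
{
    Size        freed = 0;
    int         nevicted = 0;

    while (freed < goal && nevicted < MAX_EVICTIONS_PER_ROUND)
    {
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        if (txn == NULL)
            break;              /* nothing left in memory to evict */

        freed += txn->size;
        ReorderBufferSerializeTXN(rb, txn); /* spill its changes to disk */
        nevicted++;
    }
}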

I think we have had some discussion around this point but didn't reach any
conclusion, which means some more brainstorming is required.

4.
+int logical_work_mem; /* 4MB */

What does this 4MB in the comment indicate?

5.
+/*
+ * Check whether the logical_work_mem limit was reached, and if yes pick
+ * the transaction tx should spill its data to disk.

The second part of the sentence "pick the transaction tx should spill"
seems to be incomplete.

Apart from this, I see that Peter E. has raised some other points on
this patch which are not yet addressed; as those also need some
discussion, I will respond to them separately with my opinion.

These comments are based on the last patch posted by you on this
thread [1]. You might have fixed some of these already, so ignore them if
that is the case.

[1]: /messages/by-id/76fc440e-91c3-afe2-b78a-987205b3c758@2ndquadrant.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#73Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#71)
14 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi,

Attached is an updated patch series, rebased on current master. It does
fix one memory accounting bug in ReorderBufferToastReplace (the code was
not properly updating the amount of memory).

I've also included the patch series with decoding of 2PC transactions,
which this depends on. This way we have a chance of making the cfbot
happy. So parts 0001-0004 and 0009-0014 are "this" patch series, while
0005-0008 are the extra pieces from the other patch.

I've done it like this because the initial parts are independent, and so
might be committed irrespective of the other patch series. In practice
that's only reasonable for 0001, which adds the memory limit - the rest
is infrastructure for the streaming of in-progress transactions.

On Wed, Sep 25, 2019 at 06:55:01PM +0530, Amit Kapila wrote:

On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

In the interest of moving things forward, how far are we from making
0001 committable? If I understand correctly, the rest of this patchset
depends on https://commitfest.postgresql.org/24/944/ which seems to be
moving at a glacial pace (or, actually, slower, because glaciers do
move, which cannot be said of that other patch.)

I am not sure if it is completely correct that the other part of the
patch is dependent on that CF entry. I have studied both the threads
(not every detail) and it seems to me it is dependent on one of the
patches from that series which handles concurrent aborts. It is patch
0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
from what Nikhil has posted on that thread [1]. Am I wrong?

You're right - the part handling aborts is the only part required. There
are dependencies on some other changes from the 2PC patch, but those are
mostly refactorings that can be undone (e.g. switch from independent
flags to a single bitmap in reorderbuffer).
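
(To be concrete, that refactoring replaces the independent bool flags
in ReorderBufferTXN with a single bitmask, roughly like this - the
exact flag names in the patch may differ:

    #define RBTXN_HAS_CATALOG_CHANGES  0x0001
    #define RBTXN_IS_SUBXACT           0x0002
    #define RBTXN_IS_SERIALIZED        0x0004

    typedef struct ReorderBufferTXN
    {
        uint32      txn_flags;      /* bitmask of RBTXN_* flags */
        ...
    } ReorderBufferTXN;

so undoing it simply means going back to the separate flags.)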

So IIUC, the problem of concurrent aborts is that if we allow catalog
scans for in-progress transactions, then we might get wrong answers in
cases where somebody has performed Alter-Abort-Alter which is clearly
explained with an example in email [2]. To solve that problem Nikhil
seems to have written a patch [1] which detects these concurrent
aborts during a system table scan and then aborts the decoding of such
a transaction.

Now, the problem is that patch has written considering 2PC
transactions and might not deal with all cases for in-progress
transactions especially when sub-transactions are involved as alluded
by Arseny Sher [3]. So, the problem seems to be for cases when some
sub-transaction aborts, but the main transaction still continued and
we try to decode it. Nikhil's patch won't be able to deal with it
because I think it just checks top-level xid whereas for this we need
to check all-subxids which I think is possible now as Tomas seems to
have written WAL for each xid-assignment. It might or might not be
the best solution to check the status of all-subxids, but I think
first we need to agree that the problem is just for concurrent aborts
and that we can solve it by using some part of the technology being
developed as part of patch "Logical decoding of two-phase
transactions" (https://commitfest.postgresql.org/24/944/) rather than
the entire patchset.

I hope I am not saying something very obvious here and it helps in
moving this patch forward.

No, that's a good question, and I'm not sure what the answer is at the
moment. My understanding was that the infrastructure in the 2PC patch is
enough even for subtransactions, but I might be wrong. I need to think
about that for a while.

Maybe we should focus on the 0001 part for now - it can be committed
independently and does provide a useful feature.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Add-logical_work_mem-to-limit-ReorderBuffer-20190926.patch.gz (application/gzip)
0002-Immediately-WAL-log-assignments-20190926.patch.gz (application/gzip)
0003-Issue-individual-invalidations-with-wal_lev-20190926.patch.gz (application/gzip)
0004-Extend-the-output-plugin-API-with-stream-me-20190926.patch.gz (application/gzip)
0005-Cleaning-up-of-flags-in-ReorderBufferTXN-st-20190926.patch.gz (application/gzip)
0006-Support-decoding-of-two-phase-transactions--20190926.patch.gz (application/gzip)
0007-Gracefully-handle-concurrent-aborts-of-unco-20190926.patch.gz (application/gzip)
0008-Teach-test_decoding-plugin-to-work-with-2PC-20190926.patch.gz (application/gzip)
0009-Implement-streaming-mode-in-ReorderBuffer-20190926.patch.gz (application/gzip)
0010-Add-support-for-streaming-to-built-in-repli-20190926.patch.gz (application/gzip)
0011-Track-statistics-for-streaming-spilling-20190926.patch.gz (application/gzip)
0012-Enable-streaming-for-all-subscription-TAP-t-20190926.patch.gz (application/gzip)
0013-BUGFIX-set-final_lsn-for-subxacts-before-cl-20190926.patch.gz (application/gzip)
0014-Add-TAP-test-for-streaming-vs.-DDL-20190926.patch.gz (application/gzip)
#74Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#72)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Sep 26, 2019 at 06:58:17PM +0530, Amit Kapila wrote:

On Tue, Sep 3, 2019 at 4:16 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:

In the interest of moving things forward, how far are we from making
0001 committable? If I understand correctly, the rest of this patchset
depends on https://commitfest.postgresql.org/24/944/ which seems to be
moving at a glacial pace (or, actually, slower, because glaciers do
move, which cannot be said of that other patch.)

I think 0001 is mostly there. I think there's one bug in this patch
version, but I need to check and I'll post an updated version shortly if
needed.

Did you get a chance to work on 0001?  I have a few comments on that patch:
1.
+ *   To limit the amount of memory used by decoded changes, we track memory
+ *   used at the reorder buffer level (i.e. total amount of memory), and for
+ *   each toplevel transaction. When the total amount of used memory exceeds
+ *   the limit, the toplevel transaction consuming the most memory is either
+ *   serialized or streamed.

Do we need to mention 'streamed' as part of this patch? It seems to
me that this is an independent patch which can be committed without
patches that stream the changes. So, we can remove it from here and
other places where it is used.

You're right - this patch should not mention streaming because the parts
adding that capability are later in the series. So it can trigger just
the serialization to disk.

2.
+ *   deserializing and applying very few changes). We probably to give more
+ *   memory to the oldest subtransactions.

/We probably to/
It seems some word is missing after probably.

Yes.

3.
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ *
+ * XXX With many subtransactions this might be quite slow, because we'll have
+ * to walk through all of them. There are some options how we could improve
+ * that: (a) maintain some secondary structure with transactions sorted by
+ * amount of changes, (b) not looking for the entirely largest transaction,
+ * but e.g. for transaction using at least some fraction of the memory limit,
+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
+ * of the memory limit (e.g. 50%).
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)

What is the guarantee that after evicting largest transaction, we
won't immediately hit the memory limit? Say, all of the transactions
are of almost similar size which I don't think is that uncommon a
case.

Not sure I understand - what do you mean 'immediately hit'?

We do check the limit after queueing a change, and we know that this
change is what got us over the limit. We pick the largest transaction
(which has to be larger than the change we just entered) and evict it,
getting below the memory limit again.

The next change can get us over the memory limit again, of course, but
there's not much we could do about that.
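
In pseudo-code, the check is roughly the following (simplified; the
actual function names in the patch may differ slightly, and the GUC is
in kB):

    /* called after queueing each change */
    if (rb->size < logical_work_mem * 1024L)
        return;

    while (rb->size >= logical_work_mem * 1024L)
    {
        /* pick the transaction consuming the most memory */
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        /* spilling the changes to disk updates the accounting */
        ReorderBufferSerializeTXN(rb, txn);
    }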

Instead, the strategy mentioned in point (c) or something like
that seems more promising. In that strategy, there is some risk that
it might lead to many smaller disk writes which we might want to
control via some threshold (like we should not flush more than N
xacts). In this, we also need to ensure that the total memory freed
must be greater than the current change.

I think we have some discussion around this point but didn't reach any
conclusion which means some more brainstorming is required.

I agree it's worth investigating, but I'm not sure it's necessary before
committing v1 of the feature. I don't think there's a clear winner
strategy, and the current approach works fairly well I think.

The comment is concerned with the cost of ReorderBufferLargestTXN with
many transactions, but we can only have a certain number of top-level
transactions (max_connections + a certain number of not-yet-assigned
subtransactions). And the 0002 patch essentially gets rid of the subxacts
entirely, further reducing the maximum number of xacts to walk.
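
For reference, the walk itself is just a linear scan over the xid hash
table, something like this (slightly simplified):

    static ReorderBufferTXN *
    ReorderBufferLargestTXN(ReorderBuffer *rb)
    {
        HASH_SEQ_STATUS hash_seq;
        ReorderBufferTXNByIdEnt *ent;
        ReorderBufferTXN *largest = NULL;

        hash_seq_init(&hash_seq, rb->by_txn);
        while ((ent = hash_seq_search(&hash_seq)) != NULL)
        {
            ReorderBufferTXN *txn = ent->txn;

            if (largest == NULL || txn->size > largest->size)
                largest = txn;
        }

        return largest;
    }

so the cost is linear in the number of entries in the by_txn hash.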

4.
+int logical_work_mem; /* 4MB */

What this 4MB in comments indicate?

Sorry, that's a mistake.

5.
+/*
+ * Check whether the logical_work_mem limit was reached, and if yes pick
+ * the transaction tx should spill its data to disk.

The second part of the sentence "pick the transaction tx should spill"
seems to be incomplete.

Yeah, that's a poor wording. Will fix.

Apart from this, I see that Peter E. has raised some other points on
this patch which are not yet addressed as those also need some
discussion, so I will respond to those separately with my opinion.

OK, thanks.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#75Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#73)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2019-Sep-26, Tomas Vondra wrote:

Hi,

Attached is an updated patch series, rebased on current master. It does
fix one memory accounting bug in ReorderBufferToastReplace (the code was
not properly updating the amount of memory).

Cool.

Can we aim to get 0001 pushed during this commitfest, or is that a lost
cause?

The large new comment in reorderbuffer.c says that a transaction might
get spilled *or streamed*, but surely that second thing is not correct,
since before the subsequent patches it's not possible to stream
transactions that have not yet finished?

How certain are you about the approach to measure memory used by a
reorderbuffer transaction ... does it not cause a measurable performance
drop? I wonder if it would make more sense to use separate contexts
per transaction and use context-level accounting (per the patch Jeff
Davis posted elsewhere for hash joins ... though I see now that that
only works for aset.c, not other memcxt implementations), or something
like that.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#76Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#75)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2019-Sep-26, Alvaro Herrera wrote:

How certain are you about the approach to measure memory used by a
reorderbuffer transaction ... does it not cause a measurable performance
drop? I wonder if it would make more sense to use separate contexts
per transaction and use context-level accounting (per the patch Jeff
Davis posted elsewhere for hash joins ... though I see now that that
only works for aset.c, not other memcxt implementations), or something
like that.

Oh, I just noticed that that patch was posted separately in its own
thread, and that that improved version does include support for other
memory context implementations. Excellent.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#77Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#76)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Sep 26, 2019 at 04:36:20PM -0300, Alvaro Herrera wrote:

On 2019-Sep-26, Alvaro Herrera wrote:

How certain are you about the approach to measure memory used by a
reorderbuffer transaction ... does it not cause a measurable performance
drop? I wonder if it would make more sense to use separate contexts
per transaction and use context-level accounting (per the patch Jeff
Davis posted elsewhere for hash joins ... though I see now that that
only works for aset.c, not other memcxt implementations), or something
like that.

Oh, I just noticed that that patch was posted separately in its own
thread, and that that improved version does include support for other
memory context implementations. Excellent.

Unfortunately, that won't fly, for two simple reasons:

1) The memory accounting patch is known to perform poorly with many
child contexts - this was why array_agg/string_agg were problematic,
before we rewrote them not to create a memory context for each group.

It could be done differently (eager accounting) but then the overhead
for regular/common cases (with just a couple of contexts) is higher. So
that seems like a much inferior option.

2) We can't actually have a single context per transaction. Some parts
(REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID) of a transaction are not
evicted, so we'd have to keep them in a separate context.

It'd also mean higher allocation overhead, because currently we can
reuse chunks across transactions - one transaction commits or gets
serialized, and we reuse its chunks for something else. With
per-transaction contexts we'd lose some of this benefit - we could only
reuse chunks within a transaction (i.e. in large transactions that get
spilled to disk), but not across commits.

I don't have any numbers, of course, but I wouldn't be surprised if it
was significant e.g. for small transactions that don't get spilled. And
creating/destroying the contexts is not free either, I think.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#78Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#75)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Sep 26, 2019 at 04:33:59PM -0300, Alvaro Herrera wrote:

On 2019-Sep-26, Tomas Vondra wrote:

Hi,

Attached is an updated patch series, rebased on current master. It does
fix one memory accounting bug in ReorderBufferToastReplace (the code was
not properly updating the amount of memory).

Cool.

Can we aim to get 0001 pushed during this commitfest, or is that a lost
cause?

It's tempting. The patch has been in the queue for quite a bit of time,
and I think it's solid (at least 0001). I'll address the comments from
Peter's review about separating the GUC etc. and polish it a bit more.
If I manage to do that by Monday, I'll consider pushing it.

If anyone feels I shouldn't do that, let me know.

The one open question pointed out by Amit is how the patch picks the
transaction for eviction. My feeling is that's fine and can be
improved later if necessary, but I'll try to construct a worst case
(max_connections xacts, each with 64 subxacts) to verify.

The large new comment in reorderbuffer.c says that a transaction might
get spilled *or streamed*, but surely that second thing is not correct,
since before the subsequent patches it's not possible to stream
transactions that have not yet finished?

True. That's a residue of reordering the patch series repeatedly, I
think. I'll fix that while polishing the patch.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#79Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#74)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Sep 27, 2019 at 12:06 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, Sep 26, 2019 at 06:58:17PM +0530, Amit Kapila wrote:

3.
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ *
+ * XXX With many subtransactions this might be quite slow, because we'll have
+ * to walk through all of them. There are some options how we could improve
+ * that: (a) maintain some secondary structure with transactions sorted by
+ * amount of changes, (b) not looking for the entirely largest transaction,
+ * but e.g. for transaction using at least some fraction of the memory limit,
+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
+ * of the memory limit (e.g. 50%).
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)

What is the guarantee that after evicting largest transaction, we
won't immediately hit the memory limit? Say, all of the transactions
are of almost similar size which I don't think is that uncommon a
case.

Not sure I understand - what do you mean 'immediately hit'?

We do check the limit after queueing a change, and we know that this
change is what got us over the limit. We pick the largest transaction
(which has to be larger than the change we just entered) and evict it,
getting below the memory limit again.

The next change can get us over the memory limit again, of course,

Yeah, this is what I wanted to say when I wrote that it can immediately hit the limit again.

but
there's not much we could do about that.

Instead, the strategy mentioned in point (c) or something like
that seems more promising. In that strategy, there is some risk that
it might lead to many smaller disk writes which we might want to
control via some threshold (like we should not flush more than N
xacts). In this, we also need to ensure that the total memory freed
must be greater than the current change.

I think we have some discussion around this point but didn't reach any
conclusion which means some more brainstorming is required.

I agree it's worth investigating, but I'm not sure it's necessary before
committing v1 of the feature. I don't think there's a clear winner
strategy, and the current approach works fairly well I think.

The comment is concerned with the cost of ReorderBufferLargestTXN with
many transactions, but we can only have a certain number of top-level
transactions (max_connections + a certain number of not-yet-assigned
subtransactions). And the 0002 patch essentially gets rid of the subxacts
entirely, further reducing the maximum number of xacts to walk.

That would be good, but I don't understand how. The second patch will
update the subxacts in the top-level ReorderBufferTxn, but it won't
remove them from the hash table. It also doesn't seem to account for
the size of subxacts in the top-level xact, so I'm not sure how it will
reduce the number of xacts to walk. I might be missing something here.
Can you explain a bit how the 0002 patch would help in reducing the
maximum number of xacts to walk?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#80Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Eisentraut (#16)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 1/3/18 14:53, Tomas Vondra wrote:

I don't see the need to tie this setting to maintenance_work_mem.
maintenance_work_mem is often set to very large values, which could
then have undesirable side effects on this use.

Well, we need to pick some default value, and we can either use a fixed
value (not sure what would be a good default) or tie it to an existing
GUC. We only really have work_mem and maintenance_work_mem, and the
walsender process will never use more than one such buffer. Which seems
to be closer to maintenance_work_mem.

Pretty much any default value can have undesirable side effects.

Let's just make it an independent setting unless we know any better. We
don't have a lot of settings that depend on other settings, and the ones
we do have a very specific relationship.

Moreover, the name logical_work_mem makes it sound like it's a logical
version of work_mem. Maybe we could think of another name.

I won't object to a better name, of course. Any proposals?

logical_decoding_[work_]mem?

Having a separate variable for this can give more flexibility, but
OTOH it will add one more knob which users might not have a good idea
how to set. What are the problems we see if we directly use work_mem for
this case?

If we can't use work_mem, then I think the name proposed by you
(logical_decoding_work_mem) sounds good to me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#81Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#73)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Sep 26, 2019 at 11:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Sep 25, 2019 at 06:55:01PM +0530, Amit Kapila wrote:

On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

In the interest of moving things forward, how far are we from making
0001 committable? If I understand correctly, the rest of this patchset
depends on https://commitfest.postgresql.org/24/944/ which seems to be
moving at a glacial pace (or, actually, slower, because glaciers do
move, which cannot be said of that other patch.)

I am not sure if it is completely correct that the other part of the
patch is dependent on that CF entry. I have studied both the threads
(not every detail) and it seems to me it is dependent on one of the
patches from that series which handles concurrent aborts. It is patch
0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
from what Nikhil has posted on that thread [1]. Am I wrong?

You're right - the part handling aborts is the only part required. There
are dependencies on some other changes from the 2PC patch, but those are
mostly refactorings that can be undone (e.g. switch from independent
flags to a single bitmap in reorderbuffer).

So IIUC, the problem of concurrent aborts is that if we allow catalog
scans for in-progress transactions, then we might get wrong answers in
cases where somebody has performed Alter-Abort-Alter which is clearly
explained with an example in email [2]. To solve that problem Nikhil
seems to have written a patch [1] which detects these concurrent
aborts during a system table scan and then aborts the decoding of such
a transaction.

Now, the problem is that patch has written considering 2PC
transactions and might not deal with all cases for in-progress
transactions especially when sub-transactions are involved as alluded
by Arseny Sher [3]. So, the problem seems to be for cases when some
sub-transaction aborts, but the main transaction still continued and
we try to decode it. Nikhil's patch won't be able to deal with it
because I think it just checks top-level xid whereas for this we need
to check all-subxids which I think is possible now as Tomas seems to
have written WAL for each xid-assignment. It might or might not be
the best solution to check the status of all-subxids, but I think
first we need to agree that the problem is just for concurrent aborts
and that we can solve it by using some part of the technology being
developed as part of patch "Logical decoding of two-phase
transactions" (https://commitfest.postgresql.org/24/944/) rather than
the entire patchset.

I hope I am not saying something very obvious here and it helps in
moving this patch forward.

No, that's a good question, and I'm not sure what the answer is at the
moment. My understanding was that the infrastructure in the 2PC patch is
enough even for subtransactions, but I might be wrong.

I also think the patch that handles concurrent aborts should be
sufficient, but it needs to be integrated with your patch. Earlier,
I thought we needed to check whether any of the subtransactions is
aborted, as mentioned by Arseny Sher, but now after thinking again
about that problem, it seems that checking only the status of the
current subtransaction should be sufficient. Because, if the user does
Rollback to Savepoint concurrently, which aborts multiple
subtransactions, the latest one must be aborted as well, which is what
I think we want to detect. Once we detect that, we have two options:
(a) restart the decode of that transaction by removing changes of all
subxacts, or (b) somehow mark the transaction such that it gets decoded
only at commit time.

Maybe we should focus on the 0001 part for now - it can be committed
independently and does provide a useful feature.

If that can be done sooner, then it is fine, but otherwise, preparing
the patches on top of HEAD can facilitate the review of those.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#82Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#80)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:

On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 1/3/18 14:53, Tomas Vondra wrote:

I don't see the need to tie this setting to maintenance_work_mem.
maintenance_work_mem is often set to very large values, which could
then have undesirable side effects on this use.

Well, we need to pick some default value, and we can either use a fixed
value (not sure what would be a good default) or tie it to an existing
GUC. We only really have work_mem and maintenance_work_mem, and the
walsender process will never use more than one such buffer. Which seems
to be closer to maintenance_work_mem.

Pretty much any default value can have undesirable side effects.

Let's just make it an independent setting unless we know any better. We
don't have a lot of settings that depend on other settings, and the ones
we do have a very specific relationship.

Moreover, the name logical_work_mem makes it sound like it's a logical
version of work_mem. Maybe we could think of another name.

I won't object to a better name, of course. Any proposals?

logical_decoding_[work_]mem?

Having a separate variable for this can give more flexibility, but
OTOH it will add one more knob which users might not have a good idea
how to set. What are the problems we see if we directly use work_mem for
this case?

IMHO it's similar to autovacuum_work_mem - we have an independent
setting, but most people use it as -1 so we use maintenance_work_mem as
a default value. I think it makes sense to do the same thing here.

It does add an extra knob anyway (I don't think we should just use
maintenance_work_mem directly, the user should have an option to
override it when needed). But most users will not notice.

FWIW I don't think we should use work_mem, maintenance_work_mem seems
somewhat more appropriate here (not related to queries, etc.).

If we can't use work_mem, then I think the name proposed by you
(logical_decoding_work_mem) sounds good to me.

Yes, that name seems better.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#83Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#82)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:

On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 1/3/18 14:53, Tomas Vondra wrote:

I don't see the need to tie this setting to maintenance_work_mem.
maintenance_work_mem is often set to very large values, which could
then have undesirable side effects on this use.

Well, we need to pick some default value, and we can either use a fixed
value (not sure what would be a good default) or tie it to an existing
GUC. We only really have work_mem and maintenance_work_mem, and the
walsender process will never use more than one such buffer. Which seems
to be closer to maintenance_work_mem.

Pretty much any default value can have undesirable side effects.

Let's just make it an independent setting unless we know any better. We
don't have a lot of settings that depend on other settings, and the ones
we do have a very specific relationship.

Moreover, the name logical_work_mem makes it sound like it's a logical
version of work_mem. Maybe we could think of another name.

I won't object to a better name, of course. Any proposals?

logical_decoding_[work_]mem?

Having a separate variable for this can give more flexibility, but
OTOH it will add one more knob which users might not have a good idea
how to set. What are the problems we see if we directly use work_mem for
this case?

IMHO it's similar to autovacuum_work_mem - we have an independent
setting, but most people use it as -1 so we use maintenance_work_mem as
a default value. I think it makes sense to do the same thing here.

It does add an extra knob anyway (I don't think we should just use
maintenance_work_mem directly, the user should have an option to
override it when needed). But most users will not notice.

FWIW I don't think we should use work_mem, maintenance_work_mem seems
somewhat more appropriate here (not related to queries, etc.).

I have the same concern for using maintenance_work_mem as Peter E.,
which is that the value of maintenance_work_mem will generally be
higher, which is suitable for its current purpose, but not for the
purpose this patch is using it. AFAIU, at this stage we want a better
memory accounting system for logical decoding and we are not sure what
is a good value for this variable. So, I think using work_mem or
maintenance_work_mem should serve the purpose. Later, if we have
requirements from people to have better control over the memory
required for this purpose then we can introduce a new variable.

I understand that currently work_mem is primarily tied to memory
used for query workspaces, but it might be okay to extend it for this
purpose. Another point is that its default seems more appealing for
this case. I can see the argument against it, which is that having a
separate variable will make things look cleaner and give better
control. So, if we can't convince ourselves to use work_mem, we can
introduce a new guc variable and keep the default as 4MB or work_mem.

I feel it is always tempting to introduce a new guc for each different
task unless there is an exact match, but OTOH, having fewer gucs
has its own advantage, which is that people don't have to bother about
a new setting which they need to tune, and especially one for which
they can't decide a value with ease. I am not saying that we should
not introduce a new guc when it is required, just that we should give
it more thought before doing so.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#84Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#73)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Sep 26, 2019 at 11:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

Attached is an updated patch series, rebased on current master. It does
fix one memory accounting bug in ReorderBufferToastReplace (the code was
not properly updating the amount of memory).

A few comments on 0001:
1.
I am getting the below linking error in pgoutput when compiling the
patch on my Windows system:
pgoutput.obj : error LNK2001: unresolved external symbol _logical_work_mem

You need to use PGDLLIMPORT for logical_work_mem.
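
I.e. the extern declaration needs to be something like:

    extern PGDLLIMPORT int logical_work_mem;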

2. After I fixed the above and tried some basic tests, it failed with
the below callstack:
postgres.exe!ExceptionalCondition(const char *
conditionName=0x00d92854, const char * errorType=0x00d928bc, const
char * fileName=0x00d92e60,
int lineNumber=2148) Line 55
postgres.exe!ReorderBufferChangeMemoryUpdate(ReorderBuffer *
rb=0x02693390, ReorderBufferChange * change=0x0269dd38, bool
addition=true) Line 2148
postgres.exe!ReorderBufferQueueChange(ReorderBuffer * rb=0x02693390,
unsigned int xid=525, unsigned __int64 lsn=36083720,
ReorderBufferChange
* change=0x0269dd38) Line 635
postgres.exe!DecodeInsert(LogicalDecodingContext * ctx=0x0268ef80,
XLogRecordBuffer * buf=0x012cf718) Line 716 + 0x24 bytes C
postgres.exe!DecodeHeapOp(LogicalDecodingContext * ctx=0x0268ef80,
XLogRecordBuffer * buf=0x012cf718) Line 437 + 0xd bytes C
postgres.exe!LogicalDecodingProcessRecord(LogicalDecodingContext *
ctx=0x0268ef80, XLogReaderState * record=0x0268f228) Line 129
postgres.exe!pg_logical_slot_get_changes_guts(FunctionCallInfoBaseData
* fcinfo=0x02688680, bool confirm=true, bool binary=false) Line 307
postgres.exe!pg_logical_slot_get_changes(FunctionCallInfoBaseData *
fcinfo=0x02688680) Line 376

Basically, the assert added by you in ReorderBufferChangeMemoryUpdate
failed. Then, I explored a bit and it seems that you have missed
assigning a value to txn, a new variable added by this patch in
structure ReorderBufferChange:
@@ -77,6 +82,9 @@ typedef struct ReorderBufferChange
/* The type of change. */
enum ReorderBufferChangeType action;

+ /* Transaction this change belongs to. */
+ struct ReorderBufferTXN *txn;
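
I think the fix is to set it when the change is queued, something like
this in ReorderBufferQueueChange() (sketch):

    txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

    change->lsn = lsn;
    change->txn = txn;      /* <-- the missing assignment */

so that the accounting code can find the owning transaction.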
3.
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_work_mem</varname>.
+         </para>
+        </listitem>
+       </varlistentry>

I don't see any explanation of how this will be useful. How can a
subscriber predict the amount of memory required by a publisher for
decoding? This is all the more unpredictable because when the changes
are initially recorded in the ReorderBuffer, they are not even filtered
for any particular publisher. Do we really need this? I think giving
more knobs to the user is helpful when they can somehow know how to
use them. In this case, it is not clear whether the user can ever use
this.

4. Can we somehow expose the memory consumed by ReorderBuffer? If
so, we might be able to write some tests covering the new functionality.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#85Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#83)
14 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Sep 28, 2019 at 01:36:46PM +0530, Amit Kapila wrote:

On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:

On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 1/3/18 14:53, Tomas Vondra wrote:

I don't see the need to tie this setting to maintenance_work_mem.
maintenance_work_mem is often set to very large values, which could
then have undesirable side effects on this use.

Well, we need to pick some default value, and we can either use a fixed
value (not sure what would be a good default) or tie it to an existing
GUC. We only really have work_mem and maintenance_work_mem, and the
walsender process will never use more than one such buffer. Which seems
to be closer to maintenance_work_mem.

Pretty much any default value can have undesirable side effects.

Let's just make it an independent setting unless we know any better. We
don't have a lot of settings that depend on other settings, and the ones
we do have a very specific relationship.

Moreover, the name logical_work_mem makes it sound like it's a logical
version of work_mem. Maybe we could think of another name.

I won't object to a better name, of course. Any proposals?

logical_decoding_[work_]mem?

Having a separate variable for this can give more flexibility, but
OTOH it will add one more knob which users might not have a good idea
how to set. What are the problems we see if we directly use work_mem for
this case?

IMHO it's similar to autovacuum_work_mem - we have an independent
setting, but most people use it as -1 so we use maintenance_work_mem as
a default value. I think it makes sense to do the same thing here.

It does add an extra knob anyway (I don't think we should just use
maintenance_work_mem directly, the user should have an option to
override it when needed). But most users will not notice.

FWIW I don't think we should use work_mem, maintenance_work_mem seems
somewhat more appropriate here (not related to queries, etc.).

I have the same concern for using maintenance_work_mem as Peter E.,
which is that the value of maintenance_work_mem will generally be
higher, which is suitable for its current purpose, but not for the
purpose this patch is using it. AFAIU, at this stage we want a better
memory accounting system for logical decoding and we are not sure what
is a good value for this variable. So, I think using work_mem or
maintenance_work_mem should serve the purpose. Later, if we have
requirements from people to have better control over the memory
required for this purpose then we can introduce a new variable.

I understand that currently work_mem is primarily tied to memory
used for query workspaces, but it might be okay to extend it for this
purpose. Another point is that its default seems more appealing for
this case. I can see the argument against it, which is that having a
separate variable will make things look cleaner and give better
control. So, if we can't convince ourselves to use work_mem, we can
introduce a new guc variable and keep the default as 4MB or work_mem.

I feel it is always tempting to introduce a new guc for each different
task unless there is an exact match, but OTOH, having fewer gucs
has its own advantage, which is that people don't have to bother about
a new setting which they need to tune, and especially one for which
they can't decide a value with ease. I am not saying that we should
not introduce a new guc when it is required, just that we should give
it more thought before doing so.

I do think having a separate GUC is a must, irrespective of what other
GUC (if any) is used as a default. You're right the maintenance_work_mem
value might be too high (e.g. in cases with many subscriptions), but the
same issue applies to work_mem - there's no guarantee work_mem is lower
than maintenance_work_mem, and in analytics databases it may be set very
high. So work_mem does not really solve the issue.

IMHO we can't really do without a new GUC. It's not difficult to create
examples that would benefit from small/large memory limit, depending on
the number of subscriptions etc.

I do however agree the GUC does not have to be tied to any existing one,
it was just an attempt to use a more sensible default value. I do think
m_w_m would be fine, but I can live with using an explicit value.

So that's what I did in the attached patch - I've renamed the GUC to
logical_decoding_work_mem, detached it from m_w_m and set the default to
64MB (i.e. the same default as m_w_m). It should also fix all the issues
from the recent reviews (at least I believe so).

I've realized that one of the subsequent patches allows overriding the
limit for individual subscriptions (in the CREATE SUBSCRIPTION command).
I think it'd be good to move this bit forward, but I think it can be
done in a separate patch.
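
(As a purely hypothetical sketch of that override - the option name is
illustrative, not final:

    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=primary dbname=postgres'
        PUBLICATION mypub
        WITH (work_mem = '32MB');

i.e. a per-subscription knob falling back to the GUC when not set.)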

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0010-Add-support-for-streaming-to-built-in-replication.patch.gzapplication/gzipDownload
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.patch.gzapplication/gzipDownload
0002-Immediately-WAL-log-assignments.patch.gzapplication/gzipDownload
0003-Issue-individual-invalidations-with-wal_level-logica.patch.gzapplication/gzipDownload
0004-Extend-the-output-plugin-API-with-stream-methods.patch.gzapplication/gzipDownload
0005-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch.gzapplication/gzipDownload
0006-Support-decoding-of-two-phase-transactions-at-PREPAR.patch.gzapplication/gzipDownload
0007-Gracefully-handle-concurrent-aborts-of-uncommitted.patch.gzapplication/gzipDownload
0008-Teach-test_decoding-plugin-to-work-with-2PC.patch.gzapplication/gzipDownload
0009-Implement-streaming-mode-in-ReorderBuffer.patch.gzapplication/gzipDownload
0011-Track-statistics-for-streaming-spilling.patch.gzapplication/gzipDownload
0012-Enable-streaming-for-all-subscription-TAP-tests.patch.gzapplication/gzipDownload
0013-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch.gzapplication/gzipDownload
0014-Add-TAP-test-for-streaming-vs.-DDL.patch.gzapplication/gzipDownload
#86Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#73)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Sep 26, 2019 at 11:38 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

No, that's a good question, and I'm not sure what the answer is at the
moment. My understanding was that the infrastructure in the 2PC patch is
enough even for subtransactions, but I might be wrong. I need to think
about that for a while.

IIUC, for 2PC it's enough to check whether the main transaction is
aborted or not, but for an in-progress transaction it's possible that
the current subtransaction has done catalog changes and might get
aborted while we are decoding. So we need to extend the
infrastructure so that we can check the status of the (sub)transaction
for which we are decoding the change. Also, I think we need to handle
ERRCODE_TRANSACTION_ROLLBACK and ignore it.

I have attached a small patch to handle this which can be applied on
top of your patch set.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

handle_concurrent_abort_for_in_progress_transaction.patchapplication/octet-stream; name=handle_concurrent_abort_for_in_progress_transaction.patchDownload
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 163395f..ca7e2d7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3449,6 +3449,7 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	volatile CommandId command_id;
 	bool		using_subtxn;
 	Size		streamed = 0;
+	MemoryContext ccxt = CurrentMemoryContext;
 	ReorderBufferStreamIterTXNState *volatile iterstate = NULL;
 
 	/*
@@ -3579,6 +3580,13 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			/* we're going to stream this change */
 			streamed++;
 
+			/*
+			 * Set the CheckXidAlive to the current (sub)xid for which this
+			 * change belongs to so that we can detect the abort while we are
+			 * decoding.
+			 */
+			CheckXidAlive = change->txn->xid;
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -3891,6 +3899,9 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+		
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferStreamIterTXNFinish(rb, iterstate);
@@ -3909,7 +3920,14 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		PG_RE_THROW();
+		/* re-throw only if it's not an abort */
+		if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+		else
+			FlushErrorState();
 	}
 	PG_END_TRY();
 
#87Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#85)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sat, Sep 28, 2019 at 01:36:46PM +0530, Amit Kapila wrote:

On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I do think having a separate GUC is a must, irrespective of what other
GUC (if any) is used as a default. You're right that the maintenance_work_mem
value might be too high (e.g. in cases with many subscriptions), but the
same issue applies to work_mem - there's no guarantee work_mem is lower
than maintenance_work_mem, and in analytics databases it may be set very
high. So work_mem does not really solve the issue.

IMHO we can't really do without a new GUC. It's not difficult to create
examples that would benefit from small/large memory limit, depending on
the number of subscriptions etc.

I do however agree the GUC does not have to be tied to any existing one,
it was just an attempt to use a more sensible default value. I do think
m_w_m would be fine, but I can live with using an explicit value.

So that's what I did in the attached patch - I've renamed the GUC to
logical_decoding_work_mem, detached it from m_w_m and set the default to
64MB (i.e. the same default as m_w_m).

Fair enough, let's not argue more on this unless someone else wants to
share his opinion.

It should also fix all the issues
from the recent reviews (at least I believe so).

Have you given any thought to creating a test case for this patch? I
think you also said that you would test some worst-case scenarios and
report the numbers so that we are convinced that the current eviction
algorithm is good.

I've realized that one of the subsequent patches allows overriding the
limit for individual subscriptions (in the CREATE SUBSCRIPTION command).
I think it'd be good to move this bit forward, but I think it can be
done in a separate patch.

Yeah, it is better to deal with it separately, as I am also not entirely
convinced about this parameter at this stage. I have mentioned the
same in the previous email as well.

While glancing through the changes, I noticed a small thing:
+#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use maintenance_work_mem

I guess this needs to be updated.
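
(Presumably it just needs the stale fallback clause dropped, i.e.
something like:

    #logical_decoding_work_mem = 64MB   # min 1MB

now that the -1 / maintenance_work_mem fallback is gone.)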

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#88Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Kapila (#87)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2019-Sep-29, Amit Kapila wrote:

On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

So that's what I did in the attached patch - I've renamed the GUC to
logical_decoding_work_mem, detached it from m_w_m and set the default to
64MB (i.e. the same default as m_w_m).

Fair enough, let's not argue more on this unless someone else wants to
share his opinion.

I just read this part of the conversation and I agree that having a
separate GUC with its own value independent from other GUCs is a good
solution. Tying it to m_w_m seemed reasonable, but it's true that
people frequently set m_w_m very high, and it would be undesirable to
propagate that value to logical decoding memory usage.

I wonder what would constitute good advice on how to set this value; I
mean, what is the metric that the user needs to be thinking about? Is
it the total memory required to keep all concurrent write transactions
in memory? (Quick example: if you do 2048 wTPS and each transaction
lasts 1s, and each transaction does 1kB of logically-decoded changes,
then ~2MB are sufficient for the average case. Is that correct? I
*think* that full-page images do not count, correct? With these things
in mind users could go through pg_waldump output and figure out what to
set the value to.)
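
(In formula form, my back-of-envelope reading of that example, assuming
a steady state and ignoring per-transaction bookkeeping overhead:

    needed memory ~= wTPS * avg txn duration * decoded bytes per txn
                  = 2048 tx/s * 1 s * 1 kB ~= 2 MB

so the metric would be the amount of concurrent in-flight decoded
changes.)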

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#89Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#88)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Sep 29, 2019 at 02:30:44PM -0300, Alvaro Herrera wrote:

On 2019-Sep-29, Amit Kapila wrote:

On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

So that's what I did in the attached patch - I've renamed the GUC to
logical_decoding_work_mem, detached it from m_w_m and set the default to
64MB (i.e. the same default as m_w_m).

Fair enough, let's not argue more on this unless someone else wants to
share his opinion.

I just read this part of the conversation and I agree that having a
separate GUC with its own value independent from other GUCs is a good
solution. Tying it to m_w_m seemed reasonable, but it's true that
people frequently set m_w_m very high, and it would be undesirable to
propagate that value to logical decoding memory usage.

I wonder what would constitute good advice on how to set this value; I
mean, what is the metric that the user needs to be thinking about? Is
it the total memory required to keep all concurrent write transactions
in memory? (Quick example: if you do 2048 wTPS and each transaction
lasts 1s, and each transaction does 1kB of logically-decoded changes,
then ~2MB are sufficient for the average case. Is that correct?

Yes, something like that. Essentially we'd like to keep all concurrent
transactions decoded in memory, to eliminate the need to spill to disk.
One of the subsequent patches adds some subscription-level stats, so
maybe we don't need to worry about this too much - the stats seem like a
better source of information for tuning.
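
(Hypothetically, once such stats exist, tuning might boil down to
watching the spill counters for a slot and raising the limit until they
stop growing - with illustrative view/column names, nothing like this
exists yet:

    SELECT slot_name, spill_txns, spill_bytes
      FROM pg_stat_replication_slots;

That seems easier than reasoning about workload arithmetic up front.)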

I *think* that full-page images do not count, correct? With these
things in mind users could go through pg_waldump output and figure out
what to set the value to.)

Right, FPW do not matter here.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#90Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#87)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Sep 29, 2019 at 11:24 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Yeah, it is better to deal with it separately, as I am also not entirely
convinced about this parameter at this stage. I have mentioned the
same in the previous email as well.

While glancing through the changes, I noticed a small thing:
+#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use maintenance_work_mem

I guess this needs to be updated.

On further testing, I found that the patch seems to have problems with
toast. Consider below scenario:
Session-1
Create table large_text(t1 text);
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

Session-2
SELECT * FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
*--kaboom*

The second statement in Session-2 leads to a crash.

Other than that, I am not sure if the changes related to spilling to disk
after logical_decoding_work_mem work for toast tables, as I couldn't hit
that code for the toast-table case, but I might be missing something. As
mentioned previously, I feel there should be some way to test whether this
patch works for the cases it claims to work. As of now, I have to check
via debugging. Let me know if there is any way I can test this.

I am reluctant to say, but I think this patch still needs some more work
(review, test, rework) before we can commit it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#91Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#90)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:

On Sun, Sep 29, 2019 at 11:24 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Yeah, it is better to deal it separately as I am also not entirely
convinced at this stage about this parameter. I have mentioned the
same in the previous email as well.

While glancing through the changes, I noticed a small thing:
+#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use maintenance_work_mem

I guess this needs to be updated.

On further testing, I found that the patch seems to have problems with
toast. Consider below scenario:
Session-1
Create table large_text(t1 text);
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

Session-2
SELECT * FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
*--kaboom*

The second statement in Session-2 leads to a crash.

OK, thanks for the report - will investigate.

Other than that, I am not sure if the changes related to spilling to disk
after logical_decoding_work_mem work for toast tables, as I couldn't hit
that code for the toast-table case, but I might be missing something. As
mentioned previously, I feel there should be some way to test whether this
patch works for the cases it claims to work. As of now, I have to check
via debugging. Let me know if there is any way I can test this.

That's one of the reasons why I proposed to move the statistics (which
say how many transactions / bytes were spilled to disk) forward from a
later patch in the series. I don't think there's a better way.
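
(One way to exercise the spill path by hand, assuming the GUC can be set
at the session level, is to shrink the limit so that almost any
transaction spills:

    SET logical_decoding_work_mem = '1MB';
    SELECT count(*)
      FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);

and then look for xid-*.spill files under pg_replslot/regression_slot/.)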

I am reluctant to say, but I think this patch still needs some more work
(review, test, rework) before we can commit it.

I agree.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#92Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#91)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:

On further testing, I found that the patch seems to have problems with
toast. Consider below scenario:
Session-1
Create table large_text(t1 text);
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

Session-2
SELECT * FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
*--kaboom*

The second statement in Session-2 leads to a crash.

OK, thanks for the report - will investigate.

It was an assertion failure in ReorderBufferCleanupTXN at below line:
+ /* Check we're not mixing changes from different transactions. */
+ Assert(change->txn == txn);

Other than that, I am not sure if the changes related to spilling to disk
after logical_decoding_work_mem work for toast tables, as I couldn't hit
that code for the toast-table case, but I might be missing something. As
mentioned previously, I feel there should be some way to test whether this
patch works for the cases it claims to work. As of now, I have to check
via debugging. Let me know if there is any way I can test this.

That's one of the reasons why I proposed to move the statistics (which
say how many transactions / bytes were spilled to disk) forward from a
later patch in the series. I don't think there's a better way.

I like that idea, but I think you need to split that patch to include only
the stats related to spilling. It would be easier to review if you can
prepare that atop
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#93Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#92)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:

On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:

On further testing, I found that the patch seems to have problems with
toast. Consider below scenario:
Session-1
Create table large_text(t1 text);
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

Session-2
SELECT * FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
*--kaboom*

The second statement in Session-2 leads to a crash.

OK, thanks for the report - will investigate.

It was an assertion failure in ReorderBufferCleanupTXN at below line:
+ /* Check we're not mixing changes from different transactions. */
+ Assert(change->txn == txn);

Can you still reproduce this issue with the patch I sent on 28/9? I have
been unable to trigger the failure, and it seems pretty similar to the
failure you reported (and I fixed) on 28/9.

Other than that, I am not sure if the changes related to spilling to disk
after logical_decoding_work_mem work for toast tables, as I couldn't hit
that code for the toast-table case, but I might be missing something. As
mentioned previously, I feel there should be some way to test whether this
patch works for the cases it claims to work. As of now, I have to check
via debugging. Let me know if there is any way I can test this.

That's one of the reasons why I proposed to move the statistics (which
say how many transactions / bytes were spilled to disk) forward from a
later patch in the series. I don't think there's a better way.

I like that idea, but I think you need to split that patch to include only
the stats related to spilling. It would be easier to review if you can
prepare that atop
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.

Sure, I wasn't really proposing to add all the stats from that patch,
including those related to streaming. We need to extract just those
related to spilling. And yes, it needs to be moved right after 0001.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#94Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#93)
14 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

I have attempted to test the performance of (Stream + Spill) vs
(Stream + BGW pool) and I can see a gain similar to what Alexey had
shown [1].

In addition to this, I have rebased the latest patch set [2] without
the two-phase logical decoding patches.

Test results:
I have repeated the same test as Alexey [1] for 1kk and 3kk rows, and
here are my results:
Stream + Spill
N     time on master (sec)    total xact time (sec)
1kk   6                       21
3kk   18                      55

Stream + BGW pool
N     time on master (sec)    total xact time (sec)
1kk   6                       13
3kk   19                      35

Patch details:
All the patches are the same as posted in [2], except:
1. 0006-Gracefully-handle-concurrent-aborts-of-uncommitted -> I have
removed the error handling that is specific to 2PC.
2. 0007-Implement-streaming-mode-in-ReorderBuffer -> Rebased without 2PC.
3. 0009-Extend-the-concurrent-abort-handling-for-in-progress -> New
patch to handle the concurrent-abort error for in-progress transactions,
and also to handle subtransaction aborts.
4. v3-0014-BGWorkers-pool-for-streamed-transactions-apply -> Rebased
Alexey's patch.

[1]: /messages/by-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17@postgrespro.ru
[2]: /messages/by-id/20190928190917.hrpknmq76v3ts3lj@development

On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:

On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:

On further testing, I found that the patch seems to have problems with
toast. Consider below scenario:
Session-1
Create table large_text(t1 text);
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

Session-2
SELECT * FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
*--kaboom*

The second statement in Session-2 leads to a crash.

OK, thanks for the report - will investigate.

It was an assertion failure in ReorderBufferCleanupTXN at below line:
+ /* Check we're not mixing changes from different transactions. */
+ Assert(change->txn == txn);

Can you still reproduce this issue with the patch I sent on 28/9? I have
been unable to trigger the failure, and it seems pretty similar to the
failure you reported (and I fixed) on 28/9.

Other than that, I am not sure if the changes related to spilling to disk
after logical_decoding_work_mem work for toast tables, as I couldn't hit
that code for the toast-table case, but I might be missing something. As
mentioned previously, I feel there should be some way to test whether this
patch works for the cases it claims to work. As of now, I have to check
via debugging. Let me know if there is any way I can test this.

That's one of the reasons why I proposed to move the statistics (which
say how many transactions / bytes were spilled to disk) forward from a
later patch in the series. I don't think there's a better way.

I like that idea, but I think you need to split that patch to include only
the stats related to spilling. It would be easier to review if you can
prepare that atop
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.

Sure, I wasn't really proposing to add all the stats from that patch,
including those related to streaming. We need to extract just those
related to spilling. And yes, it needs to be moved right after 0001.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0005-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patchapplication/octet-stream; name=0005-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patchDownload
From bab2c69894e1a22bfe9a96f452644065150bd4a9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 18:08:37 +0200
Subject: [PATCH 05/13] Cleaning up of flags in ReorderBufferTXN structure

---
 src/backend/replication/logical/reorderbuffer.c | 34 ++++++++++++-------------
 src/include/replication/reorderbuffer.h         | 33 ++++++++++++++----------
 2 files changed, 37 insertions(+), 30 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 08b4d4f..ca4b904 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -728,7 +728,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_first_lsn < cur_txn->first_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_first_lsn = cur_txn->first_lsn;
 	}
@@ -748,7 +748,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
 	}
@@ -771,7 +771,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
 
 	txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
 
-	Assert(!txn->is_known_as_subxact);
+	Assert(!rbtxn_is_known_subxact(txn));
 	Assert(txn->first_lsn != InvalidXLogRecPtr);
 	return txn;
 }
@@ -831,7 +831,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 
 	if (!new_sub)
 	{
-		if (subtxn->is_known_as_subxact)
+		if (rbtxn_is_known_subxact(subtxn))
 		{
 			/* already associated, nothing to do */
 			return;
@@ -847,7 +847,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 		}
 	}
 
-	subtxn->is_known_as_subxact = true;
+	subtxn->txn_flags |= RBTXN_IS_SUBXACT;
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
@@ -1057,7 +1057,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	{
 		ReorderBufferChange *cur_change;
 
-		if (txn->serialized)
+		if (rbtxn_is_serialized(txn))
 		{
 			/* serialize remaining changes */
 			ReorderBufferSerializeTXN(rb, txn);
@@ -1086,7 +1086,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		{
 			ReorderBufferChange *cur_change;
 
-			if (cur_txn->serialized)
+			if (rbtxn_is_serialized(cur_txn))
 			{
 				/* serialize remaining changes */
 				ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1252,7 +1252,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * they originally were happening inside another subtxn, so we won't
 		 * ever recurse more than one level deep here.
 		 */
-		Assert(subtxn->is_known_as_subxact);
+		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
 		ReorderBufferCleanupTXN(rb, subtxn);
@@ -1300,7 +1300,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/*
 	 * Remove TXN from its containing list.
 	 *
-	 * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
 	 * from the LSN-ordered list of toplevel TXNs.
@@ -1315,7 +1315,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	Assert(found);
 
 	/* remove entries spilled to disk */
-	if (txn->serialized)
+	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
 	/* deallocate */
@@ -1332,7 +1332,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
 		return;
 
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1972,7 +1972,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 			 * final_lsn to that of their last change; this causes
 			 * ReorderBufferRestoreCleanup to do the right thing.
 			 */
-			if (txn->serialized && txn->final_lsn == 0)
+			if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
 			{
 				ReorderBufferChange *last =
 				dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2120,7 +2120,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
 	 * operate on its top-level transaction instead.
 	 */
 	txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 	Assert(txn->base_snapshot == NULL);
@@ -2298,7 +2298,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->has_catalog_changes = true;
+	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2315,7 +2315,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
 	if (txn == NULL)
 		return false;
 
-	return txn->has_catalog_changes;
+	return rbtxn_has_catalog_changes(txn);
 }
 
 /*
@@ -2335,7 +2335,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
 		return false;
 
 	/* a known subtxn? operate on top-level txn instead */
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 
@@ -2520,7 +2520,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
-	txn->serialized = true;
+	txn->txn_flags |= RBTXN_IS_SERIALIZED;
 
 	if (fd != -1)
 		CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e6d17fb..afe690a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -169,18 +169,34 @@ typedef struct ReorderBufferChange
 	dlist_node	node;
 } ReorderBufferChange;
 
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT          0x0002
+#define RBTXN_IS_SERIALIZED       0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn)    (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk?  It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn)       (txn->txn_flags & RBTXN_IS_SERIALIZED)
+
 typedef struct ReorderBufferTXN
 {
+	int     txn_flags;
+
 	/*
 	 * The transactions transaction id, can be a toplevel or sub xid.
 	 */
 	TransactionId xid;
 
-	/* did the TX have catalog changes */
-	bool		has_catalog_changes;
-
 	/* Do we know this is a subxact?  Xid of top-level txn if so */
-	bool		is_known_as_subxact;
 	TransactionId toplevel_xid;
 
 	/*
@@ -249,15 +265,6 @@ typedef struct ReorderBufferTXN
 	uint64		nentries_mem;
 
 	/*
-	 * Has this transaction been spilled to disk?  It's not always possible to
-	 * deduce that fact by comparing nentries with nentries_mem, because e.g.
-	 * subtransactions of a large transaction might get serialized together
-	 * with the parent - if they're restored to memory they'd have
-	 * nentries_mem == nentries.
-	 */
-	bool		serialized;
-
-	/*
 	 * List of ReorderBufferChange structs, including new Snapshots and new
 	 * CommandIds
 	 */
-- 
1.8.3.1

0004-Extend-the-output-plugin-API-with-stream-methods.patchapplication/octet-stream; name=0004-Extend-the-output-plugin-API-with-stream-methods.patchDownload
From 0db9e31ccc2537fc3c37edea1280210644905dc1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH 04/13] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 6c33c4b..9c77791 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -542,3 +571,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db9686..fc4ad65 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and the network bandwidth, the transfer time
+    may significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similarly to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_work_mem</varname> setting. At
+    that point the largest toplevel transaction (measured by amount of memory
+    currently used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index da265f5..3230e45 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -69,6 +69,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -193,6 +208,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require change/commit/abort callbacks. The
+	 * message callback is optional, similarly to regular output plugins. We
+	 * however enable streaming when at least one of the methods is enabled,
+	 * so that we can easily identify missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -867,6 +915,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = InvalidXLogRecPtr;	/* XXX no change LSN here */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	/* FIXME ctx->write_location = apply_lsn; */
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = InvalidXLogRecPtr;	/* XXX no change LSN here */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	/* FIXME ctx->write_location = apply_lsn; */
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 31c796b..d95d1b9 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -81,6 +81,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index d4ce54f..a305462 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to the remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to the remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if the transaction is streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when done streaming a block of changes from an in-progress
+ * transaction to a remote node (may be called repeatedly, if the
+ * transaction is streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 82dcb7f..e6d17fb 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -345,6 +345,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -384,6 +430,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
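
For context, here is a minimal (hypothetical) sketch of how an output
plugin would register the new callbacks from its _PG_output_plugin_init();
the my_* handlers are illustrative placeholders, only the struct fields
come from the patch:

    /* hypothetical plugin init, wiring up the streaming callbacks */
    void
    _PG_output_plugin_init(OutputPluginCallbacks *cb)
    {
        cb->startup_cb = my_startup;
        cb->begin_cb = my_begin;
        cb->change_cb = my_change;
        cb->commit_cb = my_commit;

        /*
         * Defining any stream_* callback marks the plugin as streaming-
         * capable; start/stop/change/commit/abort are then all required,
         * while stream_message and stream_truncate remain optional.
         */
        cb->stream_start_cb = my_stream_start;
        cb->stream_stop_cb = my_stream_stop;
        cb->stream_change_cb = my_stream_change;
        cb->stream_commit_cb = my_stream_commit;
        cb->stream_abort_cb = my_stream_abort;
    }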

0006-Gracefully-handle-concurrent-aborts-of-uncommitted.patchapplication/octet-stream; name=0006-Gracefully-handle-concurrent-aborts-of-uncommitted.patchDownload
From 9777828b8f52fa1175fc1bf4ca2341b0bdb2cc9c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 3 Oct 2019 09:00:49 +0530
Subject: [PATCH 06/13] Gracefully handle concurrent aborts of uncommitted 
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, for example when decoding prepared
transactions on PREPARE (rather than at COMMIT PREPARED as before),
this may cause failures when the output plugin consults catalogs
(both system and user-defined).

We handle such failures by raising an error with the
ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from the system table scan
APIs in the backend decoding the uncommitted transaction. On receipt
of this sqlerrcode, the decoding logic aborts the ongoing decoding
and returns gracefully.
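
For illustration, the recheck added after each systable_* scan below
boils down to this pattern (simplified from the genam.c hunks):

    if (TransactionIdIsValid(CheckXidAlive) &&
        !TransactionIdIsInProgress(CheckXidAlive) &&
        !TransactionIdDidCommit(CheckXidAlive))
        ereport(ERROR,
                (errcode(ERRCODE_TRANSACTION_ROLLBACK),
                 errmsg("transaction aborted during system catalog scan")));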
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 51 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 34 +++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c | 32 +++++++++++++---
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++-
 src/include/utils/snapmgr.h                     |  4 +-
 6 files changed, 142 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index fc4ad65..da6a6f3 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e954482..6ce7878 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1303,6 +1303,17 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with a valid
+	 * CheckXidAlive for regular tables; error out below if we see one.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_getnext call")));
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1422,6 +1433,16 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with a valid
+	 * CheckXidAlive for regular tables; error out below if we see one.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_fetch call")));
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1535,6 +1556,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with a valid
+	 * CheckXidAlive for regular tables; error out below if we see one.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_hot_search_buffer call")));
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1683,6 +1714,16 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with a valid
+	 * CheckXidAlive for regular tables; error out below if we see one.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_get_latest_tid call")));
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5481,6 +5522,16 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with a valid
+	 * CheckXidAlive for regular tables; error out below if we see one.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_finish_speculative call")));
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d..201acfb 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,17 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, check whether it has aborted; if it has,
+	 * error out.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +525,17 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, check whether it has aborted; if it has,
+	 * error out.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +662,17 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, check whether it has aborted; if it has,
+	 * error out.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ca4b904..1955bc5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -679,7 +679,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1496,6 +1496,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	volatile CommandId command_id = FirstCommandId;
 	bool		using_subtxn;
 	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	MemoryContext ccxt = CurrentMemoryContext;
 
 	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
 								false);
@@ -1529,7 +1530,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1780,7 +1781,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1800,7 +1801,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 						/*
 						 * Every time the CommandId is incremented, we could
@@ -1879,6 +1880,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_CATCH();
 	{
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
+		/*
+		 * If the catalog scan raised ERRCODE_TRANSACTION_ROLLBACK (the decoded
+		 * transaction aborted concurrently), abort on the downstream side too
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			elog(LOG, "stopping decoding of %s (%u)",
+				 txn->gid[0] != '\0' ? txn->gid : "", txn->xid);
+			rb->abort(rb, txn, commit_lsn);
+		}
+
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
 
@@ -1902,7 +1917,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		/* remove potential on-disk data, and deallocate */
 		ReorderBufferCleanupTXN(rb, txn);
 
-		PG_RE_THROW();
+		/* re-throw only if it's not an abort */
+		if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+		else
+			FlushErrorState();
 	}
 	PG_END_TRY();
 }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 47b0517..9fa1e43 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check whether it is uncommitted and track
+ * it in CheckXidAlive, so its status can be re-checked during catalog access.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether it aborted; that check happens during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 67b07df..9a8f9ce 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1
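
A note for output plugin authors: with the checks above in place, a plugin
consulting a (user) catalog table during decoding should scan it via the
systable_* APIs, roughly as sketched below. This is an illustrative sketch,
not part of the patch; "rel" is assumed to be an already-opened catalog
relation, and passing a NULL snapshot makes the scan use the active
historic snapshot:

    SysScanDesc scan;
    HeapTuple   tup;

    scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
    while ((tup = systable_getnext(scan)) != NULL)
    {
        /*
         * Process the tuple. If the transaction being decoded aborts
         * concurrently, systable_getnext() now raises an error with
         * ERRCODE_TRANSACTION_ROLLBACK, which the decoding logic
         * catches and handles gracefully.
         */
    }
    systable_endscan(scan);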

0007-Implement-streaming-mode-in-ReorderBuffer.patchapplication/octet-stream; name=0007-Implement-streaming-mode-in-ReorderBuffer.patchDownload
From 63de67f99e363f506502b490965e662e6374d56e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 3 Oct 2019 09:02:37 +0530
Subject: [PATCH 07/13] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxacts with the toplevel xact) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds a ReorderBufferTXN pointer in two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to the toplevel xact (from a subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
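
In short, the memory-limit code now chooses between streaming and
spilling (simplified from the ReorderBufferCheckMemoryLimit() hunk
below):

    if (ReorderBufferCanStream(rb))
    {
        /* pick the largest toplevel transaction and stream it */
        txn = ReorderBufferLargestTopTXN(rb);
        ReorderBufferStreamTXN(rb, txn);
    }
    else
    {
        /* pick the largest (sub)transaction and spill it to disk */
        txn = ReorderBufferLargestTXN(rb);
        ReorderBufferSerializeTXN(rb, txn);
    }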
---
 src/backend/access/heap/heapam_visibility.c     |   38 +-
 src/backend/replication/logical/reorderbuffer.c | 1074 ++++++++++++++++++++++-
 src/include/replication/reorderbuffer.h         |   32 +
 3 files changed, 1111 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 537e681..76a105a 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1955bc5..c2d4677 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -149,6 +149,28 @@ typedef struct ReorderBufferIterTXNState
 	ReorderBufferIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
 } ReorderBufferIterTXNState;
 
+/*
+ * k-way in-order change iteration support structures
+ *
+ * This is a simplified version for streaming, which does not require
+ * serialization to files and only reads changes that are currently in
+ * memory.
+ */
+typedef struct ReorderBufferStreamIterTXNEntry
+{
+	XLogRecPtr	lsn;
+	ReorderBufferChange *change;
+	ReorderBufferTXN *txn;
+}			ReorderBufferStreamIterTXNEntry;
+
+typedef struct ReorderBufferStreamIterTXNState
+{
+	binaryheap *heap;
+	Size		nr_txns;
+	dlist_head	old_change;
+	ReorderBufferStreamIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
+}			ReorderBufferStreamIterTXNState;
+
 /* toast datastructures */
 typedef struct ReorderBufferToastEnt
 {
@@ -213,6 +235,20 @@ static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
 static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
 
+
+/* iterator for streaming (only get data from memory) */
+static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit(
+																		ReorderBuffer *rb,
+																		ReorderBufferTXN *txn);
+
+static ReorderBufferChange *ReorderBufferStreamIterTXNNext(
+							   ReorderBuffer *rb,
+							   ReorderBufferStreamIterTXNState * state);
+
+static void ReorderBufferStreamIterTXNFinish(
+								 ReorderBuffer *rb,
+								 ReorderBufferStreamIterTXNState * state);
+
 /*
  * ---------------------------------------
  * Disk serialization support functions
@@ -227,6 +263,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -235,6 +272,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -358,6 +404,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -755,6 +804,33 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+static void
+AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -851,6 +927,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -974,7 +1053,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1002,6 +1081,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	cur_txn_i;
 	int32		off;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(rb, txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1016,6 +1098,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(rb, cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1231,6 +1316,210 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 }
 
 /*
+ * Binary heap comparison function (streaming iterator).
+ */
+static int
+ReorderBufferStreamIterCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferStreamIterTXNState *state = (ReorderBufferStreamIterTXNState *) arg;
+	XLogRecPtr	pos_a = state->entries[DatumGetInt32(a)].lsn;
+	XLogRecPtr	pos_b = state->entries[DatumGetInt32(b)].lsn;
+
+	if (pos_a < pos_b)
+		return 1;
+	else if (pos_a == pos_b)
+		return 0;
+	return -1;
+}
+
+/*
+ * Allocate & initialize an iterator which iterates in lsn order over a
+ * transaction and all its subtransactions. This version is meant for
+ * streaming of incomplete transactions.
+ */
+static ReorderBufferStreamIterTXNState *
+ReorderBufferStreamIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Size		nr_txns = 0;
+	ReorderBufferStreamIterTXNState *state;
+	dlist_iter	cur_txn_i;
+	int32		off;
+
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(rb, txn);
+
+	/*
+	 * Calculate the size of our heap: one element for every transaction that
+	 * contains changes.  (Besides the transactions already in the reorder
+	 * buffer, we count the one we were directly passed.)
+	 */
+	if (txn->nentries > 0)
+		nr_txns++;
+
+	dlist_foreach(cur_txn_i, &txn->subtxns)
+	{
+		ReorderBufferTXN *cur_txn;
+
+		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(rb, cur_txn);
+
+		if (cur_txn->nentries > 0)
+			nr_txns++;
+	}
+
+	/*
+	 * TODO: Consider adding fastpath for the rather common nr_txns=1 case, no
+	 * need to allocate/build a heap then.
+	 */
+
+	/* allocate iteration state */
+	state = (ReorderBufferStreamIterTXNState *)
+		MemoryContextAllocZero(rb->context,
+							   sizeof(ReorderBufferStreamIterTXNState) +
+							   sizeof(ReorderBufferStreamIterTXNEntry) * nr_txns);
+
+	state->nr_txns = nr_txns;
+	dlist_init(&state->old_change);
+
+	/* allocate heap */
+	state->heap = binaryheap_allocate(state->nr_txns,
+									  ReorderBufferStreamIterCompare,
+									  state);
+
+	/*
+	 * Now insert items into the binary heap, in an unordered fashion.  (We
+	 * will run a heap assembly step at the end; this is more efficient.)
+	 */
+
+	off = 0;
+
+	/* add toplevel transaction if it contains changes */
+	if (txn->nentries > 0)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_head_element(ReorderBufferChange, node,
+										&txn->changes);
+
+		state->entries[off].lsn = cur_change->lsn;
+		state->entries[off].change = cur_change;
+		state->entries[off].txn = txn;
+
+		binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+	}
+
+	/* add subtransactions if they contain changes */
+	dlist_foreach(cur_txn_i, &txn->subtxns)
+	{
+		ReorderBufferTXN *cur_txn;
+
+		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+		if (cur_txn->nentries > 0)
+		{
+			ReorderBufferChange *cur_change;
+
+			cur_change = dlist_head_element(ReorderBufferChange, node,
+											&cur_txn->changes);
+
+			state->entries[off].lsn = cur_change->lsn;
+			state->entries[off].change = cur_change;
+			state->entries[off].txn = cur_txn;
+
+			binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+		}
+	}
+
+	Assert(off == nr_txns);
+
+	/* assemble a valid binary heap */
+	binaryheap_build(state->heap);
+
+	return state;
+}
+
+/*
+ * Return the next change when iterating over a transaction and its
+ * subtransactions.
+ *
+ * Returns NULL when no further changes exist.
+ */
+static ReorderBufferChange *
+ReorderBufferStreamIterTXNNext(ReorderBuffer *rb, ReorderBufferStreamIterTXNState * state)
+{
+	ReorderBufferChange *change;
+	ReorderBufferStreamIterTXNEntry *entry;
+	int32		off;
+
+	/* nothing there anymore */
+	if (state->heap->bh_size == 0)
+		return NULL;
+
+	off = DatumGetInt32(binaryheap_first(state->heap));
+	entry = &state->entries[off];
+
+	/* free memory we might have "leaked" in the previous *Next call */
+	if (!dlist_is_empty(&state->old_change))
+	{
+		change = dlist_container(ReorderBufferChange, node,
+								 dlist_pop_head_node(&state->old_change));
+		ReorderBufferReturnChange(rb, change);
+		Assert(dlist_is_empty(&state->old_change));
+	}
+
+	change = entry->change;
+
+	/*
+	 * update heap with information about which transaction has the next
+	 * relevant change in LSN order
+	 */
+
+	/* there are in-memory changes */
+	if (dlist_has_next(&entry->txn->changes, &entry->change->node))
+	{
+		dlist_node *next = dlist_next_node(&entry->txn->changes, &change->node);
+		ReorderBufferChange *next_change =
+		dlist_container(ReorderBufferChange, node, next);
+
+		/* txn stays the same */
+		state->entries[off].lsn = next_change->lsn;
+		state->entries[off].change = next_change;
+
+		binaryheap_replace_first(state->heap, Int32GetDatum(off));
+		return change;
+	}
+
+	/* ok, no changes there anymore, remove */
+	binaryheap_remove_first(state->heap);
+
+	return change;
+}
+
+/*
+ * Deallocate the iterator
+ */
+static void
+ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
+								 ReorderBufferStreamIterTXNState * state)
+{
+	/* free memory we might have "leaked" in the last *Next call */
+	if (!dlist_is_empty(&state->old_change))
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node,
+								 dlist_pop_head_node(&state->old_change));
+		ReorderBufferReturnChange(rb, change);
+		Assert(dlist_is_empty(&state->old_change));
+	}
+
+	binaryheap_free(state->heap);
+	pfree(state);
+}
+
+/*
  * Cleanup the contents of a transaction, usually after the transaction
  * committed or aborted.
  */
@@ -1323,33 +1612,104 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
  */
 static void
-ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	dlist_iter	iter;
-	HASHCTL		hash_ctl;
+	dlist_mutable_iter iter;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
-	hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
-	hash_ctl.hcxt = rb->context;
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
 
-	/*
-	 * create the hash with the exact number of to-be-stored tuplecids from
-	 * the start
-	 */
-	txn->tuplecid_hash =
-		hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
-					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
 
-	dlist_foreach(iter, &txn->tuplecids)
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
+ * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
+ * HeapTupleSatisfiesHistoricMVCC in heapam_visibility.c.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
+ */
+static void
+ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_iter	iter;
+	HASHCTL		hash_ctl;
+
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+
+	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
+	hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
+	hash_ctl.hcxt = rb->context;
+
+	/*
+	 * create the hash with the exact number of to-be-stored tuplecids from
+	 * the start
+	 */
+	txn->tuplecid_hash =
+		hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	dlist_foreach(iter, &txn->tuplecids)
 	{
 		ReorderBufferTupleCidKey key;
 		ReorderBufferTupleCidEnt *ent;
@@ -1399,6 +1759,16 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 }
 
+static void
+ReorderBufferDestroyTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+}
+
 /*
  * Copy a provided snapshot so we can modify it privately. This is needed so
  * that catalog modifying transactions can look into intermediate catalog
@@ -1472,6 +1842,19 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 		SnapBuildSnapDecRefcount(snap);
 }
 
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
+
+	ReorderBufferStreamTXN(rb, txn);
+
+	rb->stream_commit(rb, txn, txn->final_lsn);
+
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
 /*
  * Perform the replay of a transaction and its non-aborted subtransactions.
  *
@@ -1512,6 +1895,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
 	 * If this transaction has no snapshot, it didn't make any changes to the
 	 * database, so there's nothing to decode.  Note that
 	 * ReorderBufferCommitChild will have transferred any snapshots from
@@ -1546,6 +1945,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
+		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
@@ -1562,6 +1962,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			if (prev_lsn != InvalidXLogRecPtr)
+				Assert(prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -2037,6 +2447,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2172,8 +2589,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2181,6 +2607,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2192,19 +2619,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2232,6 +2668,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->data.tuplecid.combocid = combocid;
 	change->lsn = lsn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2307,6 +2744,9 @@ ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	for (i = 0; i < txn->ninvalidations; i++)
 		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+
+	/* Invalidate current schema as well */
+	txn->is_schema_sent = false;
 }
 
 /*
@@ -2321,6 +2761,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * We read catalog changes from WAL that have not been sent to the
+	 * output plugin yet, so invalidate the current schema so that the
+	 * plugin can resend it.
+	 */
+	txn->is_schema_sent = false;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+	{
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		txn->toptxn->is_schema_sent = false;
+	}
 }
 
 /*
@@ -2425,6 +2882,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (when streaming, we don't update the
+ * memory accounting for subtransactions, so their size is always 0). But we
+ * can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2444,15 +2933,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2737,6 +3257,504 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes to stream
+ * (it might have been streamed right before the commit, in which case
+ * the commit would attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+	bool		using_subtxn;
+	Size		streamed = 0;
+	ReorderBufferStreamIterTXNState *volatile iterstate = NULL;
+
+	/*
+	 * If this is a subxact, we need to stream the top-level transaction
+	 * instead.
+	 */
+	if (txn->toptxn)
+	{
+		ReorderBufferStreamTXN(rb, txn->toptxn);
+		return;
+	}
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this is the first time this transaction is being streamed */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+
+			if (subtxn->base_snapshot != NULL &&
+				(txn->base_snapshot == NULL ||
+				 txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
+			{
+				txn->base_snapshot = subtxn->base_snapshot;
+				txn->base_snapshot_lsn = subtxn->base_snapshot_lsn;
+				subtxn->base_snapshot = NULL;
+				subtxn->base_snapshot_lsn = InvalidXLogRecPtr;
+			}
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there's no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * TOCHECK: We have to rebuild the historic snapshot to be sure it
+		 * includes information about subtransactions that may have arrived
+		 * after the previous streaming run started.
+		 */
+		if (!txn->is_schema_sent)
+			snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+												 txn, command_id);
+		else
+			snapshot_now = txn->snapshot_now;
+	}
+
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
+	ReorderBufferBuildTupleCidHash(rb, txn);
+
+	/* setup the initial snapshot */
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
+
+	/*
+	 * Decoding needs access to syscaches et al., which in turn use
+	 * heavyweight locks and such. Thus we need to have enough state around to
+	 * keep track of those.  The easiest way is to simply use a transaction
+	 * internally.  That also allows us to easily enforce that nothing writes
+	 * to the database by checking for xid assignments.
+	 *
+	 * When we're called via the SQL SRF there's already a transaction
+	 * started, so start an explicit subtransaction there.
+	 */
+	using_subtxn = IsTransactionOrTransactionBlock();
+
+	PG_TRY();
+	{
+		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+		ReorderBufferChange *change;
+		ReorderBufferChange *specinsert = NULL;
+
+		if (using_subtxn)
+			BeginInternalSubTransaction("stream");
+		else
+			StartTransactionCommand();
+
+		/* start streaming this chunk of transaction */
+		rb->stream_start(rb, txn);
+
+		iterstate = ReorderBufferStreamIterTXNInit(rb, txn);
+		while ((change = ReorderBufferStreamIterTXNNext(rb, iterstate)) != NULL)
+		{
+			Relation	relation = NULL;
+			Oid			reloid;
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			if (prev_lsn != InvalidXLogRecPtr)
+				Assert(prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* we're going to stream this change */
+			streamed++;
+
+			switch (change->action)
+			{
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+
+					/*
+					 * Confirmation for speculative insertion arrived. Simply
+					 * use as a normal record. It'll be cleaned up at the end
+					 * of INSERT processing.
+					 */
+					Assert(specinsert->data.tp.oldtuple == NULL);
+					change = specinsert;
+					change->action = REORDER_BUFFER_CHANGE_INSERT;
+
+					/* intentionally fall through */
+				case REORDER_BUFFER_CHANGE_INSERT:
+				case REORDER_BUFFER_CHANGE_UPDATE:
+				case REORDER_BUFFER_CHANGE_DELETE:
+					Assert(snapshot_now);
+
+					reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
+												change->data.tp.relnode.relNode);
+
+					/*
+					 * Catalog tuple without data, emitted while catalog was
+					 * in the process of being rewritten.
+					 */
+					if (reloid == InvalidOid &&
+						change->data.tp.newtuple == NULL &&
+						change->data.tp.oldtuple == NULL)
+						goto change_done;
+					else if (reloid == InvalidOid)
+						elog(ERROR, "could not map filenode \"%s\" to relation OID",
+							 relpathperm(change->data.tp.relnode,
+										 MAIN_FORKNUM));
+
+					relation = RelationIdGetRelation(reloid);
+
+					if (relation == NULL)
+						elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
+							 reloid,
+							 relpathperm(change->data.tp.relnode,
+										 MAIN_FORKNUM));
+
+					if (!RelationIsLogicallyLogged(relation))
+						goto change_done;
+
+					/*
+					 * For now ignore sequence changes entirely. Most of the
+					 * time they don't log changes using records we
+					 * understand, so it doesn't make sense to handle the few
+					 * cases we do.
+					 */
+					if (relation->rd_rel->relkind == RELKIND_SEQUENCE)
+						goto change_done;
+
+					/* user-triggered change */
+					if (!IsToastRelation(relation))
+					{
+						ReorderBufferToastReplace(rb, txn, relation, change);
+						rb->stream_change(rb, txn, relation, change);
+
+						/*
+						 * Only clear reassembled toast chunks if we're sure
+						 * they're not required anymore. The creator of the
+						 * tuple tells us.
+						 */
+						if (change->data.tp.clear_toast_afterwards)
+							ReorderBufferToastReset(rb, txn);
+					}
+					/* we're not interested in toast deletions */
+					else if (change->action == REORDER_BUFFER_CHANGE_INSERT)
+					{
+						/*
+						 * Need to reassemble the full toasted Datum in
+						 * memory, to ensure the chunks don't get reused till
+						 * we're done, remove it from the list of this
+						 * transaction's changes. Otherwise it will get
+						 * freed/reused while restoring spooled data from
+						 * disk.
+						 */
+						dlist_delete(&change->node);
+						ReorderBufferToastAppendChunk(rb, txn, relation,
+													  change);
+					}
+
+			change_done:
+
+					/*
+					 * Either speculative insertion was confirmed, or it was
+					 * unsuccessful and the record isn't needed anymore.
+					 */
+					if (specinsert != NULL)
+					{
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+
+					if (relation != NULL)
+					{
+						RelationClose(relation);
+						relation = NULL;
+					}
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+
+					/*
+					 * Speculative insertions are dealt with by delaying the
+					 * processing of the insert until the confirmation record
+					 * arrives. For that we simply unlink the record from the
+					 * chain, so it does not get freed/reused while restoring
+					 * spooled data from disk.
+					 *
+					 * This is safe in the face of concurrent catalog changes
+					 * because the relevant relation can't be changed between
+					 * speculative insertion and confirmation due to
+					 * CheckTableNotInUse() and locking.
+					 */
+
+					/* clear out a pending (and thus failed) speculation */
+					if (specinsert != NULL)
+					{
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+
+					/* and memorize the pending insertion */
+					dlist_delete(&change->node);
+					specinsert = change;
+					break;
+
+				case REORDER_BUFFER_CHANGE_TRUNCATE:
+					{
+						int			i;
+						int			nrelids = change->data.truncate.nrelids;
+						int			nrelations = 0;
+						Relation   *relations;
+
+						relations = palloc0(nrelids * sizeof(Relation));
+						for (i = 0; i < nrelids; i++)
+						{
+							Oid			relid = change->data.truncate.relids[i];
+							Relation	relation;
+
+							relation = RelationIdGetRelation(relid);
+
+							if (relation == NULL)
+								elog(ERROR, "could not open relation with OID %u", relid);
+
+							if (!RelationIsLogicallyLogged(relation))
+								continue;
+
+							relations[nrelations++] = relation;
+						}
+
+						rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+						for (i = 0; i < nrelations; i++)
+							RelationClose(relations[i]);
+
+						break;
+					}
+
+				case REORDER_BUFFER_CHANGE_MESSAGE:
+
+					rb->stream_message(rb, txn, change->lsn, true,
+									   change->data.msg.prefix,
+									   change->data.msg.message_size,
+									   change->data.msg.message);
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+					/* get rid of the old */
+					TeardownHistoricSnapshot(false);
+
+					if (snapshot_now->copied)
+					{
+						ReorderBufferFreeSnap(rb, snapshot_now);
+						snapshot_now =
+							ReorderBufferCopySnap(rb, change->data.snapshot,
+												  txn, command_id);
+					}
+
+					/*
+					 * Restored from disk, need to be careful not to double
+					 * free. We could introduce refcounting for that, but for
+					 * now this seems infrequent enough not to care.
+					 */
+					else if (change->data.snapshot->copied)
+					{
+						snapshot_now =
+							ReorderBufferCopySnap(rb, change->data.snapshot,
+												  txn, command_id);
+					}
+					else
+					{
+						snapshot_now = change->data.snapshot;
+					}
+
+					/*
+					 * TOCHECK: The snapshot changed, so invalidate the
+					 * current schema to reflect possible catalog changes.
+					 */
+					txn->is_schema_sent = false;
+
+					/* and continue with the new one */
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+					Assert(change->data.command_id != InvalidCommandId);
+
+					if (command_id < change->data.command_id)
+					{
+						command_id = change->data.command_id;
+
+						if (!snapshot_now->copied)
+						{
+							/* we don't use the global one anymore */
+							snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+																 txn, command_id);
+						}
+
+						snapshot_now->curcid = command_id;
+
+						TeardownHistoricSnapshot(false);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
+
+						/*
+						 * Every time the CommandId is incremented, we could
+						 * see new catalog contents, so execute all
+						 * invalidations.
+						 */
+						ReorderBufferExecuteInvalidations(rb, txn);
+					}
+
+					break;
+
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+					elog(ERROR, "tuplecid value in changequeue");
+					break;
+			}
+		}
+
+		/*
+		 * There's a speculative insertion remaining, just clean it up, it
+		 * can't have been successful, otherwise we'd have gotten a
+		 * confirmation record.
+		 */
+		if (specinsert)
+		{
+			ReorderBufferReturnChange(rb, specinsert);
+			specinsert = NULL;
+		}
+
+		/* clean up the iterator */
+		ReorderBufferStreamIterTXNFinish(rb, iterstate);
+		iterstate = NULL;
+
+		/* call stream_stop callback */
+		rb->stream_stop(rb, txn);
+
+		/* this is just a sanity check against bad output plugin behaviour */
+		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
+			elog(ERROR, "output plugin used XID %u",
+				 GetCurrentTransactionId());
+
+		/* remember the command ID and snapshot for the streaming run */
+		txn->command_id = command_id;
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+
+		/* cleanup */
+		TeardownHistoricSnapshot(false);
+
+		/*
+		 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+		 * any memory. We could also keep the hash table and update it with
+		 * new ctid values, but this seems simpler and good enough for now.
+		 */
+		ReorderBufferDestroyTupleCidHash(rb, txn);
+
+		/*
+		 * Aborting the current (sub-)transaction as a whole has the right
+		 * semantics. We want all locks acquired in here to be released, not
+		 * reassigned to the parent, and we do not want any database access
+		 * to have persistent effects.
+		 */
+		AbortCurrentTransaction();
+
+		/* make sure there's no cache pollution */
+		ReorderBufferExecuteInvalidations(rb, txn);
+
+		if (using_subtxn)
+			RollbackAndReleaseCurrentSubTransaction();
+	}
+	PG_CATCH();
+	{
+		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+		if (iterstate)
+			ReorderBufferStreamIterTXNFinish(rb, iterstate);
+
+		TeardownHistoricSnapshot(true);
+
+		/*
+		 * Force cache invalidation to happen outside of a valid transaction
+		 * to prevent catalog access as we just caught an error.
+		 */
+		AbortCurrentTransaction();
+
+		/* make sure there's no cache pollution */
+		ReorderBufferExecuteInvalidations(rb, txn);
+
+		if (using_subtxn)
+			RollbackAndReleaseCurrentSubTransaction();
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/*
+	 * Discard the changes that we just streamed, and mark the transactions
+	 * as streamed (if they contained changes).
+	 */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
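
Since the hunks above show only fragments of the modified function, here is
a compact restatement of the eviction policy ReorderBufferCheckMemoryLimit
now implements. This is an illustrative sketch, not part of the patch; it
assumes the limit GUC is the logical_decoding_work_mem variable referenced
in the comments above, and that it is expressed in kilobytes.

static void
demo_check_memory_limit(ReorderBuffer *rb)
{
	ReorderBufferTXN *txn;

	/* below the limit, nothing to evict */
	if (rb->size < logical_decoding_work_mem * 1024L)
		return;

	if (ReorderBufferCanStream(rb))
	{
		/* stream the already decoded part of the largest toplevel xact */
		txn = ReorderBufferLargestTopTXN(rb);
		ReorderBufferStreamTXN(rb, txn);
	}
	else
	{
		/* spill the largest (sub)transaction to disk */
		txn = ReorderBufferLargestTXN(rb);
		ReorderBufferSerializeTXN(rb, txn);
	}

	/* either way, the evicted transaction now uses no memory */
	Assert(txn->size == 0);
}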
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index afe690a..f39b34f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* does the txn have catalog changes */
 #define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -187,6 +188,20 @@ typedef struct ReorderBufferChange
  */
 #define rbtxn_is_serialized(txn)       (txn->txn_flags & RBTXN_IS_SERIALIZED)
 
+/*
+ * Has this transaction been streamed to downstream? Similarly to spilling
+ * to disk, it's not trivial to deduce this from nentries and nentries_mem,
+ * for various reasons. For example, all changes may be in subtransactions
+ * in which case we'd have nentries==0 for the toplevel one, and it'd say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.
+ *
+ * Note: We never stream and serialize a transaction at the same time (we
+ * only spill to disk when streaming is not supported by the plugin),
+ * so only one of those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn)         (txn->txn_flags & RBTXN_IS_STREAMED)
+
 typedef struct ReorderBufferTXN
 {
 	int     txn_flags;
@@ -222,6 +237,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Has the output plugin already sent the schema for this
+	 * transaction?
+	 */
+	bool		is_schema_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -252,6 +277,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
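
A short illustration before the next patch: the streaming code above calls
rb->stream_start, rb->stream_change and rb->stream_stop, which an output
plugin has to provide. A minimal sketch of such callbacks might look like
the following; the demo_* names are hypothetical, the signatures are
assumptions based on how the reorderbuffer invokes the callbacks, and
OutputPluginPrepareWrite/OutputPluginWrite are the existing output plugin
helpers.

#include "postgres.h"

#include "replication/logical.h"
#include "replication/output_plugin.h"
#include "utils/rel.h"

/* hypothetical: called when a chunk of an in-progress xact starts */
static void
demo_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "opening stream for xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* hypothetical: called for every streamed change */
static void
demo_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				   Relation relation, ReorderBufferChange *change)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "streamed change in xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* hypothetical: called when the current chunk ends */
static void
demo_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "closing stream for xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

A plugin providing these callbacks (plus the abort/commit counterparts)
also advertises streaming support, which is what the ctx->streaming flag
checked by ReorderBufferCanStream reflects.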

0008-Add-support-for-streaming-to-built-in-replication.patchapplication/octet-stream; name=0008-Add-support-for-streaming-to-built-in-replication.patchDownload
From 71e1f6727e3bedcd372d3995022c53d9eb011b23 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 18:56:18 +0200
Subject: [PATCH 08/13] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in replication, we need to do three things:

* Extend the logical replication protocol, to identify in-progress
transactions and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying it on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    5 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   60 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    8 +-
 src/backend/replication/logical/launcher.c         |    2 +
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  157 ++-
 src/backend/replication/logical/worker.c           | 1031 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  263 ++++-
 src/backend/replication/slotfuncs.c                |    7 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2027 insertions(+), 42 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 6dfb2e4..e1fb907 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 91790b0..d9abf5e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -218,6 +218,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 7e3ba8e..a1c2a9f 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -72,6 +72,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->workmem = subform->subworkmem;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index d85e831..5b3b095 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
 						   bool *refresh, int *logical_wm,
-						   bool *logical_wm_given)
+						   bool *logical_wm_given, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -100,6 +101,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*refresh = true;
 	if (logical_wm)
 		*logical_wm_given = false;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,26 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 
 			*logical_wm_given = true;
 			*logical_wm = defGetInt32(defel);
+
+			/*
+			 * Check that the value is not smaller than 64kB (which is
+			 * the minimum value for logical_work_mem).
+			 */
+			if (*logical_wm < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("%d is outside the valid range for parameter \"work_mem\" (64 .. 2147483647)",
+								*logical_wm)));
+		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
 		}
 		else
 			ereport(ERROR,
@@ -340,6 +363,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	int			logical_wm;
 	bool		logical_wm_given;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -356,7 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL, &logical_wm, &logical_wm_given);
+							   NULL, &logical_wm, &logical_wm_given,
+							   &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -438,7 +464,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 		values[Anum_pg_subscription_subworkmem - 1] =
 			Int32GetDatum(logical_wm);
 	else
-		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(-1);
+
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
 
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
@@ -705,11 +739,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				int			logical_wm;
 				bool		logical_wm_given;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
 										   NULL, &synchronous_commit, NULL,
-										   &logical_wm, &logical_wm_given);
+										   &logical_wm, &logical_wm_given,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -741,6 +778,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subworkmem - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -753,7 +797,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
 										   NULL, NULL, NULL, NULL, NULL,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -791,7 +835,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh, NULL, NULL);
+										   NULL, &refresh, NULL, NULL,
+										   NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -828,7 +873,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 011076c..b22a053 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4106,6 +4106,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 65b3266..9baf61f 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,8 +406,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
-		appendStringInfo(&cmd, ", work_mem '%d'",
-						 options->proto.logical.work_mem);
+		if (options->proto.logical.work_mem != -1)
+			appendStringInfo(&cmd, ", work_mem '%d'",
+							 options->proto.logical.work_mem);
+
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
 
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 186057b..02574a0 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>
 
 #include "postgres.h"
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3230e45..d5b5fe1 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1153,7 +1153,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1198,7 +1198,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index e7df47d..5a379fb 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,7 +139,8 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
@@ -147,6 +148,10 @@ logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -182,8 +187,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -191,6 +196,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -252,13 +261,18 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
+	pq_sendbyte(out, 'D');		/* action DELETE */
+
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
-	pq_sendbyte(out, 'D');		/* action DELETE */
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
 
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
@@ -300,6 +314,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -309,6 +324,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -351,12 +370,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -401,7 +424,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -409,6 +432,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -689,3 +716,119 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgint(in, 4) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're in a streamed transaction, so it must be valid) */
+	pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+	TransactionId xid;
+
+	xid = pq_getmsgint(in, 4);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (the transaction being committed must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID and subtransaction ID (both must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
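
As a compact reference for the wire format defined above, a consumer of the
new messages could dispatch on the action byte as sketched below. This just
mirrors the logicalrep_read_stream_* functions (the real dispatch is in the
apply worker that follows); demo_handle_stream_message is a hypothetical
name.

#include "postgres.h"

#include "replication/logicalproto.h"

/*
 * Illustrative dispatch over the new streaming actions:
 *   'S' STREAM START  - int32 xid, int32 first-segment flag
 *   'E' STREAM END    - int32 xid
 *   'c' STREAM COMMIT - int32 xid, uint8 flags, int64 lsn/end_lsn/time
 *   'A' STREAM ABORT  - int32 xid, int32 subxid
 */
static void
demo_handle_stream_message(char action, StringInfo s)
{
	switch (action)
	{
		case 'S':
			{
				bool		first_segment;
				TransactionId xid;

				xid = logicalrep_read_stream_start(s, &first_segment);
				elog(DEBUG1, "stream start for xid %u (first segment: %d)",
					 xid, first_segment ? 1 : 0);
				break;
			}
		case 'E':
			elog(DEBUG1, "stream stop for xid %u",
				 logicalrep_read_stream_stop(s));
			break;
		case 'c':
			{
				LogicalRepCommitData commit_data;
				TransactionId xid;

				xid = logicalrep_read_stream_commit(s, &commit_data);
				elog(DEBUG1, "stream commit for xid %u", xid);
				break;
			}
		case 'A':
			{
				TransactionId xid;
				TransactionId subxid;

				logicalrep_read_stream_abort(s, &xid, &subxid);
				elog(DEBUG1, "stream abort for xid %u (subxid %u)",
					 xid, subxid);
				break;
			}
	}
}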
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f737afb..3493b02 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * also has to deal with aborts of both the toplevel transaction and of
+ * individual subtransactions. This is achieved by tracking the offset of
+ * each subtransaction's first change, which is then used to truncate the
+ * file with serialized changes.
+ *
+ * The files are placed in /tmp by default, and the filenames include both
+ * the XID of the toplevel transaction and the OID of the subscription. This
+ * is necessary so that different workers processing a remote transaction
+ * with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -59,6 +81,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -66,6 +89,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -105,6 +129,50 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * XIDs of streamed transactions with changes serialized to disk, tracked
+ * so the files can be cleaned up on worker exit.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
@@ -114,6 +182,9 @@ static void maybe_reread_subscription(void);
 /* Flags set by signal handlers */
 static volatile sig_atomic_t got_SIGHUP = false;
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -165,6 +236,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a chunk of a streamed transaction), we
+ * simply redirect the message to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -512,6 +619,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, restore the subxact info that was
+	 * serialized at the previous stream stop.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive an abort
+		 * for a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* FIXME optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/* We should not receive aborts for unknown subtransactions. */
+		Assert(found);
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -524,6 +943,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -539,6 +961,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -574,6 +999,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -677,6 +1105,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -797,6 +1228,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -896,6 +1330,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -987,6 +1424,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1084,6 +1537,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1097,6 +1566,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1546,6 +2018,564 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we do free the memory allocated for the subxact info. There might
+	 * be one exceptional transaction with many subxacts, and we don't want
+	 * to keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
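+	/* (my_log2 rounds up, so nsubxacts_max is always >= nsubxacts) */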
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're adding a change for the same subxact as in the
+	 * previous call, so we can simply ignore it (we've already seen it).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
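+	/* The subxact's first change starts at the current end of the file. */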
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by moving the last element into its place. The array is
+	 * bound to be fairly small (the maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions), so simply loop
+	 * through the array to find the index of the XID.
+	 *
+	 * Notice we also call this from stream_open_file for the first
+	 * segment of each transaction, to deal with possible left-overs
+	 * after a crash, so it's entirely possible not to find the XID in
+	 * the array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Remove the XID by moving the last entry of the array into its
+	 * place. We don't keep the streamed transactions sorted or anything -
+	 * we only expect a few of them in progress (max_connections +
+	 * max_prepared_xacts), so a linear search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length field itself), action code (identifying the message type)
+ * and message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C checksum of the contents. Maybe
+ * we should include something like that here too, but doing so will not
+ * be as straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* SIGHUP: set flag to reload configuration at next convenient time */
 static void
 logicalrep_worker_sighup(SIGNAL_ARGS)
@@ -1726,6 +2756,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.work_mem = MySubscription->workmem;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
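
Aside, for reviewers: the format written by stream_write_change is
symmetric, so the apply side only needs a short loop to read the changes
back. A minimal sketch of reading one change follows - read_one_change
and its error handling are illustrative only, not names from the patch:

	/*
	 * Sketch: read one change back from a file written by
	 * stream_write_change - the length (not counting itself), one
	 * action byte, then the payload.
	 */
	static bool
	read_one_change(int fd, char *action, StringInfo s)
	{
		int			len;
		int			nbytes;

		nbytes = read(fd, &len, sizeof(len));
		if (nbytes == 0)
			return false;		/* clean EOF, no more changes */
		if (nbytes != sizeof(len))
			elog(ERROR, "could not read length of serialized change");

		if (read(fd, action, sizeof(char)) != sizeof(char))
			elog(ERROR, "could not read action of serialized change");

		/* the payload is everything after the action byte */
		len -= sizeof(char);

		resetStringInfo(s);
		enlargeStringInfo(s, len + 1);
		if (read(fd, s->data, len) != len)
			elog(ERROR, "could not read serialized change");
		s->data[len] = '\0';
		s->len = len;

		return true;
	}
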
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 317c5d4..0faa701 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -48,16 +48,42 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order the transactions are sent in. So streamed transactions are
+ * handled separately, using the schema_sent flag in ReorderBufferTXN.
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -67,6 +93,7 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
@@ -87,16 +114,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, int *logical_decoding_work_mem)
+						List **publication_names, int *logical_decoding_work_mem,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		work_mem_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -165,6 +202,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*logical_decoding_work_mem = (int)parsed;
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -177,6 +231,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -200,7 +255,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&logical_decoding_work_mem);
+								&logical_decoding_work_mem,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -220,6 +276,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version,
+		 * and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -287,9 +364,42 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't
+	 * care whether it's the top-level transaction or not (we have already
+	 * sent that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied only later (and the regular
+	 * transactions won't see their effects until then), and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change,
+		 * which may happen when streaming has already started, so we have
+		 * to track new catalog changes somehow.
+		 */
+		schema_sent = txn->is_schema_sent;
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -315,19 +425,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			txn->is_schema_sent = true;
+		else
+			relentry->schema_sent = true;
 	}
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -336,6 +453,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -364,14 +485,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -381,7 +502,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -390,7 +511,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -416,6 +537,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -436,13 +561,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -516,6 +642,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
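+/*
+ * Notify downstream about the start of a block of changes streamed for
+ * the given in-progress transaction.
+ */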
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
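+/*
+ * Notify downstream about the end of the current block of streamed changes.
+ */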
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -626,6 +837,34 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
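
Aside: a third-party output plugin would opt into streaming the same way
pgoutput does - provide the stream_* callbacks and track whether it's
inside a streamed block. A minimal sketch of the start/stop pair (the
my_* names are hypothetical, not part of the patch):

	static bool my_in_streaming = false;	/* hypothetical plugin state */

	static void
	my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
	{
		/* blocks of streamed changes must not nest */
		Assert(!my_in_streaming);

		/* emit a plugin-specific "chunk start" message for txn->xid here */
		my_in_streaming = true;
	}

	static void
	my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
	{
		Assert(my_in_streaming);

		/* emit a plugin-specific "chunk end" message for txn->xid here */
		my_in_streaming = false;
	}
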
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 42da631..507661c 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,13 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index eb4a98c..1c0c020 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -944,6 +944,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 10ea113..8793676 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -50,6 +50,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	int32		subworkmem;		/* Memory to use to decode changes. */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -76,6 +78,7 @@ typedef struct Subscription
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	int			workmem;		/* Memory to decode changes. */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fe076d8..bc45194 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -944,7 +944,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 3fc430a..bf02cbc 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out, TransactionId xid);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
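
The new read/write pairs above follow the same symmetric convention as
the existing protocol functions. For illustration, the stream-abort pair
could look like the sketch below (the 'A' message byte is an assumption
here - the authoritative byte assignments are in the proto.c part of the
patch; the reader does not consume the kind byte because the apply loop
dispatches on it):

	void
	logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
								  TransactionId subxid)
	{
		pq_sendbyte(out, 'A');		/* message kind (illustrative) */
		pq_sendint32(out, xid);		/* toplevel transaction */
		pq_sendint32(out, subxid);	/* aborted subxact (== xid if toplevel) */
	}

	void
	logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
								 TransactionId *subxid)
	{
		*xid = pq_getmsgint(in, 4);
		*subxid = pq_getmsgint(in, 4);
	}
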
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4e68a69..fe6acb4 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -163,6 +163,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			int			work_mem;	/* Memory limit to use for decoding */
+			bool		streaming;	/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..4d01f7e
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..1a8b8ff
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..04af090
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..6fecfe6
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..50990c1
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test behavior with streaming transaction exceeding logical_work_mem
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.patch (application/octet-stream)
From d2830f660ca86dd84b5c377748d133cfe57463ae Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:04:54 +0200
Subject: [PATCH 01/13] Add logical_decoding_work_mem to limit ReorderBuffer
 memory usage

Instead of deciding to serialize a transaction merely based on the
number of changes in that xact (toplevel or subxact), this makes
the decision based on the amount of memory consumed by the changes.

The memory limit is defined by a new logical_decoding_work_mem GUC,
so for example we can do this

    SET logical_decoding_work_mem = '128kB'

to trigger very aggressive streaming. The minimum value is 64kB.

When adding a change to a transaction, we account for the size in
two places. Firstly, in the ReorderBuffer, which is then used to
decide if we reached the total memory limit. And secondly in the
transaction the change belongs to, so that we can pick the largest
transaction to evict (and serialize to disk).

We still use max_changes_in_memory when loading changes serialized
to disk. The trouble is we can't use the memory limit directly, as
there might be multiple subxacts serialized - we need to read all
of them, but we don't know how many there are (or which subxact to
read first).

We do not serialize the ReorderBufferTXN entries, so if there is a
transaction with many subxacts, most memory may be in this type of
objects. Those records are not included in the memory accounting.

We also do not account for INTERNAL_TUPLECID changes, which are
kept in a separate list and not evicted from memory. Transactions
with many CTID changes may consume significant amounts of memory,
but we can't really do much about that.

The current eviction algorithm is very simple - the transaction is
picked merely by size, while it might be useful to also consider age
(LSN) of the changes for example. With the new Generational memory
allocator, evicting the oldest changes would make it more likely
the memory gets actually pfreed.

The logical_decoding_work_mem may be set either in postgresql.conf,
in which case it serves as the default for all publishers on that
instance, or when creating the subscription, using a work_mem
parameter in the WITH clause (specifying the number of kilobytes).
---
 doc/src/sgml/config.sgml                           |  21 ++
 doc/src/sgml/ref/create_subscription.sgml          |  12 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  44 +++-
 .../libpqwalreceiver/libpqwalreceiver.c            |   3 +
 src/backend/replication/logical/reorderbuffer.c    | 292 ++++++++++++++++++++-
 src/backend/replication/logical/worker.c           |   1 +
 src/backend/replication/pgoutput/pgoutput.c        |  30 ++-
 src/backend/utils/misc/guc.c                       |  36 +++
 src/backend/utils/misc/postgresql.conf.sample      |   1 +
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/replication/reorderbuffer.h            |  16 ++
 src/include/replication/walreceiver.h              |   1 +
 13 files changed, 441 insertions(+), 20 deletions(-)
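
To make the two-level accounting described above concrete, the update
performed for every added or removed change boils down to something
like the sketch below (the helper name and struct fields are
assumptions for illustration, not necessarily the exact code in this
patch):

	/* Sketch: keep per-transaction and total memory accounting in sync. */
	static void
	ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
									ReorderBufferChange *change,
									bool addition)
	{
		Size		sz = ReorderBufferChangeSize(change);

		if (addition)
		{
			change->txn->size += sz;	/* for picking the largest xact */
			rb->size += sz;				/* checked against the GUC limit */
		}
		else
		{
			Assert(change->txn->size >= sz);
			change->txn->size -= sz;
			rb->size -= sz;
		}
	}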

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 619ac8c..a207ac2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1715,6 +1715,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..91790b0 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index afee283..7e3ba8e 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -71,6 +71,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 2e67a58..d85e831 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -66,7 +66,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -97,6 +98,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -182,6 +185,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -325,6 +338,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -341,7 +356,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -419,6 +434,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -682,10 +703,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -710,6 +734,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -721,7 +752,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -759,7 +791,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -796,7 +828,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 6eba08a..65b3266 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8ce28ad..6228140 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -49,6 +49,34 @@
  *	  GenerationContext for the variable-length transaction data (allocated
  *	  and freed in groups with similar lifespan).
  *
+ *	  To limit the amount of memory used by decoded changes, we track memory
+ *	  used at the reorder buffer level (i.e. total amount of memory), and for
+ *	  each toplevel transaction. When the total amount of used memory exceeds
+ *	  the limit, the toplevel transaction consuming the most memory is then
+ *	  serialized to disk.
+ *
+ *	  Only decoded changes are evicted from memory (spilled to disk), not the
+ *	  transaction records. The number of toplevel transactions is limited,
+ *	  but a transaction with many subtransactions may still consume significant
+ *	  amounts of memory. The transaction records are fairly small, though, and
+ *	  are not included in the memory limit.
+ *
+ *	  The current eviction algorithm is very simple - the transaction is
+ *	  picked merely by size, while it might be useful to also consider age
+ *	  (LSN) of the changes for example. With the new Generational memory
+ *	  allocator, evicting the oldest changes would make it more likely the
+ *	  memory gets actually freed.
+ *
+ *	  We still rely on max_changes_in_memory when loading serialized changes
+ *	  back into memory. At that point we can't use the memory limit directly
+ *	  as we load the subxacts independently. One option to deal with this
+ *	  would be to count the subxacts, and allow each to allocate 1/N of the
+ *	  memory limit. That however does not seem very appealing, because with
+ *	  many subtransactions it may easily cause thrashing (short cycles of
+ *	  deserializing and applying very few changes). We probably should give
+ *	  a bit more memory to the oldest subtransactions, as they are likely
+ *	  the source of the next sequence of changes.
+ *
  * -------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -154,7 +182,8 @@ typedef struct ReorderBufferDiskChange
  * resource management here, but it's not entirely clear what that would look
  * like.
  */
-static const Size max_changes_in_memory = 4096;
+int			logical_decoding_work_mem;
+static const Size max_changes_in_memory = 4096; /* XXX for restore only */
 
 /* ---------------------------------------
  * primary reorderbuffer support routines
@@ -189,7 +218,7 @@ static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTX
  * Disk serialization support functions
  * ---------------------------------------
  */
-static void ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferCheckMemoryLimit(ReorderBuffer *rb);
 static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 										 int fd, ReorderBufferChange *change);
@@ -217,6 +246,14 @@ static void ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
 										  Relation relation, ReorderBufferChange *change);
 
+/*
+ * ---------------------------------------
+ * memory accounting
+ * ---------------------------------------
+ */
+static Size ReorderBufferChangeSize(ReorderBufferChange *change);
+static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
+								ReorderBufferChange *change, bool addition);
 
 /*
  * Allocate a new ReorderBuffer and clean out any old serialized state from
@@ -269,6 +306,7 @@ ReorderBufferAllocate(void)
 
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
+	buffer->size = 0;
 
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
@@ -374,6 +412,9 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 void
 ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 {
+	/* update memory accounting info */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
 	/* free contained data */
 	switch (change->action)
 	{
@@ -585,12 +626,18 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	change->lsn = lsn;
+	change->txn = txn;
+
 	Assert(InvalidXLogRecPtr != lsn);
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries++;
 	txn->nentries_mem++;
 
-	ReorderBufferCheckSerializeTXN(rb, txn);
+	/* update memory accounting information */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
+
+	/* check the memory limits and evict something if needed */
+	ReorderBufferCheckMemoryLimit(rb);
 }
 
 /*
@@ -1217,6 +1264,9 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
 		ReorderBufferReturnChange(rb, change);
 	}
 
@@ -1229,7 +1279,11 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferChange *change;
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
 		ReorderBufferReturnChange(rb, change);
 	}
 
@@ -2082,9 +2136,48 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferQueueChange(rb, xid, lsn, change);
 }
 
+/*
+ * Update the memory accounting info. We track memory used by the whole
+ * reorder buffer and the transaction containing the change.
+ */
+static void
+ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
+								ReorderBufferChange *change,
+								bool addition)
+{
+	Size		sz;
+
+	Assert(change->txn);
+
+	/*
+	 * Ignore tuple CID changes, because those are not evicted when
+	 * reaching memory limit. So we just don't count them, because it
+	 * might easily trigger a pointless attempt to spill/stream.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+		return;
+
+	sz = ReorderBufferChangeSize(change);
+
+	if (addition)
+	{
+		change->txn->size += sz;
+		rb->size += sz;
+	}
+	else
+	{
+		Assert((rb->size >= sz) && (change->txn->size >= sz));
+		change->txn->size -= sz;
+		rb->size -= sz;
+	}
+}
 
 /*
  * Add new (relfilenode, tid) -> (cmin, cmax) mappings.
+ *
+ * We do not include this change type in memory accounting, because we
+ * keep CIDs in a separate list and do not evict them when reaching
+ * the memory limit.
  */
 void
 ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
@@ -2230,20 +2323,84 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 }
 
 /*
- * Check whether the transaction tx should spill its data to disk.
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ *
+ * XXX With many subtransactions this might be quite slow, because we'll have
+ * to walk through all of them. There are some options how we could improve
+ * that: (a) maintain some secondary structure with transactions sorted by
+ * amount of changes, (b) not looking for the entirely largest transaction,
+ * but e.g. for transaction using at least some fraction of the memory limit,
+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
+ * of the memory limit (e.g. 50%).
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)
+{
+	HASH_SEQ_STATUS hash_seq;
+	ReorderBufferTXNByIdEnt	*ent;
+	ReorderBufferTXN *largest = NULL;
+
+	hash_seq_init(&hash_seq, rb->by_txn);
+	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	{
+		ReorderBufferTXN *txn = ent->txn;
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
+ * Check whether the logical_decoding_work_mem limit was reached, and if yes
+ * pick the transaction to evict and spill the changes to disk.
+ *
+ * XXX At this point we select just a single (largest) transaction, but
+ * we might also adapt a more elaborate eviction strategy - for example
+ * evicting enough transactions to free certain fraction (e.g. 50%) of
+ * the memory limit.
  */
 static void
-ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
+	ReorderBufferTXN *txn;
+
+	/* bail out if we haven't exceeded the memory limit */
+	if (rb->size < logical_decoding_work_mem * 1024L)
+		return;
+
 	/*
-	 * TODO: improve accounting so we cheaply can take subtransactions into
-	 * account here.
+	 * Pick the largest transaction (or subtransaction) and evict it from
+	 * memory by serializing it to disk.
 	 */
-	if (txn->nentries_mem >= max_changes_in_memory)
-	{
-		ReorderBufferSerializeTXN(rb, txn);
-		Assert(txn->nentries_mem == 0);
-	}
+	txn = ReorderBufferLargestTXN(rb);
+
+	ReorderBufferSerializeTXN(rb, txn);
+
+	/*
+	 * After eviction, the transaction should have no entries in memory, and
+	 * should use 0 bytes for changes.
+	 */
+	Assert(txn->size == 0);
+	Assert(txn->nentries_mem == 0);
+
+	/*
+	 * And furthermore, evicting the transaction should get us below the
+	 * memory limit again - it is not possible that we're still exceeding the
+	 * memory limit after evicting the transaction.
+	 *
+	 * This follows from the simple fact that the selected transaction is at
+	 * least as large as the most recent change (which caused us to go over
+	 * the memory limit). So by evicting it we're definitely back below the
+	 * memory limit.
+	 */
+	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
 /*
@@ -2513,6 +2670,84 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 }
 
 /*
+ * Size of a change in memory.
+ */
+static Size
+ReorderBufferChangeSize(ReorderBufferChange *change)
+{
+	Size		sz = sizeof(ReorderBufferChange);
+
+	switch (change->action)
+	{
+			/* fall through these, they're all similar enough */
+		case REORDER_BUFFER_CHANGE_INSERT:
+		case REORDER_BUFFER_CHANGE_UPDATE:
+		case REORDER_BUFFER_CHANGE_DELETE:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+			{
+				ReorderBufferTupleBuf *oldtup,
+						   *newtup;
+				Size		oldlen = 0;
+				Size		newlen = 0;
+
+				oldtup = change->data.tp.oldtuple;
+				newtup = change->data.tp.newtuple;
+
+				if (oldtup)
+				{
+					sz += sizeof(HeapTupleData);
+					oldlen = oldtup->tuple.t_len;
+					sz += oldlen;
+				}
+
+				if (newtup)
+				{
+					sz += sizeof(HeapTupleData);
+					newlen = newtup->tuple.t_len;
+					sz += newlen;
+				}
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_MESSAGE:
+			{
+				Size		prefix_size = strlen(change->data.msg.prefix) + 1;
+
+				sz += prefix_size + change->data.msg.message_size +
+					sizeof(Size) + sizeof(Size);
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+			{
+				Snapshot	snap;
+
+				snap = change->data.snapshot;
+
+				sz += sizeof(SnapshotData) +
+					sizeof(TransactionId) * snap->xcnt +
+					sizeof(TransactionId) * snap->subxcnt;
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			{
+				sz += sizeof(Oid) * change->data.truncate.nrelids;
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+			/* ReorderBufferChange contains everything important */
+			break;
+	}
+
+	return sz;
+}
+
+
+/*
  * Restore a number of changes spilled to disk back into memory.
  */
 static Size
@@ -2784,6 +3019,16 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries_mem++;
+
+	/*
+	 * Update memory accounting for the restored change.  We need to do this
+	 * although we don't check the memory limit when restoring the changes in
+	 * this branch (we only do that when initially queueing the changes after
+	 * decoding), because we will release the changes later, and that will
+	 * update the accounting too (subtracting the size from the counters).
+	 * And we don't want to underflow there.
+	 */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
 }
 
 /*
@@ -3003,6 +3248,19 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
  *
  * We cannot replace unchanged toast tuples though, so those will still point
  * to on-disk toast data.
+ *
+ * While updating the existing change with detoasted tuple data, we need to
+ * update the memory accounting info, because the change size will differ.
+ * Otherwise the accounting may get out of sync, triggering serialization
+ * at unexpected times.
+ *
+ * We simply subtract the size of the change before rejiggering the tuple,
+ * and then add the new size afterwards. This makes it look like the change
+ * was removed and then re-added, except it only tweaks the accounting info.
+ *
+ * In particular it can't trigger serialization, which would be pointless
+ * anyway as it happens during commit processing right before handing
+ * the change to the output plugin.
  */
 static void
 ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	if (txn->toast_hash == NULL)
 		return;
 
+	/*
+	 * We're going to modify the size of the change, so to make sure the
+	 * accounting is correct we'll make it look like we're removing the
+	 * change now (with the old size), and then re-add it at the end.
+	 */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
 	oldcontext = MemoryContextSwitchTo(rb->context);
 
 	/* we should only have toast tuples in an INSERT or UPDATE */
@@ -3172,6 +3437,9 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	pfree(isnull);
 
 	MemoryContextSwitchTo(oldcontext);
+
+	/* now add the change back, with the correct size */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
 }
 
 /*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 11e6331..f737afb 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1725,6 +1725,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c08757..317c5d4 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -21,6 +21,7 @@
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
 
+#include "utils/guc.h"
 #include "utils/inval.h"
 #include "utils/int8.h"
 #include "utils/memutils.h"
@@ -90,11 +91,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -140,6 +142,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -174,7 +199,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2178e1c..5d7e687 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -65,6 +65,7 @@
 #include "postmaster/postmaster.h"
 #include "postmaster/syslogger.h"
 #include "postmaster/walwriter.h"
+#include "replication/logical.h"
 #include "replication/logicallauncher.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
@@ -191,6 +192,7 @@ static bool check_maxconnections(int *newval, void **extra, GucSource source);
 static bool check_max_worker_processes(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource source);
 static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
+static bool check_logical_decoding_work_mem(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
@@ -2251,6 +2253,18 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
+			gettext_noop("Sets the maximum memory to be used for logical decoding."),
+			gettext_noop("This much memory can be used by each internal "
+						 "reorder buffer before spilling to disk or streaming."),
+			GUC_UNIT_KB
+		},
+		&logical_decoding_work_mem,
+		-1, -1, MAX_KILOBYTES,
+		check_logical_decoding_work_mem, NULL, NULL
+	},
+
 	/*
 	 * We use the hopefully-safely-small value of 100kB as the compiled-in
 	 * default for max_stack_depth.  InitializeGUCOptions will increase it if
@@ -11286,6 +11300,28 @@ check_max_wal_senders(int *newval, void **extra, GucSource source)
 }
 
 static bool
+check_logical_decoding_work_mem(int *newval, void **extra, GucSource source)
+{
+	/*
+	 * -1 indicates fallback.
+	 *
+	 * If we haven't yet changed the boot_val default of -1, just let it be.
+	 * Logical decoding will use maintenance_work_mem instead.
+	 */
+	if (*newval == -1)
+		return true;
+
+	/*
+	 * We clamp manually-set values to at least 64kB. maintenance_work_mem
+	 * uses a higher minimum value (1MB), so this is OK.
+	 */
+	if (*newval < 64)
+		*newval = 64;
+
+	return true;
+}
+
+static bool
 check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
 {
 	/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0fc23e3..00a22b8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -130,6 +130,7 @@
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
+#logical_decoding_work_mem = 64MB	# min 64kB, or -1 to use maintenance_work_mem
 #max_stack_depth = 2MB			# min 100kB
 #shared_memory_type = mmap		# the default is the first option
 					# supported by the operating system:
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3cb13d8..10ea113 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4c06a78..4dcef80 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -17,6 +17,8 @@
 #include "utils/snapshot.h"
 #include "utils/timestamp.h"
 
+extern PGDLLIMPORT	int	logical_decoding_work_mem;
+
 /* an individual tuple, stored in one chunk of memory */
 typedef struct ReorderBufferTupleBuf
 {
@@ -63,6 +65,9 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
+/* forward declaration */
+struct ReorderBufferTXN;
+
 /*
  * a single 'change', can be an insert (with one tuple), an update (old, new),
  * or a delete (old).
@@ -77,6 +82,9 @@ typedef struct ReorderBufferChange
 	/* The type of change. */
 	enum ReorderBufferChangeType action;
 
+	/* Transaction this change belongs to. */
+	struct ReorderBufferTXN *txn;
+
 	RepOriginId origin_id;
 
 	/*
@@ -286,6 +294,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * Size of this transaction (changes currently in memory, in bytes).
+	 */
+	Size		size;
+
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -386,6 +399,9 @@ struct ReorderBuffer
 	/* buffer for disk<->memory conversions */
 	char	   *outbuf;
 	Size		outbufsize;
+
+	/* memory accounting */
+	Size		size;
 };
 
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e12a934..4e68a69 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -162,6 +162,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
1.8.3.1

0002-Immediately-WAL-log-assignments.patchapplication/octet-stream; name=0002-Immediately-WAL-log-assignments.patchDownload
From f70b1e79f8dcb23f1acffd5a29758e7952e61d9c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:07:31 +0200
Subject: [PATCH 02/13] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So instead we write the assignment info into WAL immediately, as
part of the next WAL record (to minimize overhead).
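
As a rough illustration (table name is a placeholder), the case this
addresses is a subtransaction writing WAL before commit:

    BEGIN;
    INSERT INTO t VALUES (1);   -- toplevel xact acquires an XID
    SAVEPOINT s1;
    INSERT INTO t VALUES (2);   -- subxact XID; the assignment to the
                                -- toplevel XID now rides along with
                                -- this WAL record
    COMMIT;

Previously the subxact-to-toplevel mapping might not show up in WAL
until an XLOG_XACT_ASSIGNMENT record or the commit record itself.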
---
 src/backend/access/rmgrdesc/xactdesc.c   |  26 ------
 src/backend/access/transam/xact.c        | 152 +++++++++----------------------
 src/backend/access/transam/xlog.c        |   2 -
 src/backend/access/transam/xloginsert.c  |  22 ++++-
 src/backend/access/transam/xlogreader.c  |   5 +
 src/backend/replication/logical/decode.c |  39 ++++----
 src/include/access/xact.h                |  15 +--
 src/include/access/xlog.h                |   2 +
 src/include/access/xlogreader.h          |   3 +
 src/include/access/xlogrecord.h          |   1 +
 src/tools/pgindent/typedefs.list         |   1 -
 11 files changed, 96 insertions(+), 172 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index a61f38d..66fc8fb 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -293,17 +293,6 @@ xact_desc_abort(StringInfo buf, uint8 info, xl_xact_abort *xlrec)
 	}
 }
 
-static void
-xact_desc_assignment(StringInfo buf, xl_xact_assignment *xlrec)
-{
-	int			i;
-
-	appendStringInfoString(buf, "subxacts:");
-
-	for (i = 0; i < xlrec->nsubxacts; i++)
-		appendStringInfo(buf, " %u", xlrec->xsub[i]);
-}
-
 void
 xact_desc(StringInfo buf, XLogReaderState *record)
 {
@@ -323,18 +312,6 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 
 		xact_desc_abort(buf, XLogRecGetInfo(record), xlrec);
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) rec;
-
-		/*
-		 * Note that we ignore the WAL record's xid, since we're more
-		 * interested in the top-level xid that issued the record and which
-		 * xids are being reported here.
-		 */
-		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
-		xact_desc_assignment(buf, xlrec);
-	}
 }
 
 const char *
@@ -359,9 +336,6 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ABORT_PREPARED:
 			id = "ABORT_PREPARED";
 			break;
-		case XLOG_XACT_ASSIGNMENT:
-			id = "ASSIGNMENT";
-			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9162286..33141fb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -188,9 +188,9 @@ typedef struct TransactionStateData
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;	/* entry-time xact r/o state */
 	bool		startedInRecovery;	/* did we start in recovery? */
-	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -225,13 +225,6 @@ static TransactionStateData TopTransactionStateData = {
 	.blockState = TBLOCK_DEFAULT,
 };
 
-/*
- * unreportedXids holds XIDs of all subtransactions that have not yet been
- * reported in an XLOG_XACT_ASSIGNMENT record.
- */
-static int	nUnreportedXids;
-static TransactionId unreportedXids[PGPROC_MAX_CACHED_SUBXIDS];
-
 static TransactionState CurrentTransactionState = &TopTransactionStateData;
 
 /*
@@ -502,19 +495,6 @@ GetCurrentFullTransactionIdIfAny(void)
 }
 
 /*
- *	MarkCurrentTransactionIdLoggedIfAny
- *
- * Remember that the current xid - if it is assigned - now has been wal logged.
- */
-void
-MarkCurrentTransactionIdLoggedIfAny(void)
-{
-	if (FullTransactionIdIsValid(CurrentTransactionState->fullTransactionId))
-		CurrentTransactionState->didLogXid = true;
-}
-
-
-/*
  *	GetStableLatestTransactionId
  *
  * Get the transaction's XID if it has one, else read the next-to-be-assigned
@@ -555,7 +535,6 @@ AssignTransactionId(TransactionState s)
 {
 	bool		isSubXact = (s->parent != NULL);
 	ResourceOwner currentOwner;
-	bool		log_unknown_top = false;
 
 	/* Assert that caller didn't screw up */
 	Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -598,20 +577,6 @@ AssignTransactionId(TransactionState s)
 	}
 
 	/*
-	 * When wal_level=logical, guarantee that a subtransaction's xid can only
-	 * be seen in the WAL stream if its toplevel xid has been logged before.
-	 * If necessary we log an xact_assignment record with fewer than
-	 * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
-	 * for a transaction even though it appears in a WAL record, we just might
-	 * superfluously log something. That can happen when an xid is included
-	 * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
-	 * xl_standby_locks.
-	 */
-	if (isSubXact && XLogLogicalInfoActive() &&
-		!TopTransactionStateData.didLogXid)
-		log_unknown_top = true;
-
-	/*
 	 * Generate a new FullTransactionId and record its xid in PG_PROC and
 	 * pg_subtrans.
 	 *
@@ -646,59 +611,6 @@ AssignTransactionId(TransactionState s)
 	XactLockTableInsert(XidFromFullTransactionId(s->fullTransactionId));
 
 	CurrentResourceOwner = currentOwner;
-
-	/*
-	 * Every PGPROC_MAX_CACHED_SUBXIDS assigned transaction ids within each
-	 * top-level transaction we issue a WAL record for the assignment. We
-	 * include the top-level xid and all the subxids that have not yet been
-	 * reported using XLOG_XACT_ASSIGNMENT records.
-	 *
-	 * This is required to limit the amount of shared memory required in a hot
-	 * standby server to keep track of in-progress XIDs. See notes for
-	 * RecordKnownAssignedTransactionIds().
-	 *
-	 * We don't keep track of the immediate parent of each subxid, only the
-	 * top-level transaction that each subxact belongs to. This is correct in
-	 * recovery only because aborted subtransactions are separately WAL
-	 * logged.
-	 *
-	 * This is correct even for the case where several levels above us didn't
-	 * have an xid assigned as we recursed up to them beforehand.
-	 */
-	if (isSubXact && XLogStandbyInfoActive())
-	{
-		unreportedXids[nUnreportedXids] = XidFromFullTransactionId(s->fullTransactionId);
-		nUnreportedXids++;
-
-		/*
-		 * ensure this test matches similar one in
-		 * RecoverPreparedTransactions()
-		 */
-		if (nUnreportedXids >= PGPROC_MAX_CACHED_SUBXIDS ||
-			log_unknown_top)
-		{
-			xl_xact_assignment xlrec;
-
-			/*
-			 * xtop is always set by now because we recurse up transaction
-			 * stack to the highest unassigned xid and then come back down
-			 */
-			xlrec.xtop = GetTopTransactionId();
-			Assert(TransactionIdIsValid(xlrec.xtop));
-			xlrec.nsubxacts = nUnreportedXids;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, MinSizeOfXactAssignment);
-			XLogRegisterData((char *) unreportedXids,
-							 nUnreportedXids * sizeof(TransactionId));
-
-			(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT);
-
-			nUnreportedXids = 0;
-			/* mark top, not current xact as having been logged */
-			TopTransactionStateData.didLogXid = true;
-		}
-	}
 }
 
 /*
@@ -1792,13 +1704,6 @@ AtSubAbort_childXids(void)
 	s->childXids = NULL;
 	s->nChildXids = 0;
 	s->maxChildXids = 0;
-
-	/*
-	 * We could prune the unreportedXids array here. But we don't bother. That
-	 * would potentially reduce number of XLOG_XACT_ASSIGNMENT records but it
-	 * would likely introduce more CPU time into the more common paths, so we
-	 * choose not to do that.
-	 */
 }
 
 /* ----------------------------------------------------------------
@@ -1963,12 +1868,6 @@ StartTransaction(void)
 	currentCommandIdUsed = false;
 
 	/*
-	 * initialize reported xid accounting
-	 */
-	nUnreportedXids = 0;
-	s->didLogXid = false;
-
-	/*
 	 * must initialize resource-management stuff first
 	 */
 	AtStart_Memory();
@@ -5095,6 +4994,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -5990,14 +5890,46 @@ xact_redo(XLogReaderState *record)
 					   XLogRecGetOrigin(record));
 		LWLockRelease(TwoPhaseStateLock);
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record);
-
-		if (standbyState >= STANDBY_INITIALIZED)
-			ProcArrayApplyXidAssignment(xlrec->xtop,
-										xlrec->nsubxacts, xlrec->xsub);
-	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 790e2c8..b1daa05 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1121,8 +1121,6 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	WALInsertLockRelease();
 
-	MarkCurrentTransactionIdLoggedIfAny();
-
 	END_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3ec67d4..15ce79c 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c8b0d23..3b02fbf 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1072,6 +1072,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1110,6 +1111,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c53e7e2..ff74c65 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -96,12 +96,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 record->toplevel_xid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -220,12 +236,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -266,23 +282,6 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid);
 				break;
 			}
-		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index d714551..7553f84 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -145,7 +145,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_ABORT				0x20
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
-#define XLOG_XACT_ASSIGNMENT		0x50
+/* free opcode 0x50 */
 /* free opcode 0x60 */
 /* free opcode 0x70 */
 
@@ -188,15 +188,6 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XactCompletionForceSyncCommit(xinfo) \
 	((xinfo & XACT_COMPLETION_FORCE_SYNC_COMMIT) != 0)
 
-typedef struct xl_xact_assignment
-{
-	TransactionId xtop;			/* assigned XID's top-level XID */
-	int			nsubxacts;		/* number of subtransaction XIDs */
-	TransactionId xsub[FLEXIBLE_ARRAY_MEMBER];	/* assigned subxids */
-} xl_xact_assignment;
-
-#define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
-
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -363,7 +354,6 @@ extern FullTransactionId GetTopFullTransactionId(void);
 extern FullTransactionId GetTopFullTransactionIdIfAny(void);
 extern FullTransactionId GetCurrentFullTransactionId(void);
 extern FullTransactionId GetCurrentFullTransactionIdIfAny(void);
-extern void MarkCurrentTransactionIdLoggedIfAny(void);
 extern bool SubTransactionIsActive(SubTransactionId subxid);
 extern CommandId GetCurrentCommandId(bool used);
 extern void SetParallelStartTimestamps(TimestampTz xact_ts, TimestampTz stmt_ts);
@@ -410,6 +400,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d519252..060901d 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,8 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
+#define XLOG_INCLUDE_INVALS		0x08	/* include invalidations */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 1bbee38..c37a83d 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -148,6 +148,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -243,6 +245,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 9375e54..bcfba0a 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 60c76cb..d08c08a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3423,7 +3423,6 @@ xl_standby_locks
 xl_tblspc_create_rec
 xl_tblspc_drop_rec
 xl_xact_abort
-xl_xact_assignment
 xl_xact_commit
 xl_xact_dbinfo
 xl_xact_invals
-- 
1.8.3.1

0003-Issue-individual-invalidations-with-wal_level-logica.patchapplication/octet-stream; name=0003-Issue-individual-invalidations-with-wal_level-logica.patchDownload
From 2a2ac4be202b9226a12934ef7097764d6d8cb638 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:20:53 +0200
Subject: [PATCH 03/13] Issue individual invalidations with wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager.
See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in memory
and writes them out only once, at commit time, which may reduce the
performance impact by amortizing the overhead and deduplicating the
invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c          | 52 ++++++++++++++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 17 ++++++
 src/backend/replication/logical/reorderbuffer.c | 52 +++++++++++++++-
 src/backend/utils/cache/inval.c                 | 81 +++++++++++++++++++++++++
 src/include/access/xact.h                       | 18 +++++-
 src/include/replication/reorderbuffer.h         | 14 +++++
 7 files changed, 238 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 66fc8fb..9cff9f0 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,11 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -312,6 +317,14 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 
 		xact_desc_abort(buf, XLogRecGetInfo(record), xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs,
+								xlrec->dbId, xlrec->tsId,
+								xlrec->relcacheInitFileInval);
+	}
 }
 
 const char *
@@ -336,7 +349,46 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ABORT_PREPARED:
 			id = "ABORT_PREPARED";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval)
+{
+	int			i;
+
+	if (relcacheInitFileInval)
+		appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
+						 dbId, tsId);
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+			appendStringInfo(buf, " snapshot %u", msg->sn.relId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 33141fb..dc3633c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5890,6 +5890,13 @@ xact_redo(XLogReaderState *record)
 					   XLogRecGetOrigin(record));
 		LWLockRelease(TwoPhaseStateLock);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX We ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index ff74c65..c100054 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -282,6 +282,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/* XXX for now we're issuing invalidations one by one */
+				Assert(invals->nmsgs == 1);
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->dbId, invals->tsId,
+											 invals->relcacheInitFileInval,
+											 invals->msgs[0]);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 6228140..08b4d4f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -460,6 +460,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -1811,6 +1812,18 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2204,6 +2217,38 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn,
+							 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+							 SharedInvalidationMessage msg)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	Assert(xid != InvalidTransactionId);
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.dbId = dbId;
+	change->data.inval.tsId = tsId;
+	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
+	change->data.inval.msg = msg;
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2643,6 +2688,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -2739,6 +2785,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -3014,6 +3061,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -3025,8 +3073,8 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	 * although we don't check the memory limit when restoring the changes in
 	 * this branch (we only do that when initially queueing the changes after
 	 * decoding), because we will release the changes later, and that will
-	 * update the accounting too (subtracting the size from the counters).
-	 * And we don't want to underflow there.
+	 * update the accounting too (subtracting the size from the counters). And
+	 * we don't want to underflow there.
 	 */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 }
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index f09e3a9..f921fdf 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -104,6 +104,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +211,9 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -489,6 +493,18 @@ RegisterCatcacheInvalidation(int cacheId,
 {
 	AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   cacheId, hashValue, dbId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cc.id = (int8) cacheId;
+		msg.cc.dbId = dbId;
+		msg.cc.hashValue = hashValue;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -501,6 +517,18 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 {
 	AddCatalogInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								  dbId, catId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cat.id = SHAREDINVALCATALOG_ID;
+		msg.cat.dbId = dbId;
+		msg.cat.catId = catId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -511,6 +539,8 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 static void
 RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 {
+	bool		relcacheInitFileInval = false;
+
 	AddRelcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
 
@@ -529,7 +559,22 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 	 * as well.  Also zap when we are invalidating whole relcache.
 	 */
 	if (relId == InvalidOid || RelationIdIsInInitFile(relId))
+	{
 		transInvalInfo->RelcacheInitFileInval = true;
+		relcacheInitFileInval = true;
+	}
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.rc.id = SHAREDINVALRELCACHE_ID;
+		msg.rc.dbId = dbId;
+		msg.rc.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, relcacheInitFileInval);
+	}
 }
 
 /*
@@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId)
 {
 	AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.sn.id = SHAREDINVALSNAPSHOT_ID;
+		msg.sn.dbId = dbId;
+		msg.sn.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -1501,3 +1558,27 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval)
+{
+	xl_xact_invalidations xlrec;
+
+	/* prepare record */
+	memset(&xlrec, 0, sizeof(xlrec));
+	xlrec.dbId = MyDatabaseId;
+	xlrec.tsId = MyDatabaseTableSpace;
+	xlrec.relcacheInitFileInval = relcacheInitFileInval;
+	xlrec.nmsgs = nmsgs;
+
+	/* perform insertion */
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+	XLogRegisterData((char *) msgs,
+					 nmsgs * sizeof(SharedInvalidationMessage));
+	XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7553f84..b26d399 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -145,7 +145,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_ABORT				0x20
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
-/* free opcode 0x50 */
+#define XLOG_XACT_INVALIDATIONS		0x50
 /* free opcode 0x60 */
 /* free opcode 0x70 */
 
@@ -189,6 +189,22 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 	((xinfo & XACT_COMPLETION_FORCE_SYNC_COMMIT) != 0)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ *
+ * XXX Currently nmsgs=1 but that might change in the future.
+ */
+typedef struct xl_xact_invalidations
+{
+	Oid			dbId;			/* MyDatabaseId */
+	Oid			tsId;			/* MyDatabaseTableSpace */
+	bool		relcacheInitFileInval;	/* invalidate relcache init file */
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4dcef80..82dcb7f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,16 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			Oid			dbId;	/* MyDatabaseId */
+			Oid			tsId;	/* MyDatabaseTableSpace */
+			bool		relcacheInitFileInval;	/* invalidate relcache init
+												 * file */
+			SharedInvalidationMessage msg;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -437,6 +448,9 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void		ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+										 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+										 SharedInvalidationMessage msg);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1

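To illustrate what the new WAL record buys us, consider a transaction that
mixes DDL and DML. A sketch (the table mirrors the one used in the TAP tests
later in the series; this is not part of the patch):

    BEGIN;
    INSERT INTO test_tab SELECT i, md5(i::text)
      FROM generate_series(1, 1000) s(i);
    -- The ALTER invalidates cached catalog state for test_tab. With
    -- wal_level = logical, the patch now emits an XLOG_XACT_INVALIDATIONS
    -- record right away, so a decoder working through this in-progress
    -- transaction can rebuild the relation descriptor before decoding
    -- the inserts below, instead of waiting for the commit record.
    ALTER TABLE test_tab ADD COLUMN c int;
    INSERT INTO test_tab SELECT i, md5(i::text), i
      FROM generate_series(1001, 2000) s(i);
    COMMIT;
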
Attachment: 0010-Track-statistics-for-streaming-spilling.patch (application/octet-stream)
From 5404d4ad25d52190c7297642014e695277aee0a1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:01:30 +0200
Subject: [PATCH 10/13] Track statistics for streaming/spilling

---
 doc/src/sgml/monitoring.sgml                    | 47 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  8 +++-
 src/backend/replication/logical/reorderbuffer.c | 21 +++++++++
 src/backend/replication/walsender.c             | 62 ++++++++++++++++++++++++-
 src/include/catalog/pg_proc.dat                 |  6 +--
 src/include/replication/reorderbuffer.h         | 16 +++++++
 src/include/replication/walsender_private.h     | 10 ++++
 src/test/regress/expected/rules.out             | 10 +++-
 8 files changed, 172 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 828e908..8de4fbd 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2121,6 +2121,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       with security-sensitive fields obfuscated.
      </entry>
     </row>
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_work_mem</literal>. The
+      counter is incremented for both toplevel transactions and
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times transactions were spilled to disk. Transactions
+      may be spilled repeatedly, and this counter is incremented on every
+      such occasion.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded transaction data spilled to disk.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber
+      after the memory used by logical decoding exceeds
+      <literal>logical_work_mem</literal>. Streaming only works with toplevel
+      transactions (subtransactions cannot be streamed independently), so the
+      counter is not incremented for subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the
+      subscriber. Transactions may be streamed repeatedly, and this counter
+      is incremented on every such occasion.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 9fe4a47..f8d0c4a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -776,7 +776,13 @@ CREATE VIEW pg_stat_replication AS
             W.replay_lag,
             W.sync_priority,
             W.sync_state,
-            W.reply_time
+            W.reply_time,
+            W.spill_txns,
+            W.spill_count,
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 963cbf3..5e91522 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -354,6 +354,14 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	buffer->spillCount = 0;
+	buffer->spillTxns = 0;
+	buffer->spillBytes = 0;
+
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3001,6 +3009,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	int			fd = -1;
 	XLogSegNo	curOpenSegNo = 0;
 	Size		spilled = 0;
+	Size		size = txn->size;
 
 	elog(DEBUG2, "spill %u changes in XID %u to disk",
 		 (uint32) txn->nentries_mem, txn->xid);
@@ -3059,6 +3068,11 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		spilled++;
 	}
 
+	/* update the statistics (count each transaction only on its first spill) */
+	rb->spillCount += 1;
+	rb->spillTxns += (rbtxn_is_serialized(txn)) ? 0 : 1;
+	rb->spillBytes += size;
+
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
@@ -3763,6 +3777,13 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	PG_END_TRY();
 
 	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 1 : 0;
+	rb->streamBytes += txn->size;
+
+	/*
 	 * Discard the changes that we just streamed, and mark the transactions
 	 * as streamed (if they contained changes).
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1c0c020..14abd85 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -248,6 +248,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
+static void UpdateSpillStats(LogicalDecodingContext *ctx);
 static void XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count);
 
 
@@ -1267,7 +1268,8 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
 /*
  * LogicalDecodingContext 'update_progress' callback.
  *
- * Write the current position to the lag tracker (see XLogSendPhysical).
+ * Write the current position to the lag tracker (see XLogSendPhysical),
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1286,6 +1288,12 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 
 	LagTrackerWrite(lsn, now);
 	sendTime = now;
+
+	/*
+	 * Update statistics about transactions that were spilled to disk or
+	 * streamed to the subscriber (before being committed).
+	 */
+	UpdateSpillStats(ctx);
 }
 
 /*
@@ -2325,6 +2333,12 @@ InitWalSenderSlot(void)
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			walsnd->replyTime = 0;
+			walsnd->spillTxns = 0;
+			walsnd->spillCount = 0;
+			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3236,7 +3250,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	12
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3291,6 +3305,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int			pid;
 		WalSndState state;
 		TimestampTz replyTime;
+		int64		spillTxns;
+		int64		spillCount;
+		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3311,6 +3331,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		replyTime = walsnd->replyTime;
+		spillTxns = walsnd->spillTxns;
+		spillCount = walsnd->spillCount;
+		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3392,6 +3418,16 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[11] = true;
 			else
 				values[11] = TimestampTzGetDatum(replyTime);
+
+			/* spill to disk */
+			values[12] = Int64GetDatum(spillTxns);
+			values[13] = Int64GetDatum(spillCount);
+			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3628,3 +3664,25 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+static void
+UpdateSpillStats(LogicalDecodingContext *ctx)
+{
+	ReorderBuffer *rb = ctx->reorder;
+
+	SpinLockAcquire(&MyWalSnd->mutex);
+
+	MyWalSnd->spillTxns = rb->spillTxns;
+	MyWalSnd->spillCount = rb->spillCount;
+	MyWalSnd->spillBytes = rb->spillBytes;
+
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG1, "UpdateSpillStats: updating stats %p %ld %ld %ld %ld %ld %ld",
+		 rb, rb->spillTxns, rb->spillCount, rb->spillBytes,
+			 rb->streamTxns, rb->streamCount, rb->streamBytes);
+
+	SpinLockRelease(&MyWalSnd->mutex);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 58ea5b9..9a508bf 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5166,9 +5166,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index f39b34f..53d9440 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -509,6 +509,22 @@ struct ReorderBuffer
 
 	/* memory accounting */
 	Size		size;
+
+	/*
+	 * Statistics about transactions streamed or spilled to disk.
+	 *
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
+	 */
+	int64	spillCount;		/* spill-to-disk invocation counter */
+	int64	spillTxns;		/* number of transactions spilled to disk */
+	int64	spillBytes;		/* amount of data spilled to disk */
+	int64	streamCount;	/* streaming invocation counter */
+	int64	streamTxns;		/* number of transactions streamed to subscriber */
+	int64	streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0dd6d1c..f726f25 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -80,6 +80,16 @@ typedef struct WalSnd
 	 * Timestamp of the last message received from standby.
 	 */
 	TimestampTz replyTime;
+
+	/* Statistics for transactions spilled to disk. */
+	int64		spillTxns;
+	int64		spillCount;
+	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 210e9cd..836abf0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1951,9 +1951,15 @@ pg_stat_replication| SELECT s.pid,
     w.replay_lag,
     w.sync_priority,
     w.sync_state,
-    w.reply_time
+    w.reply_time,
+    w.spill_txns,
+    w.spill_count,
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1

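To see the new counters in action, a query along these lines should do
(column names as added by the patch; pg_size_pretty is just for readability):

    SELECT application_name,
           spill_txns, spill_count,
           pg_size_pretty(spill_bytes) AS spilled,
           stream_txns, stream_count,
           pg_size_pretty(stream_bytes) AS streamed
      FROM pg_stat_replication;
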
Attachment: 0009-Extend-the-concurrent-abort-handling-for-in-progress.patch (application/octet-stream)
From 2a0712fb830a6be20296d18fb821862e302aeb8b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 3 Oct 2019 09:04:09 +0530
Subject: [PATCH 09/13] Extend the concurrent abort handling for in-progress
 transaction

---
 src/backend/replication/logical/reorderbuffer.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c2d4677..963cbf3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3280,6 +3280,7 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	volatile CommandId command_id;
 	bool		using_subtxn;
 	Size		streamed = 0;
+	MemoryContext ccxt = CurrentMemoryContext;
 	ReorderBufferStreamIterTXNState *volatile iterstate = NULL;
 
 	/*
@@ -3410,6 +3411,13 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			/* we're going to stream this change */
 			streamed++;
 
+			/*
+			 * Set the CheckXidAlive to the current (sub)xid for which this
+			 * change belongs to so that we can detect the abort while we are
+			 * decoding.
+			 */
+			CheckXidAlive = change->txn->xid;
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -3722,6 +3730,9 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferStreamIterTXNFinish(rb, iterstate);
@@ -3740,7 +3751,14 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		PG_RE_THROW();
+		/* re-throw only if it's not an abort */
+		if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+		else
+			FlushErrorState();
 	}
 	PG_END_TRY();
 
-- 
1.8.3.1

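For illustration, the scenario this patch handles looks roughly like this
(assuming logical_decoding_work_mem is set low enough on the publisher that
the transaction starts being streamed while still in progress; table name
borrowed from the TAP tests):

    BEGIN;
    INSERT INTO test_tab SELECT i, md5(i::text)
      FROM generate_series(1, 100000) s(i);
    -- If the walsender is still decoding these changes when the abort
    -- arrives, catalog access on behalf of the aborted (sub)transaction
    -- fails with ERRCODE_TRANSACTION_ROLLBACK; the PG_CATCH block above
    -- swallows exactly that error and stops streaming the transaction,
    -- rather than re-throwing it.
    ROLLBACK;
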
Attachment: 0011-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From 0bc62e1df19e19198d20330f54471abfe9618abd Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:16 +0200
Subject: [PATCH 11/13] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 4 ++--
 src/test/subscription/t/010_stream_subxact.pl           | 4 ++--
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 4 ++--
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 4 ++--
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 6 +++---
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 40e306a..f41a0e1 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -64,7 +64,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 81547f6..8dfeafc 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 3ad00ea..78263c7 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR TABLE test_tab");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 4d01f7e..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -17,7 +17,7 @@ sub wait_for_caught_up
 # Create publisher node
 my $node_publisher = get_new_node('publisher');
 $node_publisher->init(allows_streaming => 'logical');
-$node_publisher->append_conf('postgresql.conf', 'logical_work_mem = 64kB');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
 $node_publisher->start;
 
 # Create subscriber node
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index 1a8b8ff..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -17,7 +17,7 @@ sub wait_for_caught_up
 # Create publisher node
 my $node_publisher = get_new_node('publisher');
 $node_publisher->init(allows_streaming => 'logical');
-$node_publisher->append_conf('postgresql.conf', 'logical_work_mem = 64kB');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
 $node_publisher->start;
 
 # Create subscriber node
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 04af090..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -17,7 +17,7 @@ sub wait_for_caught_up
 # Create publisher node
 my $node_publisher = get_new_node('publisher');
 $node_publisher->init(allows_streaming => 'logical');
-$node_publisher->append_conf('postgresql.conf', 'logical_work_mem = 64kB');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
 $node_publisher->start;
 
 # Create subscriber node
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 6fecfe6..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -17,7 +17,7 @@ sub wait_for_caught_up
 # Create publisher node
 my $node_publisher = get_new_node('publisher');
 $node_publisher->init(allows_streaming => 'logical');
-$node_publisher->append_conf('postgresql.conf', 'logical_work_mem = 64kB');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
 $node_publisher->start;
 
 # Create subscriber node
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index 50990c1..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -1,4 +1,4 @@
-# Test behavior with streaming transaction exceeding logical_work_mem
+# Test behavior with streaming transaction exceeding logical_decoding_work_mem
 use strict;
 use warnings;
 use PostgresNode;
@@ -17,7 +17,7 @@ sub wait_for_caught_up
 # Create publisher node
 my $node_publisher = get_new_node('publisher');
 $node_publisher->init(allows_streaming => 'logical');
-$node_publisher->append_conf('postgresql.conf', 'logical_work_mem = 64kB');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
 $node_publisher->start;
 
 # Create subscriber node
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

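For anyone trying this out, enabling streaming on an existing setup boils
down to the following (connection string and object names are placeholders;
the 64kB limit mirrors what the TAP tests use):

    -- publisher: keep the decoding memory limit small, so that even
    -- modest transactions exceed it and get streamed
    ALTER SYSTEM SET logical_decoding_work_mem = '64kB';
    SELECT pg_reload_conf();

    -- subscriber: opt in to receiving in-progress transactions
    CREATE SUBSCRIPTION tap_sub
      CONNECTION 'host=publisher dbname=postgres'
      PUBLICATION tap_pub
      WITH (streaming = on);
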
Attachment: 0012-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch (application/octet-stream)
From 2fbbd9211397c5c79a25aa6b8160234d6fe4dfe6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH 12/13] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5e91522..29da34d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1544,6 +1544,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
+		/* make sure subtxn has final_lsn */
+		if (subtxn->final_lsn == InvalidXLogRecPtr)
+			subtxn->final_lsn = txn->final_lsn;
+
 		/*
 		 * Subtransactions are always associated to the toplevel TXN, even if
 		 * they originally were happening inside another subtxn, so we won't
-- 
1.8.3.1

Attachment: 0013-Add-TAP-test-for-streaming-vs.-DDL.patch (application/octet-stream)
From c3146318b04d5a6beb139c5c8120cdeee0a8a8b6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH 13/13] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: v3-0014-BGWorkers-pool-for-streamed-transactions-apply.patch (application/octet-stream)
From 9623e79695eb018f1e359e1a2b4b1be22a22d6a3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 3 Oct 2019 11:34:35 +0530
Subject: [PATCH] BGWorkers pool for streamed transactions apply

---
 src/backend/postmaster/bgworker.c        |    3 +
 src/backend/postmaster/pgstat.c          |    3 +
 src/backend/replication/logical/proto.c  |   17 +-
 src/backend/replication/logical/worker.c | 1782 ++++++++++++++++--------------
 src/include/pgstat.h                     |    1 +
 src/include/replication/logicalproto.h   |    6 +-
 src/include/replication/logicalworker.h  |    1 +
 7 files changed, 936 insertions(+), 877 deletions(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index b66b517..2e18083 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -129,6 +129,9 @@ static const struct
 	},
 	{
 		"ApplyWorkerMain", ApplyWorkerMain
+	},
+	{
+		"LogicalApplyBgwMain", LogicalApplyBgwMain
 	}
 };
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b22a053..46a75f2 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3810,6 +3810,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
 			event_name = "Hash/GrowBuckets/Reinserting";
 			break;
+		case WAIT_EVENT_LOGICAL_APPLY_WORKER_READY:
+			event_name = "LogicalApplyWorkerReady";
+			break;
 		case WAIT_EVENT_LOGICAL_SYNC_DATA:
 			event_name = "LogicalSyncData";
 			break;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 5a379fb..d615b1f 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -788,14 +788,11 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendint64(out, txn->commit_time);
 }
 
-TransactionId
+void
 logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
 {
-	TransactionId	xid;
 	uint8			flags;
 
-	xid = pq_getmsgint(in, 4);
-
 	/* read flags (unused for now) */
 	flags = pq_getmsgbyte(in);
 
@@ -806,8 +803,6 @@ logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	commit_data->commit_lsn = pq_getmsgint64(in);
 	commit_data->end_lsn = pq_getmsgint64(in);
 	commit_data->committime = pq_getmsgint64(in);
-
-	return xid;
 }
 
 void
@@ -822,13 +817,3 @@ logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 	pq_sendint32(out, xid);
 	pq_sendint32(out, subxid);
 }
-
-void
-logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
-							 TransactionId *subxid)
-{
-	Assert(xid && subxid);
-
-	*xid = pq_getmsgint(in, 4);
-	*subxid = pq_getmsgint(in, 4);
-}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3493b02..7458ac0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -81,11 +81,15 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/shm_mq.h"
+#include "storage/shm_toc.h"
+#include "storage/spin.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
@@ -101,6 +105,54 @@
 #include "utils/timeout.h"
 
 #define NAPTIME_PER_CYCLE 1000	/* max sleep time between cycles (1s) */
+#define PG_LOGICAL_APPLY_SHM_MAGIC 0x79fb2447	/* TODO: consider changing */
+
+typedef struct ParallelState
+{
+	slock_t	mutex;
+	// ConditionVariable cv;
+	bool	attached;
+	bool	ready;
+	bool	finished;
+	Oid		database_id;
+	Oid		authenticated_user_id;
+	Oid		subid;
+	Oid		stream_xid;
+	uint32	n;
+} ParallelState;
+
+typedef struct WorkerState
+{
+	TransactionId			 xid;
+	BackgroundWorkerHandle	*handle;
+	shm_mq_handle			*mq_handle;
+	dsm_segment				*dsm_seg;
+	ParallelState volatile	*pstate;
+} WorkerState;
+
+/* Apply workers hash table (initialized on first use) */
+static HTAB *ApplyWorkersHash = NULL;
+static WorkerState **ApplyWorkersIdleList = NULL;
+static uint32 pool_size = 10; /* MaxConnections default? */
+static uint32 nworkers = 0;
+static uint32 nfreeworkers = 0;
+
+/* Fields valid only for apply background workers */
+bool isLogicalApplyWorker = false;
+volatile ParallelState *MyParallelState = NULL;
+
+/* Worker setup and interactions */
+static void setup_dsm(WorkerState *wstate);
+static void setup_background_worker(WorkerState *wstate);
+static void cleanup_background_worker(dsm_segment *seg, Datum arg);
+static void handle_sigterm(SIGNAL_ARGS);
+
+static bool check_worker_status(WorkerState *wstate);
+static void wait_for_worker(WorkerState *wstate);
+static void wait_for_worker_to_finish(WorkerState *wstate);
+
+static WorkerState * find_or_start_worker(TransactionId xid, bool start);
+static void stop_worker(WorkerState *wstate);
 
 typedef struct FlushPosition
 {
@@ -129,47 +181,13 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
-/* fields valid only when processing streamed transaction */
+/* Fields valid only when processing streamed transaction */
 bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
-static int	stream_fd = -1;
-
-typedef struct SubXactInfo
-{
-	TransactionId xid;			/* XID of the subxact */
-	off_t		offset;			/* offset in the file */
-}			SubXactInfo;
-
-static uint32 nsubxacts = 0;
-static uint32 nsubxacts_max = 0;
-static SubXactInfo * subxacts = NULL;
-static TransactionId subxact_last = InvalidTransactionId;
-
-static void subxact_filename(char *path, Oid subid, TransactionId xid);
-static void changes_filename(char *path, Oid subid, TransactionId xid);
-
-/*
- * Information about subtransactions of a given toplevel transaction.
- */
-static void subxact_info_write(Oid subid, TransactionId xid);
-static void subxact_info_read(Oid subid, TransactionId xid);
-static void subxact_info_add(TransactionId xid);
-
-/*
- * Serialize and deserialize changes for a toplevel transaction.
- */
-static void stream_cleanup_files(Oid subid, TransactionId xid);
-static void stream_open_file(Oid subid, TransactionId xid, bool first);
-static void stream_write_change(char action, StringInfo s);
-static void stream_close_file(void);
-
-/*
- * Array of serialized XIDs.
- */
-static int	nxids = 0;
-static int	maxnxids = 0;
-static TransactionId	*xids = NULL;
+static TransactionId current_xid = InvalidTransactionId;
+static TransactionId prev_xid = InvalidTransactionId;
+static uint32 nchanges = 0;
 
 static bool handle_streamed_transaction(const char action, StringInfo s);
 
@@ -185,6 +203,16 @@ static volatile sig_atomic_t got_SIGHUP = false;
 /* prototype needed because of stream_commit */
 static void apply_dispatch(StringInfo s);
 
+// /* Debug only */
+// static void
+// iter_sleep(int seconds)
+// {
+// 	for (int i = 0; i < seconds; i++)
+// 	{
+// 		pg_usleep(1 * 1000L * 1000L);
+// 	}
+// }
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -237,6 +265,107 @@ ensure_transaction(void)
 }
 
 /*
+ * Look up the worker for the requested xid in ApplyWorkersHash. If it is
+ * not found, start a new one when start=true is passed; otherwise throw
+ * an error.
+ */
+static WorkerState *
+find_or_start_worker(TransactionId xid, bool start)
+{
+	bool found;
+	WorkerState *entry = NULL;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* First time through, initialize apply workers hashtable */
+	if (ApplyWorkersHash == NULL)
+	{
+		HASHCTL		ctl;
+
+		MemSet(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(TransactionId);
+		ctl.entrysize = sizeof(WorkerState);
+		ctl.hcxt = ApplyContext; /* Allocate ApplyWorkersHash in the ApplyContext */
+		ApplyWorkersHash = hash_create("logical apply workers hash", 8,
+									 &ctl,
+									 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	Assert(ApplyWorkersHash != NULL);
+
+	/*
+	 * Find entry for requested transaction.
+	 */
+	entry = hash_search(ApplyWorkersHash, &xid, HASH_FIND, &found);
+
+	if (!found && start)
+	{
+		/* If there is at least one worker in the idle list, then take one. */
+		if (nfreeworkers > 0)
+		{
+			char action = 'R';
+
+			Assert(ApplyWorkersIdleList != NULL);
+
+			entry = ApplyWorkersIdleList[nfreeworkers - 1];
+			if (!hash_update_hash_key(ApplyWorkersHash,
+									  (void *) entry,
+									  (void *) &xid))
+				elog(ERROR, "could not reassign apply worker #%u entry from xid %u to xid %u",
+													entry->pstate->n, entry->xid, xid);
+
+			entry->xid = xid;
+			entry->pstate->finished = false;
+			entry->pstate->stream_xid = xid;
+			shm_mq_send(entry->mq_handle, 1, &action, false);
+
+			ApplyWorkersIdleList[--nfreeworkers] = NULL;
+		}
+		else
+		{
+			/* No entry in hash and no idle workers. Create a new one. */
+			entry = hash_search(ApplyWorkersHash, &xid, HASH_ENTER, &found);
+			entry->xid = xid;
+			setup_background_worker(entry);
+
+			if (nworkers == pool_size)
+			{
+				ApplyWorkersIdleList = repalloc(ApplyWorkersIdleList,
+												(pool_size + 10) * sizeof(WorkerState *));
+				pool_size += 10;
+			}
+		}
+	}
+	else if (!found && !start)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				errmsg("could not find logical apply worker for xid %u", xid)));
+	else
+		elog(DEBUG5, "there is an existing logical apply worker for xid %u", xid);
+
+	Assert(entry != NULL);
+
+	return entry;
+}
+
+/*
+ * Gracefully teardown apply worker.
+ */
+static void
+stop_worker(WorkerState *wstate)
+{
+	/*
+	 * Send zero-length data to the worker in order to stop it.
+	 */
+	shm_mq_send(wstate->mq_handle, 0, NULL, false);
+
+	elog(LOG, "detaching DSM of apply worker #%u for xid %u",
+									wstate->pstate->n, wstate->xid);
+	dsm_detach(wstate->dsm_seg);
+
+	/* Delete worker entry */
+	(void) hash_search(ApplyWorkersHash, &wstate->xid, HASH_REMOVE, NULL);
+}
+
+/*
  * Handle streamed transactions.
  *
  * If in streaming mode (receiving a block of streamed transaction), we
@@ -248,12 +377,12 @@ static bool
 handle_streamed_transaction(const char action, StringInfo s)
 {
 	TransactionId xid;
+	WorkerState *entry;
 
 	/* not in streaming mode */
-	if (!in_streamed_transaction)
+	if (!in_streamed_transaction || isLogicalApplyWorker)
 		return false;
 
-	Assert(stream_fd != -1);
 	Assert(TransactionIdIsValid(stream_xid));
 
 	/*
@@ -264,11 +393,16 @@ handle_streamed_transaction(const char action, StringInfo s)
 
 	Assert(TransactionIdIsValid(xid));
 
-	/* Add the new subxact to the array (unless already there). */
-	subxact_info_add(xid);
+	/*
+	 * Find worker for requested xid.
+	 */
+	entry = find_or_start_worker(stream_xid, false);
 
-	/* write the change to the current file */
-	stream_write_change(action, s);
+	// elog(LOG, "sending message of length=%d and raw=%s, action=%s", s->len, s->data, (char *) &action);
+	shm_mq_send(entry->mq_handle, s->len, s->data, false);
+	nchanges += 1;
+
+	// iter_sleep(3600);
 
 	return true;
 }
@@ -624,7 +758,8 @@ apply_handle_origin(StringInfo s)
 static void
 apply_handle_stream_start(StringInfo s)
 {
-	bool		first_segment;
+	bool		 first_segment;
+	WorkerState *entry;
 
 	Assert(!in_streamed_transaction);
 
@@ -633,17 +768,16 @@ apply_handle_stream_start(StringInfo s)
 
 	/* extract XID of the top-level transaction */
 	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+	nchanges = 0;
 
-	/* open the spool file for this transaction */
-	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+	/* Find worker for requested xid */
+	entry = find_or_start_worker(stream_xid, true);
 
-	/*
-	 * if this is not the first segment, open existing file
-	 *
-	 * XXX Note that the cleanup is performed by stream_open_file.
-	 */
-	if (!first_segment)
-		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+	SpinLockAcquire(&entry->pstate->mutex);
+	entry->pstate->ready = false;
+	SpinLockRelease(&entry->pstate->mutex);
+
+	elog(LOG, "starting streaming of xid %u", stream_xid);
 
 	pgstat_report_activity(STATE_RUNNING, NULL);
 }
@@ -654,16 +788,19 @@ apply_handle_stream_start(StringInfo s)
 static void
 apply_handle_stream_stop(StringInfo s)
 {
+	WorkerState *entry;
+	char action = 'E';
+
 	Assert(in_streamed_transaction);
 
-	/*
-	 * Close the file with serialized changes, and serialize information about
-	 * subxacts for the toplevel transaction.
-	 */
-	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
-	stream_close_file();
+	/* Find worker for requested xid */
+	entry = find_or_start_worker(stream_xid, false);
+
+	shm_mq_send(entry->mq_handle, 1, &action, false);
+	wait_for_worker(entry);
 
 	in_streamed_transaction = false;
+	elog(LOG, "stopped streaming of xid %u, %u changes streamed", stream_xid, nchanges);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
@@ -676,96 +813,67 @@ apply_handle_stream_abort(StringInfo s)
 {
 	TransactionId xid;
 	TransactionId subxid;
+	WorkerState *entry;
 
 	Assert(!in_streamed_transaction);
 
-	logicalrep_read_stream_abort(s, &xid, &subxid);
-
-	/*
-	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
-	 * just delete the files with serialized info.
-	 */
-	if (xid == subxid)
+	if (isLogicalApplyWorker)
 	{
-		char		path[MAXPGPATH];
+		subxid = pq_getmsgint(s, 4);
 
-		/*
-		 * XXX Maybe this should be an error instead? Can we receive abort for
-		 * a toplevel transaction we haven't received?
-		 */
+		ereport(LOG,
+				(errcode_for_file_access(),
+				errmsg("[Apply BGW #%u] aborting current transaction xid=%u, subxid=%u",
+				MyParallelState->n, GetCurrentTransactionIdIfAny(), GetCurrentSubTransactionId())));
 
-		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		if (subxid == stream_xid)
+			AbortCurrentTransaction();
+		else
+		{
+			char *spname = (char *) palloc(64 * sizeof(char));
+			sprintf(spname, "savepoint_for_xid_%u", subxid);
 
-		if (unlink(path) < 0)
-			ereport(ERROR,
+			ereport(LOG,
 					(errcode_for_file_access(),
-					 errmsg("could not remove file \"%s\": %m", path)));
+					errmsg("[Apply BGW #%u] rolling back to savepoint %s", MyParallelState->n, spname)));
 
-		subxact_filename(path, MyLogicalRepWorker->subid, xid);
-
-		if (unlink(path) < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not remove file \"%s\": %m", path)));
+			RollbackToSavepoint(spname);
+			CommitTransactionCommand();
+			// RollbackAndReleaseCurrentSubTransaction();
 
-		return;
+			pfree(spname);
+		}
 	}
 	else
 	{
-		/*
-		 * OK, so it's a subxact. We need to read the subxact file for the
-		 * toplevel transaction, determine the offset tracked for the subxact,
-		 * and truncate the file with changes. We also remove the subxacts
-		 * with higher offsets (or rather higher XIDs).
-		 *
-		 * We intentionally scan the array from the tail, because we're likely
-		 * aborting a change for the most recent subtransactions.
-		 *
-		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
-		 * would allow us to use binary search here.
-		 *
-		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
-		 * order, i.e. from the inner-most subxact (when nested)? In which
-		 * case we could simply check the last element.
-		 */
+		xid = pq_getmsgint(s, 4);
+		subxid = pq_getmsgint(s, 4);
 
-		int64		i;
-		int64		subidx;
-		bool		found = false;
-		char		path[MAXPGPATH];
+		/* Find worker for requested xid */
+		entry = find_or_start_worker(stream_xid, false);
 
-		subidx = -1;
-		subxact_info_read(MyLogicalRepWorker->subid, xid);
+		elog(LOG, "processing abort request of streamed transaction xid %u, subxid %u",
+			xid, subxid);
+		shm_mq_send(entry->mq_handle, s->len, s->data, false);
 
-		/* FIXME optimize the search by bsearch on sorted data */
-		for (i = nsubxacts; i > 0; i--)
+		if (subxid == stream_xid)
 		{
-			if (subxacts[i - 1].xid == subxid)
-			{
-				subidx = (i - 1);
-				found = true;
-				break;
-			}
-		}
-
-		/* We should not receive aborts for unknown subtransactions. */
-		Assert(found);
+			char action = 'F';
+			shm_mq_send(entry->mq_handle, 1, &action, false);
+			// shm_mq_send(entry->mq_handle, 0, NULL, false);
 
-		/* OK, truncate the file at the right offset. */
-		Assert((subidx >= 0) && (subidx < nsubxacts));
+			wait_for_worker_to_finish(entry);
 
-		changes_filename(path, MyLogicalRepWorker->subid, xid);
+			elog(LOG, "adding finished apply worker #%u for xid %u to the idle list",
+												entry->pstate->n, entry->xid);
+			ApplyWorkersIdleList[nfreeworkers++] = entry;
 
-		if (truncate(path, subxacts[subidx].offset))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not truncate file \"%s\": %m", path)));
+			// elog(LOG, "detaching DSM of apply worker for xid=%u\n", entry->xid);
+			// dsm_detach(entry->dsm_seg);
 
-		/* discard the subxacts added later */
-		nsubxacts = subidx;
-
-		/* write the updated subxact list */
-		subxact_info_write(MyLogicalRepWorker->subid, xid);
+			// /* Delete worker entry */
+			// (void) hash_search(ApplyWorkersHash, &xid, HASH_REMOVE, NULL);
+		}
 	}
 }
 
@@ -775,159 +883,56 @@ apply_handle_stream_abort(StringInfo s)
 static void
 apply_handle_stream_commit(StringInfo s)
 {
-	int			fd;
 	TransactionId xid;
-	StringInfoData s2;
-	int			nchanges;
-
-	char		path[MAXPGPATH];
-	char	   *buffer = NULL;
+	WorkerState *entry;
 	LogicalRepCommitData commit_data;
 
-	MemoryContext oldcxt;
-
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	/* open the spool file for the committed transaction */
-	changes_filename(path, MyLogicalRepWorker->subid, xid);
-
-	elog(DEBUG1, "replaying changes from file '%s'", path);
-
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
+	if (isLogicalApplyWorker)
 	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-	}
-
-	/* XXX Should this be allocated in another memory context? */
+		// logicalrep_read_stream_commit(s, &commit_data);
 
-	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
-
-	buffer = palloc(8192);
-	initStringInfo(&s2);
-
-	MemoryContextSwitchTo(oldcxt);
-
-	ensure_transaction();
-
-	/*
-	 * Make sure the handle apply_dispatch methods are aware we're in a remote
-	 * transaction.
-	 */
-	in_remote_transaction = true;
-	pgstat_report_activity(STATE_RUNNING, NULL);
-
-	/*
-	 * Read the entries one by one and pass them through the same logic as in
-	 * apply_dispatch.
-	 */
-	nchanges = 0;
-	while (true)
+		CommitTransactionCommand();
+	}
+	else
 	{
-		int			nbytes;
-		int			len;
-
-		/* read length of the on-disk record */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		nbytes = read(fd, &len, sizeof(len));
-		pgstat_report_wait_end();
-
-		/* have we reached end of the file? */
-		if (nbytes == 0)
-			break;
-
-		/* do we have a correct length? */
-		if (nbytes != sizeof(len))
-		{
-			int			save_errno = errno;
-
-			CloseTransientFile(fd);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file: %m")));
-			return;
-		}
-
-		Assert(len > 0);
+		char action = 'F';
 
-		/* make sure we have sufficiently large buffer */
-		buffer = repalloc(buffer, len);
-
-		/* and finally read the data into the buffer */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		if (read(fd, buffer, len) != len)
-		{
-			int			save_errno = errno;
-
-			CloseTransientFile(fd);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read file: %m")));
-			return;
-		}
-		pgstat_report_wait_end();
+		Assert(!in_streamed_transaction);
 
-		/* copy the buffer to the stringinfo and call apply_dispatch */
-		resetStringInfo(&s2);
-		appendBinaryStringInfo(&s2, buffer, len);
+		xid = pq_getmsgint(s, 4);
+		logicalrep_read_stream_commit(s, &commit_data);
 
-		/* Ensure we are reading the data into our memory context. */
-		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+		elog(DEBUG1, "received commit for streamed transaction %u", xid);
 
-		apply_dispatch(&s2);
+		/* Find worker for requested xid */
+		entry = find_or_start_worker(xid, false);
 
-		MemoryContextReset(ApplyMessageContext);
+		/* Send commit message */
+		shm_mq_send(entry->mq_handle, s->len, s->data, false);
 
-		MemoryContextSwitchTo(oldcxt);
+		/* Notify worker, that we are done with this xact */
+		shm_mq_send(entry->mq_handle, 1, &action, false);
 
-		nchanges++;
+		wait_for_worker_to_finish(entry);
 
-		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
-				 nchanges, path);
+		elog(LOG, "adding finished apply worker #%u for xid %u to the idle list",
+											entry->pstate->n, entry->xid);
+		ApplyWorkersIdleList[nfreeworkers++] = entry;
 
 		/*
-		 * send feedback to upstream
-		 *
-		 * XXX Probably should send a valid LSN. But which one?
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
 		 */
-		send_feedback(InvalidXLogRecPtr, false, false);
-	}
-
-	CloseTransientFile(fd);
-
-	/*
-	 * Update origin state so we can restart streaming from correct
-	 * position in case of crash.
-	 */
-	replorigin_session_origin_lsn = commit_data.end_lsn;
-	replorigin_session_origin_timestamp = commit_data.committime;
-
-	CommitTransactionCommand();
-	pgstat_report_stat(false);
-
-	store_flush_position(commit_data.end_lsn);
-
-	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
-		 nchanges, path);
+		replorigin_session_origin_lsn = commit_data.end_lsn;
+		replorigin_session_origin_timestamp = commit_data.committime;
 
-	in_remote_transaction = false;
-	pgstat_report_activity(STATE_IDLE, NULL);
+		pgstat_report_stat(false);
 
-	/* unlink the files with serialized changes and subxact info */
-	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+		store_flush_position(commit_data.end_lsn);
 
-	pfree(buffer);
-	pfree(s2.data);
+		in_remote_transaction = false;
+		pgstat_report_activity(STATE_IDLE, NULL);
+	}
 }
 
 /*
@@ -946,6 +951,8 @@ apply_handle_relation(StringInfo s)
 	if (handle_streamed_transaction('R', s))
 		return;
 
+	// iter_sleep(3600);
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -1386,6 +1393,38 @@ apply_dispatch(StringInfo s)
 {
 	char		action = pq_getmsgbyte(s);
 
+	if (isLogicalApplyWorker)
+	{
+		/*
+		 * Inside a logical apply worker we can detect that a new
+		 * subtransaction was started when a change arrives with a different
+		 * xid. In that case we define a named savepoint, so that we can
+		 * commit/rollback it separately later.
+		 *
+		 * A special case is when the first change comes from a
+		 * subtransaction; then we check that current_xid differs from
+		 * stream_xid.
+		 */
+		current_xid = pq_getmsgint(s, 4);
+
+		if (current_xid != stream_xid
+			&& ((TransactionIdIsValid(prev_xid) && current_xid != prev_xid)
+				|| !TransactionIdIsValid(prev_xid)))
+		{
+			char *spname = (char *) palloc(64 * sizeof(char));
+			sprintf(spname, "savepoint_for_xid_%u", current_xid);
+
+			elog(LOG, "[Apply BGW #%u] defining savepoint %s", MyParallelState->n, spname);
+
+			DefineSavepoint(spname);
+			CommitTransactionCommand();
+			// BeginInternalSubTransaction(NULL);
+		}
+
+		prev_xid = current_xid;
+	}
+	// else
+	// 	elog(LOG, "Logical worker: applying dispatch for action=%s", (char *) &action);
+
 	switch (action)
 	{
 			/* BEGIN */
@@ -1414,6 +1453,7 @@ apply_dispatch(StringInfo s)
 			break;
 			/* RELATION */
 		case 'R':
+			// elog(LOG, "%s worker: applying dispatch for action=R", isLogicalApplyWorker ? "Apply" : "Logical");
 			apply_handle_relation(s);
 			break;
 			/* TYPE */
@@ -1544,12 +1584,18 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 static void
 worker_onexit(int code, Datum arg)
 {
-	int	i;
+	HASH_SEQ_STATUS status;
+	WorkerState *entry;
 
-	elog(LOG, "cleanup files for %d transactions", nxids);
-
-	for (i = nxids-1; i >= 0; i--)
-		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+	if (ApplyWorkersHash != NULL)
+	{
+		hash_seq_init(&status, ApplyWorkersHash);
+		while ((entry = (WorkerState *) hash_seq_search(&status)) != NULL)
+		{
+			stop_worker(entry);
+		}
+		hash_seq_term(&status);
+	}
 }
 
 /*
@@ -1572,6 +1618,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
+	ApplyWorkersIdleList = palloc(sizeof(WorkerState *) * pool_size);
+
 	for (;;)
 	{
 		pgsocket	fd = PGINVALID_SOCKET;
@@ -1883,8 +1931,9 @@ maybe_reread_subscription(void)
 	Subscription *newsub;
 	bool		started_tx = false;
 
+	// TODO Probably we have to handle subscription reread in apply workers too.
 	/* When cache state is valid there is nothing to do here. */
-	if (MySubscriptionValid)
+	if (MySubscriptionValid || isLogicalApplyWorker)
 		return;
 
 	/* This function might be called inside or outside of transaction. */
@@ -2018,608 +2067,50 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
-/*
- * subxact_info_write
- *	  Store information about subxacts for a toplevel transaction.
- *
- * For each subxact we store offset of it's first change in the main file.
- * The file is always over-written as a whole, and we also include CRC32C
- * checksum of the information.
- *
- * XXX We should only store subxacts that were not aborted yet.
- *
- * XXX Maybe we should only include the checksum when the cluster is
- * initialized with checksums?
- *
- * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
- */
+/* SIGHUP: set flag to reload configuration at next convenient time */
 static void
-subxact_info_write(Oid subid, TransactionId xid)
+logicalrep_worker_sighup(SIGNAL_ARGS)
 {
-	int			fd;
-	char		path[MAXPGPATH];
-	uint32		checksum;
-	Size		len;
-
-	Assert(TransactionIdIsValid(xid));
-
-	subxact_filename(path, subid, xid);
-
-	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	len = sizeof(SubXactInfo) * nsubxacts;
-
-	/* compute the checksum */
-	INIT_CRC32C(checksum);
-	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
-	COMP_CRC32C(checksum, (char *) subxacts, len);
-	FIN_CRC32C(checksum);
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
-
-	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
-	{
-		int			save_errno = errno;
+	int			save_errno = errno;
 
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
-		return;
-	}
+	got_SIGHUP = true;
 
-	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
-	{
-		int			save_errno = errno;
+	/* Waken anything waiting on the process latch */
+	SetLatch(MyLatch);
 
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
-		return;
-	}
+	errno = save_errno;
+}
 
-	if ((len > 0) && (write(fd, subxacts, len) != len))
-	{
-		int			save_errno = errno;
+/* Logical Replication Apply worker entry point */
+void
+ApplyWorkerMain(Datum main_arg)
+{
+	int			worker_slot = DatumGetInt32(main_arg);
+	MemoryContext oldctx;
+	char		originname[NAMEDATALEN];
+	XLogRecPtr	origin_startpos;
+	char	   *myslotname;
+	WalRcvStreamOptions options;
 
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
-		return;
-	}
+	/* Attach to slot */
+	logicalrep_worker_attach(worker_slot);
 
-	pgstat_report_wait_end();
+	/* Setup signal handling */
+	pqsignal(SIGHUP, logicalrep_worker_sighup);
+	pqsignal(SIGTERM, die);
+	BackgroundWorkerUnblockSignals();
 
 	/*
-	 * We don't need to fsync or anything, as we'll recreate the files after a
-	 * crash from scratch. So just close the file.
+	 * We don't currently need any ResourceOwner in a walreceiver process, but
+	 * if we did, we could call CreateAuxProcessResourceOwner here.
 	 */
-	CloseTransientFile(fd);
 
-	/*
-	 * But we free the memory allocated for subxact info. There might be one
-	 * exceptional transaction with many subxacts, and we don't want to keep
-	 * the memory allocated forewer.
-	 *
-	 */
-	if (subxacts)
-		pfree(subxacts);
+	/* Initialise stats to a sanish value */
+	MyLogicalRepWorker->last_send_time = MyLogicalRepWorker->last_recv_time =
+		MyLogicalRepWorker->reply_time = GetCurrentTimestamp();
 
-	subxacts = NULL;
-	subxact_last = InvalidTransactionId;
-	nsubxacts = 0;
-	nsubxacts_max = 0;
-}
-
-/*
- * subxact_info_read
- *	  Restore information about subxacts of a streamed transaction.
- *
- * Read information about subxacts into the global variables, and while
- * reading the information verify the checksum.
- *
- * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
- *
- * XXX Do we need to allocate it in TopMemoryContext?
- */
-static void
-subxact_info_read(Oid subid, TransactionId xid)
-{
-	int			fd;
-	char		path[MAXPGPATH];
-	uint32		checksum;
-	uint32		checksum_new;
-	Size		len;
-	MemoryContext oldctx;
-
-	Assert(TransactionIdIsValid(xid));
-	Assert(!subxacts);
-	Assert(nsubxacts == 0);
-	Assert(nsubxacts_max == 0);
-
-	subxact_filename(path, subid, xid);
-
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
-
-	/* read the checksum */
-	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	/* read number of subxact items */
-	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	pgstat_report_wait_end();
-
-	len = sizeof(SubXactInfo) * nsubxacts;
-
-	/* we keep the maximum as a power of 2 */
-	nsubxacts_max = 1 << my_log2(nsubxacts);
-
-	/* subxacts are long-lived */
-	oldctx = MemoryContextSwitchTo(TopMemoryContext);
-	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
-	MemoryContextSwitchTo(oldctx);
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
-
-	if ((len > 0) && ((read(fd, subxacts, len)) != len))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	pgstat_report_wait_end();
-
-	/* recompute the checksum */
-	INIT_CRC32C(checksum_new);
-	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
-	COMP_CRC32C(checksum_new, (char *) subxacts, len);
-	FIN_CRC32C(checksum_new);
-
-	if (checksum_new != checksum)
-		ereport(ERROR,
-				(errmsg("checksum failure when reading subxacts")));
-
-	CloseTransientFile(fd);
-}
-
-/*
- * subxact_info_add
- *	  Add information about a subxact (offset in the main file).
- *
- * XXX Do we need to allocate it in TopMemoryContext?
- */
-static void
-subxact_info_add(TransactionId xid)
-{
-	int64		i;
-
-	/*
-	 * If the XID matches the toplevel transaction, we don't want to add it.
-	 */
-	if (stream_xid == xid)
-		return;
-
-	/*
-	 * In most cases we're checking the same subxact as we've already seen in
-	 * the last call, so make ure just ignore it (this change comes later).
-	 */
-	if (subxact_last == xid)
-		return;
-
-	/* OK, remember we're processing this XID. */
-	subxact_last = xid;
-
-	/*
-	 * Check if the transaction is already present in the array of subxact. We
-	 * intentionally scan the array from the tail, because we're likely adding
-	 * a change for the most recent subtransactions.
-	 *
-	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
-	 * would allow us to use binary search here.
-	 */
-	for (i = nsubxacts; i > 0; i--)
-	{
-		/* found, so we're done */
-		if (subxacts[i - 1].xid == xid)
-			return;
-	}
-
-	/* This is a new subxact, so we need to add it to the array. */
-
-	if (nsubxacts == 0)
-	{
-		MemoryContext oldctx;
-
-		nsubxacts_max = 128;
-		oldctx = MemoryContextSwitchTo(TopMemoryContext);
-		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
-		MemoryContextSwitchTo(oldctx);
-	}
-	else if (nsubxacts == nsubxacts_max)
-	{
-		nsubxacts_max *= 2;
-		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
-	}
-
-	subxacts[nsubxacts].xid = xid;
-	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
-
-	nsubxacts++;
-}
-
-/* format filename for file containing the info about subxacts */
-static void
-subxact_filename(char *path, Oid subid, TransactionId xid)
-{
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 *
-	 * Don't check for error from mkdir; it could fail if the directory
-	 * already exists (maybe someone else just did the same thing).  If
-	 * it doesn't work then we'll bomb out when opening the file
-	 */
-	mkdir(tempdirpath, S_IRWXU);
-
-	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
-			 tempdirpath, subid, xid);
-}
-
-/* format filename for file containing serialized changes */
-static void
-changes_filename(char *path, Oid subid, TransactionId xid)
-{
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 *
-	 * Don't check for error from mkdir; it could fail if the directory
-	 * already exists (maybe someone else just did the same thing).  If
-	 * it doesn't work then we'll bomb out when opening the file
-	 */
-	mkdir(tempdirpath, S_IRWXU);
-
-	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
-			 tempdirpath, subid, xid);
-}
-
-/*
- * stream_cleanup_files
- *	  Cleanup files for a subscription / toplevel transaction.
- *
- * Remove files with serialized changes and subxact info for a particular
- * toplevel transaction. Each subscription has a separate set of files.
- *
- * Note: The files may not exists, so handle ENOENT as non-error.
- *
- * TODO: Add missing_ok flag to specify in which cases it's OK not to
- * find the files, and when it's an error.
- */
-static void
-stream_cleanup_files(Oid subid, TransactionId xid)
-{
-	int			i;
-	char		path[MAXPGPATH];
-	bool		found = false;
-
-	subxact_filename(path, subid, xid);
-
-	if ((unlink(path) < 0) && (errno != ENOENT))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
-
-	changes_filename(path, subid, xid);
-
-	if ((unlink(path) < 0) && (errno != ENOENT))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
-
-	/*
-	 * Cleanup the XID from the array - find the XID in the array and
-	 * remove it by shifting all the remaining elements. The array is
-	 * bound to be fairly small (maximum number of in-progress xacts,
-	 * so max_connections + max_prepared_transactions) so simply loop
-	 * through the array and find index of the XID. Then move the rest
-	 * of the array by one element to the left.
-	 *
-	 * Notice we also call this from stream_open_file for first segment
-	 * of each transaction, to deal with possible left-overs after a
-	 * crash, so it's entirely possible not to find the XID in the
-	 * array here. In that case we don't remove anything.
-	 *
-	 * XXX Perhaps it'd be better to handle this automatically after a
-	 * restart, instead of doing it over and over for each transaction.
-	 */
-	for (i = 0; i < nxids; i++)
-	{
-		if (xids[i] == xid)
-		{
-			found = true;
-			break;
-		}
-	}
-
-	if (!found)
-		return;
-
-	/*
-	 * Move the last entry from the array to the place. We don't keep
-	 * the streamed transactions sorted or anything - we only expect 
-	 * a few of them in progress (max_connections + max_prepared_xacts)
-	 * so linear search is just fine.
-	 */
-	xids[i] = xids[nxids-1];
-	nxids--;
-}
-
-/*
- * stream_open_file
- *	  Open file we'll use to serialize changes for a toplevel transaction.
- *
- * Open a file for streamed changes from a toplevel transaction identified
- * by stream_xid (global variable). If it's the first chunk of streamed
- * changes for this transaction, perform cleanup by removing existing
- * files after a possible previous crash.
- *
- * This can only be called at the beginning of a "streaming" block, i.e.
- * between stream_start/stream_stop messages from the upstream.
- */
-static void
-stream_open_file(Oid subid, TransactionId xid, bool first_segment)
-{
-	char		path[MAXPGPATH];
-	int			flags;
-
-	Assert(in_streamed_transaction);
-	Assert(OidIsValid(subid));
-	Assert(TransactionIdIsValid(xid));
-	Assert(stream_fd == -1);
-
-	/*
-	 * If this is the first segment for this transaction, try removing
-	 * existing files (if there are any, possibly after a crash).
-	 */
-	if (first_segment)
-	{
-		MemoryContext	oldcxt;
-
-		/* XXX make sure there are no previous files for this transaction */
-		stream_cleanup_files(subid, xid);
-
-		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
-
-		/*
-		 * We need to remember the XIDs we spilled to files, so that we can
-		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
-		 *
-		 * The number of XIDs we may need to track is fairly small, because
-		 * we can only stream toplevel xacts (so limited by max_connections
-		 * and max_prepared_transactions), and we only stream the large ones.
-		 * So we simply keep the XIDs in an unsorted array. If the number of
-		 * xacts gets large for some reason (e.g. very high max_connections),
-		 * a more elaborate approach might be better - e.g. sorted array, to
-		 * speed-up the lookups.
-		 */
-		if (nxids == maxnxids)	/* array of XIDs is full */
-		{
-			if (!xids)
-			{
-				maxnxids = 64;
-				xids = palloc(maxnxids * sizeof(TransactionId));
-			}
-			else
-			{
-				maxnxids = 2 * maxnxids;
-				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
-			}
-		}
-
-		xids[nxids++] = xid;
-
-		MemoryContextSwitchTo(oldcxt);
-	}
-
-	changes_filename(path, subid, xid);
-
-	elog(DEBUG1, "opening file '%s' for streamed changes", path);
-
-	/*
-	 * If this is the first streamed segment, the file must not exist, so
-	 * make sure we're the ones creating it. Otherwise just open the file
-	 * for writing, in append mode.
-	 */
-	if (first_segment)
-		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
-	else
-		flags = (O_WRONLY | O_APPEND | PG_BINARY);
-
-	stream_fd = OpenTransientFile(path, flags);
-
-	if (stream_fd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-}
-
-/*
- * stream_close_file
- *	  Close the currently open file with streamed changes.
- *
- * This can only be called at the beginning of a "streaming" block, i.e.
- * between stream_start/stream_stop messages from the upstream.
- */
-static void
-stream_close_file(void)
-{
-	Assert(in_streamed_transaction);
-	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
-
-	CloseTransientFile(stream_fd);
-
-	stream_xid = InvalidTransactionId;
-	stream_fd = -1;
-}
-
-/*
- * stream_write_change
- *	  Serialize a change to a file for the current toplevel transaction.
- *
- * The change is serialied in a simple format, with length (not including
- * the length), action code (identifying the message type) and message
- * contents (without the subxact TransactionId value).
- *
- * XXX The subxact file includes CRC32C of the contents. Maybe we should
- * include something like that here too, but doing so will not be as
- * straighforward, because we write the file in chunks.
- */
-static void
-stream_write_change(char action, StringInfo s)
-{
-	int			len;
-
-	Assert(in_streamed_transaction);
-	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
-
-	/* total on-disk size, including the action type character */
-	len = (s->len - s->cursor) + sizeof(char);
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
-
-	/* first write the size */
-	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
-
-	/* then the action */
-	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
-
-	/* and finally the remaining part of the buffer (after the XID) */
-	len = (s->len - s->cursor);
-
-	if (write(stream_fd, &s->data[s->cursor], len) != len)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
-
-	pgstat_report_wait_end();
-}
-
-/* SIGHUP: set flag to reload configuration at next convenient time */
-static void
-logicalrep_worker_sighup(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	got_SIGHUP = true;
-
-	/* Waken anything waiting on the process latch */
-	SetLatch(MyLatch);
-
-	errno = save_errno;
-}
-
-/* Logical Replication Apply worker entry point */
-void
-ApplyWorkerMain(Datum main_arg)
-{
-	int			worker_slot = DatumGetInt32(main_arg);
-	MemoryContext oldctx;
-	char		originname[NAMEDATALEN];
-	XLogRecPtr	origin_startpos;
-	char	   *myslotname;
-	WalRcvStreamOptions options;
-
-	/* Attach to slot */
-	logicalrep_worker_attach(worker_slot);
-
-	/* Setup signal handling */
-	pqsignal(SIGHUP, logicalrep_worker_sighup);
-	pqsignal(SIGTERM, die);
-	BackgroundWorkerUnblockSignals();
-
-	/*
-	 * We don't currently need any ResourceOwner in a walreceiver process, but
-	 * if we did, we could call CreateAuxProcessResourceOwner here.
-	 */
-
-	/* Initialise stats to a sanish value */
-	MyLogicalRepWorker->last_send_time = MyLogicalRepWorker->last_recv_time =
-		MyLogicalRepWorker->reply_time = GetCurrentTimestamp();
-
-	/* Load the libpq-specific functions */
-	load_file("libpqwalreceiver", false);
+	/* Load the libpq-specific functions */
+	load_file("libpqwalreceiver", false);
 
 	/* Run as replica session replication role. */
 	SetConfigOption("session_replication_role", "replica",
@@ -2775,3 +2266,580 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Apply Background Worker main loop.
+ */
+void
+LogicalApplyBgwMain(Datum main_arg)
+{
+	volatile ParallelState *pst;
+
+	dsm_segment			*seg;
+	shm_toc				*toc;
+	PGPROC				*registrant;
+	shm_mq				*mq;
+	shm_mq_handle		*mqh;
+	shm_mq_result		 shmq_res;
+	// ConditionVariable	 cv;
+	LogicalRepWorker	 lrw;
+	MemoryContext		 oldcontext;
+
+	MemoryContextSwitchTo(TopMemoryContext);
+
+	/* Load the subscription into persistent memory context. */
+	ApplyContext = AllocSetContextCreate(TopMemoryContext,
+										 "ApplyContext",
+										 ALLOCSET_DEFAULT_SIZES);
+
+	oldcontext = MemoryContextSwitchTo(ApplyContext);
+
+	/*
+	 * Init the ApplyMessageContext which we clean up after each replication
+	 * protocol message.
+	 */
+	ApplyMessageContext = AllocSetContextCreate(ApplyContext,
+												"ApplyMessageContext",
+												ALLOCSET_DEFAULT_SIZES);
+
+	isLogicalApplyWorker = true;
+
+	/*
+	 * Establish signal handlers.
+	 *
+	 * We want CHECK_FOR_INTERRUPTS() to kill off this worker process just as
+	 * it would a normal user backend.  To make that happen, we establish a
+	 * signal handler that is a stripped-down version of die().
+	 */
+	pqsignal(SIGTERM, handle_sigterm);
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Connect to the dynamic shared memory segment.
+	 *
+	 * The backend that registered this worker passed us the ID of a shared
+	 * memory segment to which we must attach for further instructions.  In
+	 * order to attach to dynamic shared memory, we need a resource owner.
+	 * Once we've mapped the segment in our address space, attach to the table
+	 * of contents so we can locate the various data structures we'll need to
+	 * find within the segment.
+	 */
+	CurrentResourceOwner = ResourceOwnerCreate(NULL, "Logical apply worker");
+	seg = dsm_attach(DatumGetInt32(main_arg));
+	if (seg == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to map dynamic shared memory segment")));
+	toc = shm_toc_attach(PG_LOGICAL_APPLY_SHM_MAGIC, dsm_segment_address(seg));
+	if (toc == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("bad magic number in dynamic shared memory segment")));
+
+	/*
+	 * Acquire a worker number.
+	 *
+	 * By convention, the process registering this background worker should
+	 * have stored the control structure at key 0.  We look up that key to
+	 * find it.  Our worker number gives our identity: there may be just one
+	 * worker involved in this parallel operation, or there may be many.
+	 */
+	pst = shm_toc_lookup(toc, 0, false);
+	MyParallelState = pst;
+
+	SpinLockAcquire(&pst->mutex);
+	pst->attached = true;
+	SpinLockRelease(&pst->mutex);
+
+	/*
+	 * Attach to the message queue.
+	 */
+	mq = shm_toc_lookup(toc, 1, false);
+	shm_mq_set_receiver(mq, MyProc);
+	mqh = shm_mq_attach(mq, seg, NULL);
+
+	/* Restore database connection. */
+	BackgroundWorkerInitializeConnectionByOid(pst->database_id,
+											  pst->authenticated_user_id, 0);
+
+	/*
+	 * Set the client encoding to the database encoding, since that is what
+	 * the leader will expect.
+	 */
+	SetClientEncoding(GetDatabaseEncoding());
+
+	lrw.subid = pst->subid;
+	MyLogicalRepWorker = &lrw;
+
+	stream_xid = pst->stream_xid;
+
+	StartTransactionCommand();
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	// PushActiveSnapshot(GetTransactionSnapshot());
+
+	MySubscription = GetSubscription(MyLogicalRepWorker->subid, true);
+
+	/*
+	 * Indicate that we're fully initialized and ready to begin the main part
+	 * of the parallel operation.
+	 *
+	 * Once we signal that we're ready, the user backend is entitled to assume
+	 * that our on_dsm_detach callbacks will fire before we disconnect from
+	 * the shared memory segment and exit.  Generally, that means we must have
+	 * attached to all relevant dynamic shared memory data structures by now.
+	 */
+	SpinLockAcquire(&pst->mutex);
+	pst->ready = true;
+	// cv = pst->cv;
+	// if (pst->workers_ready == pst->workers_total)
+	// {
+	//	 registrant = BackendPidGetProc(MyBgworkerEntry->bgw_notify_pid);
+	//	 if (registrant == NULL)
+	//	 {
+	//		 elog(DEBUG1, "registrant backend has exited prematurely");
+	//		 proc_exit(1);
+	//	 }
+	//	 SetLatch(&registrant->procLatch);
+	// }
+	SpinLockRelease(&pst->mutex);
+	elog(LOG, "[Apply BGW #%u] started", pst->n);
+
+	registrant = BackendPidGetProc(MyBgworkerEntry->bgw_notify_pid);
+	SetLatch(&registrant->procLatch);
+
+	for (;;)
+	{
+		void *data;
+		Size  len;
+		StringInfoData s;
+		MemoryContext	oldctx;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx = MemoryContextSwitchTo(ApplyMessageContext);
+
+		shmq_res = shm_mq_receive(mqh, &len, &data, false);
+
+		if (shmq_res != SHM_MQ_SUCCESS)
+			break;
+
+		if (len == 0)
+		{
+			elog(LOG, "[Apply BGW #%u] got zero-length message, stopping", pst->n);
+			break;
+		}
+		else
+		{
+			s.cursor = 0;
+			s.maxlen = -1;
+			s.data = (char *) data;
+			s.len = len;
+
+			/*
+			 * The first byte of the message doubles as a control code for
+			 * communication between the main logical replication worker and
+			 * the apply BGWorkers, so if it differs from 'w', process it
+			 * here first.
+			 */
+			switch (pq_getmsgbyte(&s))
+			{
+				/* Stream stop */
+				case 'E':
+				{
+					in_remote_transaction = false;
+
+					SpinLockAcquire(&pst->mutex);
+					pst->ready = true;
+					SpinLockRelease(&pst->mutex);
+					SetLatch(&registrant->procLatch);
+
+					elog(LOG, "[Apply BGW #%u] ended processing streaming chunk, waiting on shm_mq_receive", pst->n);
+
+					continue;
+				}
+				/* Reassign to the new transaction */
+				case 'R':
+				{
+					elog(LOG, "[Apply BGW #%u] switching from processing xid %u to xid %u",
+											pst->n, stream_xid, pst->stream_xid);
+					stream_xid = pst->stream_xid;
+
+					StartTransactionCommand();
+					BeginTransactionBlock();
+					CommitTransactionCommand();
+					StartTransactionCommand();
+
+					MySubscription = GetSubscription(MyLogicalRepWorker->subid, true);
+
+					continue;
+				}
+				/* Finished processing xact */
+				case 'F':
+				{
+					elog(LOG, "[Apply BGW #%u] finished processing xact %u", pst->n, stream_xid);
+
+					MemoryContextSwitchTo(ApplyContext);
+
+					CommitTransactionCommand();
+					EndTransactionBlock(false);
+					CommitTransactionCommand();
+
+					SpinLockAcquire(&pst->mutex);
+					pst->finished = true;
+					SpinLockRelease(&pst->mutex);
+
+					continue;
+				}
+				default:
+					break;
+			}
+
+			/* Skip the 'w' message header the leader forwarded verbatim. */
+			pq_getmsgint64(&s);	/* start LSN */
+			pq_getmsgint64(&s);	/* end LSN */
+			pq_getmsgint64(&s);	/* send timestamp; TODO do we need to process these here again? */
+
+			/*
+			 * Make sure the apply_dispatch handler methods are aware we're
+			 * in a remote transaction.
+			 */
+			in_remote_transaction = true;
+			pgstat_report_activity(STATE_RUNNING, NULL);
+
+			elog(DEBUG5, "[Apply BGW #%u] applying dispatch for action=%s",
+									pst->n, (char *) &s.data[s.cursor]);
+			apply_dispatch(&s);
+		}
+
+		MemoryContextSwitchTo(oldctx);
+		MemoryContextReset(ApplyMessageContext);
+	}
+
+	CommitTransactionCommand();
+	EndTransactionBlock(false);
+	CommitTransactionCommand();
+
+	MemoryContextSwitchTo(oldcontext);
+	MemoryContextReset(ApplyContext);
+
+	SpinLockAcquire(&pst->mutex);
+	pst->finished = true;
+	// if (pst->workers_finished == pst->workers_total)
+	// {
+	//	 registrant = BackendPidGetProc(MyBgworkerEntry->bgw_notify_pid);
+	//	 if (registrant == NULL)
+	//	 {
+	//		 elog(DEBUG1, "registrant backend has exited prematurely");
+	//		 proc_exit(1);
+	//	 }
+	//	 SetLatch(&registrant->procLatch);
+	// }
+	SpinLockRelease(&pst->mutex);
+
+	elog(LOG, "[Apply BGW #%u] exiting", pst->n);
+
+	/* Signal main process that we are done. */
+	// ConditionVariableBroadcast(&cv);
+	SetLatch(&registrant->procLatch);
+
+	/*
+	 * We're done.  Explicitly detach the shared memory segment so that we
+	 * don't get a resource leak warning at commit time.  This will fire any
+	 * on_dsm_detach callbacks we've registered, as well.  Once that's done,
+	 * we can go ahead and exit.
+	 */
+	dsm_detach(seg);
+	proc_exit(0);
+}
+
+/*
+ * When we receive a SIGTERM, we set InterruptPending and ProcDiePending just
+ * like a normal backend.  The next CHECK_FOR_INTERRUPTS() will do the right
+ * thing.
+ */
+static void
+handle_sigterm(SIGNAL_ARGS)
+{
+	int save_errno = errno;
+
+	SetLatch(MyLatch);
+
+	if (!proc_exit_inprogress)
+	{
+		InterruptPending = true;
+		ProcDiePending = true;
+	}
+
+	errno = save_errno;
+}
+
+/*
+ * Set up a dynamic shared memory segment.
+ *
+ * We set up a control region that contains a ParallelState, plus one
+ * region for the message queue used to feed changes to this worker
+ * (each worker gets its own segment and queue).
+ */
+static void
+setup_dsm(WorkerState *wstate)
+{
+	shm_toc_estimator	 e;
+	int					 toc_key = 0;
+	Size				 segsize;
+	dsm_segment			*seg;
+	shm_toc				*toc;
+	ParallelState		*pst;
+	shm_mq				*mq;
+	int64				 queue_size = 160000000; /* 160 MB for now */
+
+	/* Ensure a valid queue size. */
+	if (queue_size < 0 || ((uint64) queue_size) < shm_mq_minimum_size)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("queue size must be at least %zu bytes",
+						shm_mq_minimum_size)));
+	if (queue_size != ((Size) queue_size))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("queue size overflows size_t")));
+
+	/*
+	 * Estimate how much shared memory we need.
+	 *
+	 * Because the TOC machinery may choose to insert padding of oddly-sized
+	 * requests, we must estimate each chunk separately.
+	 *
+	 * We need one key to register the location of the header, and one key
+	 * to track the location of the message queue.
+	 */
+	shm_toc_initialize_estimator(&e);
+	shm_toc_estimate_chunk(&e, sizeof(ParallelState));
+	shm_toc_estimate_chunk(&e, (Size) queue_size);
+
+	shm_toc_estimate_keys(&e, 1 + 1);
+	segsize = shm_toc_estimate(&e);
+
+	/* Create the shared memory segment and establish a table of contents. */
+	seg = dsm_create(shm_toc_estimate(&e), 0);
+	toc = shm_toc_create(PG_LOGICAL_APPLY_SHM_MAGIC, dsm_segment_address(seg),
+						 segsize);
+
+	/* Set up the header region. */
+	pst = shm_toc_allocate(toc, sizeof(ParallelState));
+	SpinLockInit(&pst->mutex);
+	pst->attached = false;
+	pst->ready = false;
+	pst->finished = false;
+	pst->database_id = MyDatabaseId;
+	pst->subid = MyLogicalRepWorker->subid;
+	pst->stream_xid = stream_xid;
+	pst->authenticated_user_id = GetAuthenticatedUserId();
+	pst->n = nworkers + 1;
+	// ConditionVariableInit(&pst->cv);
+
+	shm_toc_insert(toc, toc_key++, pst);
+
+	/* Set up the message queue. */
+	mq = shm_mq_create(shm_toc_allocate(toc, (Size) queue_size),
+						(Size) queue_size);
+	shm_toc_insert(toc, toc_key++, mq);
+	shm_mq_set_sender(mq, MyProc);
+
+	/* Attach to the queue. */
+	wstate->mq_handle = shm_mq_attach(mq, seg, wstate->handle);
+
+	/* Return results to caller. */
+	wstate->dsm_seg = seg;
+	wstate->pstate = pst;
+}
+
+/*
+ * Set up and register a single apply background worker.
+ */
+static void
+setup_background_worker(WorkerState *wstate)
+{
+	MemoryContext		oldcontext;
+	BackgroundWorker	worker;
+
+	elog(LOG, "setting up apply worker #%u", nworkers + 1);
+
+	/*
+	 * TOCHECK: We need the worker_state object and the background worker handles to
+	 * which it points to be allocated in TopMemoryContext rather than
+	 * ApplyMessageContext; otherwise, they'll be destroyed before the on_dsm_detach
+	 * hooks run.
+	 */
+	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+	setup_dsm(wstate);
+
+	/*
+	 * Arrange to kill all the workers if we abort before all workers are
+	 * finished hooking themselves up to the dynamic shared memory segment.
+	 *
+	 * If we die after all the workers have finished hooking themselves up to
+	 * the dynamic shared memory segment, we'll mark the two queues to which
+	 * we're directly connected as detached, and the worker(s) connected to
+	 * those queues will exit, marking any other queues to which they are
+	 * connected as detached.  This will cause any as-yet-unaware workers
+	 * connected to those queues to exit in their turn, and so on, until
+	 * everybody exits.
+	 *
+	 * But suppose the workers which are supposed to connect to the queues to
+	 * which we're directly attached exit due to some error before they
+	 * actually attach the queues.  The remaining workers will have no way of
+	 * knowing this.  From their perspective, they're still waiting for those
+	 * workers to start, when in fact they've already died.
+	 */
+	on_dsm_detach(wstate->dsm_seg, cleanup_background_worker,
+				  PointerGetDatum(wstate));
+
+	/* Configure a worker. */
+	MemSet(&worker, 0, sizeof(BackgroundWorker));
+
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+		BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_ConsistentState;
+	worker.bgw_restart_time = BGW_NEVER_RESTART;
+	worker.bgw_notify_pid = MyProcPid;
+	sprintf(worker.bgw_library_name, "postgres");
+	sprintf(worker.bgw_function_name, "LogicalApplyBgwMain");
+
+	worker.bgw_main_arg = UInt32GetDatum(dsm_segment_handle(wstate->dsm_seg));
+
+	/* Register the worker. */
+	snprintf(worker.bgw_name, BGW_MAXLEN,
+			"logical replication apply worker #%u for subscription %u",
+										nworkers + 1, MySubscription->oid);
+	if (!RegisterDynamicBackgroundWorker(&worker, &wstate->handle))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+					errmsg("could not register background process"),
+					errhint("You may need to increase max_worker_processes.")));
+
+	/* All done. */
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Wait for worker to become ready. */
+	wait_for_worker(wstate);
+
+	/*
+	 * Once we reach this point, all workers are ready.  We no longer need to
+	 * kill them if we die; they'll die on their own as the message queues
+	 * shut down.
+	 */
+	cancel_on_dsm_detach(wstate->dsm_seg, cleanup_background_worker,
+						 PointerGetDatum(wstate));
+
+	nworkers += 1;
+}
+
+static void
+cleanup_background_worker(dsm_segment *seg, Datum arg)
+{
+	WorkerState *wstate = (WorkerState *) DatumGetPointer(arg);
+
+	TerminateBackgroundWorker(wstate->handle);
+}
+
+static void
+wait_for_worker(WorkerState *wstate)
+{
+	bool result = false;
+
+	for (;;)
+	{
+		// ConditionVariable cv;
+		bool ready;
+
+		/* If the worker is ready, we have succeeded. */
+		SpinLockAcquire(&wstate->pstate->mutex);
+		ready = wstate->pstate->ready;
+		// cv = wstate->pstate->cv;
+		SpinLockRelease(&wstate->pstate->mutex);
+		if (ready)
+		{
+			result = true;
+			break;
+		}
+
+		/* If any workers (or the postmaster) have died, we have failed. */
+		if (!check_worker_status(wstate))
+		{
+			result = false;
+			break;
+		}
+
+		/* Wait for the workers to wake us up. */
+		// ConditionVariableSleep(&cv, WAIT_EVENT_LOGICAL_APPLY_WORKER_READY);
+
+		/* Wait to be signalled. */
+		WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+							WAIT_EVENT_LOGICAL_APPLY_WORKER_READY);
+
+		/* Reset the latch so we don't spin. */
+		ResetLatch(MyLatch);
+
+		/* An interrupt may have occurred while we were waiting. */
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	// ConditionVariableCancelSleep();
+
+	if (!result)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("one or more background workers failed to start")));
+}
+
+static bool
+check_worker_status(WorkerState *wstate)
+{
+	BgwHandleStatus status;
+	pid_t			pid;
+
+	status = GetBackgroundWorkerPid(wstate->handle, &pid);
+	if (status == BGWH_STOPPED || status == BGWH_POSTMASTER_DIED)
+		return false;
+
+	/* Otherwise, things still look OK. */
+	return true;
+}
+
+static void
+wait_for_worker_to_finish(WorkerState *wstate)
+{
+	elog(LOG, "waiting for apply worker #%u to finish processing xid %u",
+										wstate->pstate->n, wstate->xid);
+
+	for (;;)
+	{
+		// ConditionVariable cv;
+		bool finished;
+
+		/* If the worker is finished, we have succeeded. */
+		SpinLockAcquire(&wstate->pstate->mutex);
+		finished = wstate->pstate->finished;
+		// cv = wstate->pstate->cv;
+		SpinLockRelease(&wstate->pstate->mutex);
+		if (finished)
+		{
+			break;
+		}
+
+		/* Wait for the workers to wake us up. */
+		// ConditionVariableSleep(&cv, WAIT_EVENT_LOGICAL_APPLY_WORKER_READY);
+
+		/* Wait to be signalled. */
+		WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+							WAIT_EVENT_LOGICAL_APPLY_WORKER_READY);
+
+		/* Reset the latch so we don't spin. */
+		ResetLatch(MyLatch);
+
+		/* An interrupt may have occurred while we were waiting. */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bc45194..342e9fd 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -839,6 +839,7 @@ typedef enum
 	WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+	WAIT_EVENT_LOGICAL_APPLY_WORKER_READY,
 	WAIT_EVENT_LOGICAL_SYNC_DATA,
 	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
 	WAIT_EVENT_MQ_INTERNAL,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index bf02cbc..479c2e2 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -122,12 +122,10 @@ extern TransactionId logicalrep_read_stream_stop(StringInfo in);
 
 extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 										   XLogRecPtr commit_lsn);
-extern TransactionId logicalrep_read_stream_commit(StringInfo out,
-												   LogicalRepCommitData *commit_data);
+extern void logicalrep_read_stream_commit(StringInfo out,
+										  LogicalRepCommitData *commit_data);
 
 extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 										  TransactionId subxid);
-extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
-										 TransactionId *subxid);
 
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/logicalworker.h b/src/include/replication/logicalworker.h
index e9524ae..30ad402 100644
--- a/src/include/replication/logicalworker.h
+++ b/src/include/replication/logicalworker.h
@@ -13,6 +13,7 @@
 #define LOGICALWORKER_H
 
 extern void ApplyWorkerMain(Datum main_arg);
+extern void LogicalApplyBgwMain(Datum main_arg);
 
 extern bool IsLogicalWorker(void);
 
-- 
1.8.3.1
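
To make the control flow of the patch above easier to follow: the main
apply worker and each apply background worker share a per-worker shm_mq,
and the first byte of every queued message doubles as a control code
('E' ends a streamed chunk, 'R' reassigns an idle worker to a new xid,
'F' finishes the transaction, and a zero-length message shuts the worker
down; anything else is a forwarded 'w' replication message). Below is a
condensed leader-side sketch of one streamed transaction; the function
name is made up for illustration, the calls inside it are taken from the
patch (they actually live in handle_streamed_transaction and the
apply_handle_stream_* handlers), and error handling plus the idle-pool
bookkeeping are omitted:

static void
leader_drive_one_streamed_xact(StringInfo s)
{
	WorkerState *w;
	char		action;

	/* stream_start: find or launch the worker for this toplevel xid */
	w = find_or_start_worker(stream_xid, true);

	/* for each streamed change, forward the 'w' message verbatim */
	shm_mq_send(w->mq_handle, s->len, s->data, false);

	/* stream_stop: 'E' marks end of chunk; wait until the worker is ready */
	action = 'E';
	shm_mq_send(w->mq_handle, 1, &action, false);
	wait_for_worker(w);

	/* stream_commit: forward the commit message, then 'F' to finish */
	shm_mq_send(w->mq_handle, s->len, s->data, false);
	action = 'F';
	shm_mq_send(w->mq_handle, 1, &action, false);
	wait_for_worker_to_finish(w);

	/* teardown: a zero-length message makes the worker exit its loop */
	shm_mq_send(w->mq_handle, 0, NULL, false);
}

On the worker side, savepoints stand in for subtransactions:
apply_dispatch defines a named savepoint whenever a change arrives with
a previously unseen subxact xid, and a streamed subxact abort becomes a
RollbackToSavepoint of that savepoint.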

#95Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#93)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:

On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:

On further testing, I found that the patch seems to have problems with
toast. Consider the scenario below:
Session-1
Create table large_text(t1 text);
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

Session-2
SELECT * FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);

*--kaboom*

The second statement in Session-2 leads to a crash.

OK, thanks for the report - will investigate.

It was an assertion failure in ReorderBufferCleanupTXN at the line below:
+ /* Check we're not mixing changes from different transactions. */
+ Assert(change->txn == txn);

Can you still reproduce this issue with the patch I sent on 28/9? I have
been unable to trigger the failure, and it seems pretty similar to the
failure you reported (and I fixed) on 28/9.

Yes, it seems we need a similar change in ReorderBufferAddNewTupleCids. I
think in session-2 you need to create the replication slot before
creating the table in session-1 to see this problem.

--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2196,6 +2196,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb,
TransactionId xid,
        change->data.tuplecid.cmax = cmax;
        change->data.tuplecid.combocid = combocid;
        change->lsn = lsn;
+       change->txn = txn;
        change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
        dlist_push_tail(&txn->tuplecids, &change->node);
A few more comments:
-----------------------------------
1.
+static bool
+check_logical_decoding_work_mem(int *newval, void **extra, GucSource
source)
+{
+ /*
+ * -1 indicates fallback.
+ *
+ * If we haven't yet changed the boot_val default of -1, just let it be.
+ * logical decoding will look to maintenance_work_mem instead.
+ */
+ if (*newval == -1)
+ return true;
+
+ /*
+ * We clamp manually-set values to at least 64kB. The maintenance_work_mem
+ * uses a higher minimum value (1MB), so this is OK.
+ */
+ if (*newval < 64)
+ *newval = 64;

I think this needs to be changed now that we no longer rely on
maintenance_work_mem. Another thing related to this is that the default
value for logical_decoding_work_mem still seems to be -1; we need to make
it 64MB (see the sketch at the end of this mail). I noticed this while
debugging the memory accounting changes, and I think it is the reason I
was not seeing toast-related changes being serialized: in that test, I
hadn't changed the default value of logical_decoding_work_mem.

2.
+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */

/going modify/going to modify/

3.
+ *
+ * While updating the existing change with detoasted tuple data, we need to
+ * update the memory accounting info, because the change size will differ.
+ * Otherwise the accounting may get out of sync, triggering serialization
+ * at unexpected times.
+ *
+ * We simply subtract size of the change before rejiggering the tuple, and
+ * then adding the new size. This makes it look like the change was removed
+ * and then added back, except it only tweaks the accounting info.
+ *
+ * In particular it can't trigger serialization, which would be pointless
+ * anyway as it happens during commit processing right before handing
+ * the change to the output plugin.
  */
 static void
 ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb,
ReorderBufferTXN *txn,
  if (txn->toast_hash == NULL)
  return;
+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */
+ ReorderBufferChangeMemoryUpdate(rb, change, false);

It is not very clear why this change is required. Basically, this is done
at commit time, after which we shouldn't attempt to spill these changes.
This is mentioned in the comments as well, but if that is the case, it is
not clear how and when the accounting can create a problem. If possible,
can you explain it with an example?
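
To make point 1 concrete, here is a rough sketch (untested, and the exact
defaults are my suggestion) of what the guc.c entry could look like once
the -1 fallback and the custom clamping are dropped; the GUC framework
itself can then enforce the 64kB minimum:

	{
		{"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
			gettext_noop("Sets the maximum memory to be used for logical decoding."),
			gettext_noop("This much memory can be used by each internal "
						 "reorder buffer before spilling to disk."),
			GUC_UNIT_KB
		},
		&logical_decoding_work_mem,
		65536, 64, MAX_KILOBYTES,
		NULL, NULL, NULL
	},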

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#96Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#94)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have attempted to test the performance of (Stream + Spill) vs
(Stream + BGW pool), and I can see a gain similar to what Alexey had
shown [1].

In addition to this, I have rebased the latest patchset [2] without
the two-phase logical decoding patch set.

Test results:
I have repeated the same test as Alexey [1] for 1kk and 3kk data, and
here are my results:

Stream + Spill
N      Time on master (sec)    Total xact time (sec)
1kk    6                       21
3kk    18                      55

Stream + BGW pool
N      Time on master (sec)    Total xact time (sec)
1kk    6                       13
3kk    19                      35

Patch details:
All the patches are the same as posted in [2], except:
1. 0006-Gracefully-handle-concurrent-aborts-of-uncommitted -> I have
removed the error handling which is specific to 2PC.

Here [1], I mentioned that I had removed the 2PC changes from
this [0006] patch, but I mistakenly attached the original patch itself
instead of the modified version. So I am attaching the modified version of
only this patch; the other patches are the same.

2. 0007-Implement-streaming-mode-in-ReorderBuffer -> Rebased without 2PC
3. 0009-Extend-the-concurrent-abort-handling-for-in-progress -> New
patch to handle the concurrent-abort error for an in-progress transaction,
and also to add handling for a subtransaction's abort.
4. v3-0014-BGWorkers-pool-for-streamed-transactions-apply -> Rebased
Alexey's patch

[1]: /messages/by-id/CAFiTN-vHoksqvV4BZ0479NhugGe4QHq_ezngNdDd-YRQ_2cwug@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0006-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patchapplication/octet-stream; name=0006-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patchDownload
From 9739d48a979868b912f6e1f90d4702808e061a35 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 3 Oct 2019 09:00:49 +0530
Subject: [PATCH 06/14] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. The decoding logic on the receipt
of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.
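
To illustrate (a simplified sketch only; the actual callers and cleanup
are more involved), the decoding side can catch this error code roughly
like so:

	MemoryContext ccxt = CurrentMemoryContext;

	PG_TRY();
	{
		/* apply a change of the possibly-aborting transaction */
		rb->apply_change(rb, txn, relation, change);
	}
	PG_CATCH();
	{
		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
		ErrorData  *errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/* concurrent abort detected: stop decoding this xact cleanly */
			FlushErrorState();
			FreeErrorData(errdata);
			MemoryContextSwitchTo(ecxt);
		}
		else
			PG_RE_THROW();
	}
	PG_END_TRY();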
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 51 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 34 +++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  8 ++--
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++-
 src/include/utils/snapmgr.h                     |  4 +-
 6 files changed, 119 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index fc4ad65..da6a6f3 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e954482..6ce7878 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1303,6 +1303,17 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_getnext call")));
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1422,6 +1433,16 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_fetch call")));
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1535,6 +1556,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_hot_search_buffer call")));
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1683,6 +1714,16 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_get_latest_tid call")));
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5481,6 +5522,16 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_finish_speculative call")));
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d..201acfb 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,17 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +525,17 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +662,17 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ca4b904..3143479 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -679,7 +679,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1529,7 +1529,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1780,7 +1780,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1800,7 +1800,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 						/*
 						 * Every time the CommandId is incremented, we could
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 47b0517..9fa1e43 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This is to re-check XID status while accessing catalog.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the transaction is not committed yet. We
+	 * don't check if the xid aborted; that will happen during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 67b07df..9a8f9ce 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1

#97Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#95)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:

On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:

On further testing, I found that the patch seems to have problems with
toast. Consider below scenario:
Session-1
Create table large_text(t1 text);
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

Session-2
SELECT * FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
*--kaboom*

The second statement in Session-2 leads to a crash.

OK, thanks for the report - will investigate.

It was an assertion failure in ReorderBufferCleanupTXN at below line:
+ /* Check we're not mixing changes from different transactions. */
+ Assert(change->txn == txn);

Can you still reproduce this issue with the patch I sent on 28/9? I have
been unable to trigger the failure, and it seems pretty similar to the
failure you reported (and I fixed) on 28/9.

Yes, it seems we need a similar change in ReorderBufferAddNewTupleCids. I think that in session-2 you need to create the replication slot before creating the table in session-1 to see this problem.

--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2196,6 +2196,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
change->data.tuplecid.cmax = cmax;
change->data.tuplecid.combocid = combocid;
change->lsn = lsn;
+       change->txn = txn;
change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
dlist_push_tail(&txn->tuplecids, &change->node);
Few more comments:
-----------------------------------
1.
+static bool
+check_logical_decoding_work_mem(int *newval, void **extra, GucSource source)
+{
+ /*
+ * -1 indicates fallback.
+ *
+ * If we haven't yet changed the boot_val default of -1, just let it be.
+ * logical decoding will look to maintenance_work_mem instead.
+ */
+ if (*newval == -1)
+ return true;
+
+ /*
+ * We clamp manually-set values to at least 64kB. The maintenance_work_mem
+ * uses a higher minimum value (1MB), so this is OK.
+ */
+ if (*newval < 64)
+ *newval = 64;

I think this needs to be changed now that we no longer rely on maintenance_work_mem. Another thing related to this is that the default value for logical_decoding_work_mem still seems to be -1; we need to make it 64MB. I noticed this while debugging the memory accounting changes, and I think it is the reason I was not seeing toast-related changes being serialized: in that test, I hadn't changed the default value of logical_decoding_work_mem.

2.
+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */

/going modify/going to modify/

3.
+ *
+ * While updating the existing change with detoasted tuple data, we need to
+ * update the memory accounting info, because the change size will differ.
+ * Otherwise the accounting may get out of sync, triggering serialization
+ * at unexpected times.
+ *
+ * We simply subtract size of the change before rejiggering the tuple, and
+ * then adding the new size. This makes it look like the change was removed
+ * and then added back, except it only tweaks the accounting info.
+ *
+ * In particular it can't trigger serialization, which would be pointless
+ * anyway as it happens during commit processing right before handing
+ * the change to the output plugin.
*/
static void
ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
if (txn->toast_hash == NULL)
return;
+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */
+ ReorderBufferChangeMemoryUpdate(rb, change, false);

It is not very clear why this change is required. Basically, this is done at commit time, after which we shouldn't attempt to spill these changes. This is mentioned in the comments as well, but if that is the case, it is not clear how and when the accounting can create a problem. If possible, can you explain it with an example?

IIUC, we are keeping track of the memory in the ReorderBuffer, which is
common across transactions. So even if this transaction is committing and
will not spill to disk, we need to keep the memory accounting correct for
future changes in other transactions.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#98Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#97)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Oct 13, 2019 at 12:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3.
+ *
+ * While updating the existing change with detoasted tuple data, we need to
+ * update the memory accounting info, because the change size will differ.
+ * Otherwise the accounting may get out of sync, triggering serialization
+ * at unexpected times.
+ *
+ * We simply subtract size of the change before rejiggering the tuple, and
+ * then adding the new size. This makes it look like the change was removed
+ * and then added back, except it only tweaks the accounting info.
+ *
+ * In particular it can't trigger serialization, which would be pointless
+ * anyway as it happens during commit processing right before handing
+ * the change to the output plugin.
*/
static void
ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
if (txn->toast_hash == NULL)
return;
+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */
+ ReorderBufferChangeMemoryUpdate(rb, change, false);

It is not very clear why this change is required. Basically, this is done at commit time, after which we shouldn't attempt to spill these changes. This is mentioned in the comments as well, but if that is the case, it is not clear how and when the accounting can create a problem. If possible, can you explain it with an example?

IIUC, we are keeping track of the memory in the ReorderBuffer, which is
common across transactions. So even if this transaction is committing and
will not spill to disk, we need to keep the memory accounting correct for
future changes in other transactions.

You are right. I somehow missed that we need to keep the size
computation in sync even during commit for other in-progress
transactions in the ReorderBuffer. You can ignore this point or maybe
slightly adjust the comment to make it explicit.
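
Perhaps something along these lines (wording just a suggestion):

	/*
	 * Update the accounting info. Even though this transaction is at
	 * commit and won't be spilled anymore, rb->size is shared by all
	 * transactions in the reorder buffer, so leaving it stale could
	 * trigger (or skip) serialization of other in-progress transactions
	 * at the wrong time.
	 */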

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#99Craig Ringer
craig@2ndquadrant.com
In reply to: Amit Kapila (#98)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, 13 Oct 2019 at 19:50, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Oct 13, 2019 at 12:25 PM Dilip Kumar <dilipbalaut@gmail.com>
wrote:

On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3.
+ *
+ * While updating the existing change with detoasted tuple data, we need to
+ * update the memory accounting info, because the change size will differ.
+ * Otherwise the accounting may get out of sync, triggering serialization
+ * at unexpected times.
+ *
+ * We simply subtract size of the change before rejiggering the tuple, and
+ * then adding the new size. This makes it look like the change was removed
+ * and then added back, except it only tweaks the accounting info.
+ *
+ * In particular it can't trigger serialization, which would be pointless
+ * anyway as it happens during commit processing right before handing
+ * the change to the output plugin.
 */
static void
ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
if (txn->toast_hash == NULL)
return;

+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */
+ ReorderBufferChangeMemoryUpdate(rb, change, false);

It is not very clear why this change is required. Basically, this is done
at commit time, after which we shouldn't attempt to spill these changes.
This is mentioned in the comments as well, but if that is the case, it is
not clear how and when the accounting can create a problem. If possible,
can you explain it with an example?

IIUC, we are keeping track of the memory in the ReorderBuffer, which is
common across transactions. So even if this transaction is committing and
will not spill to disk, we need to keep the memory accounting correct for
future changes in other transactions.

You are right. I somehow missed that we need to keep the size
computation in sync even during commit for other in-progress
transactions in the ReorderBuffer. You can ignore this point or maybe
slightly adjust the comment to make it explicit.

Does anyone object if we add the reorder buffer total size & in-memory size
to struct WalSnd too, so we can report it in pg_stat_replication?

I can follow up with a patch to add on top of this one if you think it's
reasonable. I'll also take the opportunity to add a number of tracepoints
across the walsender and logical decoding, since right now it's very opaque
in production systems ... and everyone just LOVES hunting down debug syms
and attaching gdb to production DBs.
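
For the tracepoints I have in mind the usual probes.d mechanism; a rough
sketch (the probe name and arguments are invented for illustration):

	/* src/backend/utils/probes.d */
	probe reorder__buffer__spill(unsigned int, unsigned long);

	/* at the spill site, e.g. in ReorderBufferSerializeTXN() */
	TRACE_POSTGRESQL_REORDER_BUFFER_SPILL(txn->xid, txn->size);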

--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise

#100Amit Kapila
amit.kapila16@gmail.com
In reply to: Craig Ringer (#99)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Oct 14, 2019 at 6:51 AM Craig Ringer <craig@2ndquadrant.com> wrote:

On Sun, 13 Oct 2019 at 19:50, Amit Kapila <amit.kapila16@gmail.com> wrote:

Does anyone object if we add the reorder buffer total size & in-memory size to struct WalSnd too, so we can report it in pg_stat_replication?

There is already a patch
(0011-Track-statistics-for-streaming-spilling) in this series, posted
by Tomas [1], which tracks important statistics in WalSnd; I think those
are good enough. Have you checked that? I am not sure if adding the
additional size will help, but I might be missing something.

I can follow up with a patch to add on top of this one if you think it's reasonable. I'll also take the opportunity to add a number of tracepoints across the walsender and logical decoding, since right now it's very opaque in production systems ... and everyone just LOVES hunting down debug syms and attaching gdb to production DBs.

Sure, adding tracepoints can be helpful, but isn't it better to start
that as a separate thread?

[1]: /messages/by-id/20190928190917.hrpknmq76v3ts3lj@development

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#101Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#93)
3 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:

On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:

On further testing, I found that the patch seems to have problems with
toast. Consider below scenario:
Session-1
Create table large_text(t1 text);
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

Session-2
SELECT * FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
*--kaboom*

The second statement in Session-2 leads to a crash.

OK, thanks for the report - will investigate.

It was an assertion failure in ReorderBufferCleanupTXN at below line:
+ /* Check we're not mixing changes from different transactions. */
+ Assert(change->txn == txn);

Can you still reproduce this issue with the patch I sent on 28/9? I have
been unable to trigger the failure, and it seems pretty similar to the
failure you reported (and I fixed) on 28/9.

Other than that, I am not sure if the changes related to spilling to disk
after logical_decoding_work_mem work for toast tables, as I couldn't hit
that code for the toast table case, but I might be missing something. As
mentioned previously, I feel there should be some way to test whether this
patch works for the cases it claims to work for. As of now, I have to check
via debugging. Let me know if there is any way I can test this.

That's one of the reasons why I proposed to move the statistics (which
say how many transactions / bytes were spilled to disk) from a later
patch in the series. I don't think there's a better way.

I like that idea, but I think you need to split that patch to only get the
stats related to the spill. It would be easier to review if you can
prepare that on top of
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.

Sure, I wasn't really proposing to add all the stats from that patch,
including those related to streaming. We need to extract just those
related to spilling. And yes, it needs to be moved right after 0001.

I have extracted the spilling-related code into a separate patch on top
of 0001. I have also fixed some bugs and addressed review comments, and
attached those changes as a separate patch. Later I can merge them into
the main patch if you agree with the changes.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0002-Track-statistics-for-spilling.patchapplication/octet-stream; name=0002-Track-statistics-for-spilling.patchDownload
From cc021dc5dba6bf0059595bc29388bb20ce49c405 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 11 Oct 2019 09:07:41 +0530
Subject: [PATCH 2/2] Track statistics for spilling

---
 doc/src/sgml/monitoring.sgml                    | 23 ++++++++++++++
 src/backend/catalog/system_views.sql            |  5 ++-
 src/backend/replication/logical/reorderbuffer.c | 10 ++++++
 src/backend/replication/walsender.c             | 42 +++++++++++++++++++++++--
 src/include/catalog/pg_proc.dat                 |  6 ++--
 src/include/replication/reorderbuffer.h         | 11 +++++++
 src/include/replication/walsender_private.h     |  5 +++
 src/test/regress/expected/rules.out             |  7 +++--
 8 files changed, 101 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 828e908..1965a8d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2121,6 +2121,29 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       with security-sensitive fields obfuscated.
      </entry>
     </row>
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_decoding_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times transactions were spilled to disk. Transactions
+      may get spilled repeatedly, and this counter gets incremented on every
+      such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded transaction data spilled to disk.
+      </entry>
+    </row>
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 9fe4a47..2ee2a06 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -776,7 +776,10 @@ CREATE VIEW pg_stat_replication AS
             W.replay_lag,
             W.sync_priority,
             W.sync_state,
-            W.reply_time
+            W.reply_time,
+            W.spill_txns,
+            W.spill_count,
+            W.spill_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 6228140..e9d57b4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -308,6 +308,10 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	buffer->spillCount = 0;
+	buffer->spillTxns = 0;
+	buffer->spillBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -2414,6 +2418,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	int			fd = -1;
 	XLogSegNo	curOpenSegNo = 0;
 	Size		spilled = 0;
+	Size		size = txn->size;
 
 	elog(DEBUG2, "spill %u changes in XID %u to disk",
 		 (uint32) txn->nentries_mem, txn->xid);
@@ -2472,6 +2477,11 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		spilled++;
 	}
 
+	/* update the statistics */
+	rb->spillCount += 1;
+	rb->spillTxns += txn->serialized ? 1 : 0;
+	rb->spillBytes += size;
+
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index eb4a98c..d74f6f8 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -248,6 +248,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
+static void UpdateSpillStats(LogicalDecodingContext *ctx);
 static void XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count);
 
 
@@ -1261,7 +1262,8 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
 /*
  * LogicalDecodingContext 'update_progress' callback.
  *
- * Write the current position to the lag tracker (see XLogSendPhysical).
+ * Write the current position to the lag tracker (see XLogSendPhysical),
+ * and update the spill statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1280,6 +1282,11 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 
 	LagTrackerWrite(lsn, now);
 	sendTime = now;
+
+	/*
+	 * Update statistics about transactions that spilled to disk.
+	 */
+	UpdateSpillStats(ctx);
 }
 
 /*
@@ -2319,6 +2326,9 @@ InitWalSenderSlot(void)
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			walsnd->replyTime = 0;
+			walsnd->spillTxns = 0;
+			walsnd->spillCount = 0;
+			walsnd->spillBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3230,7 +3240,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	12
+#define PG_STAT_GET_WAL_SENDERS_COLS	15
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3285,6 +3295,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int			pid;
 		WalSndState state;
 		TimestampTz replyTime;
+		int64		spillTxns;
+		int64		spillCount;
+		int64		spillBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3305,6 +3318,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		replyTime = walsnd->replyTime;
+		spillTxns = walsnd->spillTxns;
+		spillCount = walsnd->spillCount;
+		spillBytes = walsnd->spillBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3386,6 +3402,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[11] = true;
 			else
 				values[11] = TimestampTzGetDatum(replyTime);
+
+			/* spill to disk */
+			values[12] = Int64GetDatum(spillTxns);
+			values[13] = Int64GetDatum(spillCount);
+			values[14] = Int64GetDatum(spillBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3622,3 +3643,20 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+static void
+UpdateSpillStats(LogicalDecodingContext *ctx)
+{
+	ReorderBuffer *rb = ctx->reorder;
+
+	SpinLockAcquire(&MyWalSnd->mutex);
+
+	MyWalSnd->spillTxns = rb->spillTxns;
+	MyWalSnd->spillCount = rb->spillCount;
+	MyWalSnd->spillBytes = rb->spillBytes;
+
+	SpinLockRelease(&MyWalSnd->mutex);
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %ld %ld %ld",
+		 rb, rb->spillTxns, rb->spillCount, rb->spillBytes);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 58ea5b9..fa0a2a1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5166,9 +5166,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4dcef80..ba7f9f0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -402,6 +402,17 @@ struct ReorderBuffer
 
 	/* memory accounting */
 	Size		size;
+
+	/*
+	 * Statistics about transactions spilled to disk.
+	 *
+	 * A single transaction may be spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 */
+	int64	spillCount;		/* spill-to-disk invocation counter */
+	int64	spillTxns;		/* number of transactions spilled to disk  */
+	int64	spillBytes;		/* amount of data spilled to disk */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0dd6d1c..a6b3205 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -80,6 +80,11 @@ typedef struct WalSnd
 	 * Timestamp of the last message received from standby.
 	 */
 	TimestampTz replyTime;
+
+	/* Statistics for transactions spilled to disk. */
+	int64		spillTxns;
+	int64		spillCount;
+	int64		spillBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 210e9cd..750bdc4 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1951,9 +1951,12 @@ pg_stat_replication| SELECT s.pid,
     w.replay_lag,
     w.sync_priority,
     w.sync_state,
-    w.reply_time
+    w.reply_time,
+    w.spill_txns,
+    w.spill_count,
+    w.spill_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1

0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.patchapplication/octet-stream; name=0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.patchDownload
From 4bfca35b149a303779bee49d96e3be25b914478f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:04:54 +0200
Subject: [PATCH 1/2] Add logical_decoding_work_mem to limit ReorderBuffer
 memory usage

Instead of deciding to serialize a transaction merely based on the
number of changes in that xact (toplevel or subxact), this makes
the decisions based on amount of memory consumed by the changes.

The memory limit is defined by a new logical_decoding_work_mem GUC,
so for example we can do this

    SET logical_decoding_work_mem = '128kB'

to trigger very aggressive streaming. The minimum value is 64kB.

When adding a change to a transaction, we account for the size in
two places. Firstly, in the ReorderBuffer, which is then used to
decide if we reached the total memory limit. And secondly in the
transaction the change belongs to, so that we can pick the largest
transaction to evict (and serialize to disk).

We still use max_changes_in_memory when loading changes serialized
to disk. The trouble is we can't use the memory limit directly as
there might be multiple subxact serialized, we need to read all of
them but we don't know how many are there (and which subxact to
read first).

We do not serialize the ReorderBufferTXN entries, so if there is a
transaction with many subxacts, most memory may be in this type of
objects. Those records are not included in the memory accounting.

We also do not account for INTERNAL_TUPLECID changes, which are
kept in a separate list and not evicted from memory. Transactions
with many CTID changes may consume significant amounts of memory,
but we can't really do much about that.

The current eviction algorithm is very simple - the transaction is
picked merely by size, while it might be useful to also consider age
(LSN) of the changes for example. With the new Generational memory
allocator, evicting the oldest changes would make it more likely
the memory gets actually pfreed.

The logical_decoding_work_mem may be set either in postgresql.conf,
in which case it serves as the default for all publishers on that
instance, or when creating the subscription, using a work_mem
parameter in the WITH clause (specifies the number of kilobytes).
---
 doc/src/sgml/config.sgml                           |  21 ++
 doc/src/sgml/ref/create_subscription.sgml          |  12 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  44 +++-
 .../libpqwalreceiver/libpqwalreceiver.c            |   3 +
 src/backend/replication/logical/reorderbuffer.c    | 292 ++++++++++++++++++++-
 src/backend/replication/logical/worker.c           |   1 +
 src/backend/replication/pgoutput/pgoutput.c        |  30 ++-
 src/backend/utils/misc/guc.c                       |  36 +++
 src/backend/utils/misc/postgresql.conf.sample      |   1 +
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/replication/reorderbuffer.h            |  16 ++
 src/include/replication/walreceiver.h              |   1 +
 13 files changed, 441 insertions(+), 20 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 47b12c6..f1d13a0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1716,6 +1716,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..91790b0 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index afee283..7e3ba8e 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -71,6 +71,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 2e67a58..d85e831 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -66,7 +66,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -97,6 +98,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -182,6 +185,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -325,6 +338,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -341,7 +356,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -419,6 +434,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -682,10 +703,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -710,6 +734,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -721,7 +752,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -759,7 +791,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -796,7 +828,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 6eba08a..65b3266 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8ce28ad..6228140 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -49,6 +49,34 @@
  *	  GenerationContext for the variable-length transaction data (allocated
  *	  and freed in groups with similar lifespan).
  *
+ *	  To limit the amount of memory used by decoded changes, we track memory
+ *	  used at the reorder buffer level (i.e. total amount of memory), and for
+ *	  each toplevel transaction. When the total amount of used memory exceeds
+ *	  the limit, the toplevel transaction consuming the most memory is then
+ *	  serialized to disk.
+ *
+ *	  Only decoded changes are evicted from memory (spilled to disk), not the
+ *	  transaction records. The number of toplevel transactions is limited,
+ *	  but a transaction with many subtransactions may still consume significant
+ *	  amounts of memory. The transaction records are fairly small, though, and
+ *	  are not included in the memory limit.
+ *
+ *	  The current eviction algorithm is very simple - the transaction is
+ *	  picked merely by size, while it might be useful to also consider age
+ *	  (LSN) of the changes for example. With the new Generational memory
+ *	  allocator, evicting the oldest changes would make it more likely the
+ *	  memory gets actually freed.
+ *
+ *	  We still rely on max_changes_in_memory when loading serialized changes
+ *	  back into memory. At that point we can't use the memory limit directly
+ *	  as we load the subxacts independently. One option to deal with this
+ *	  would be to count the subxacts, and allow each to allocate 1/N of the
+ *	  memory limit. That however does not seem very appealing, because with
+ *	  many subtransactions it may easily cause thrashing (short cycles of
+ *	  deserializing and applying very few changes). We probably should give
+ *	  a bit more memory to the oldest subtransactions, because it's likely
+ *	  the source for the next sequence of changes.
+ *
  * -------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -154,7 +182,8 @@ typedef struct ReorderBufferDiskChange
  * resource management here, but it's not entirely clear what that would look
  * like.
  */
-static const Size max_changes_in_memory = 4096;
+int			logical_decoding_work_mem;
+static const Size max_changes_in_memory = 4096; /* XXX for restore only */
 
 /* ---------------------------------------
  * primary reorderbuffer support routines
@@ -189,7 +218,7 @@ static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTX
  * Disk serialization support functions
  * ---------------------------------------
  */
-static void ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferCheckMemoryLimit(ReorderBuffer *rb);
 static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 										 int fd, ReorderBufferChange *change);
@@ -217,6 +246,14 @@ static void ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
 										  Relation relation, ReorderBufferChange *change);
 
+/*
+ * ---------------------------------------
+ * memory accounting
+ * ---------------------------------------
+ */
+static Size ReorderBufferChangeSize(ReorderBufferChange *change);
+static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
+								ReorderBufferChange *change, bool addition);
 
 /*
  * Allocate a new ReorderBuffer and clean out any old serialized state from
@@ -269,6 +306,7 @@ ReorderBufferAllocate(void)
 
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
+	buffer->size = 0;
 
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
@@ -374,6 +412,9 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 void
 ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 {
+	/* update memory accounting info */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
 	/* free contained data */
 	switch (change->action)
 	{
@@ -585,12 +626,18 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	change->lsn = lsn;
+	change->txn = txn;
+
 	Assert(InvalidXLogRecPtr != lsn);
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries++;
 	txn->nentries_mem++;
 
-	ReorderBufferCheckSerializeTXN(rb, txn);
+	/* update memory accounting information */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
+
+	/* check the memory limits and evict something if needed */
+	ReorderBufferCheckMemoryLimit(rb);
 }
 
 /*
@@ -1217,6 +1264,9 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
 		ReorderBufferReturnChange(rb, change);
 	}
 
@@ -1229,7 +1279,11 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferChange *change;
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
 		ReorderBufferReturnChange(rb, change);
 	}
 
@@ -2082,9 +2136,48 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferQueueChange(rb, xid, lsn, change);
 }
 
+/*
+ * Update the memory accounting info. We track memory used by the whole
+ * reorder buffer and the transaction containing the change.
+ */
+static void
+ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
+								ReorderBufferChange *change,
+								bool addition)
+{
+	Size		sz;
+
+	Assert(change->txn);
+
+	/*
+	 * Ignore tuple CID changes, because those are not evicted when
+	 * reaching the memory limit. So we just don't count them, because it
+	 * might easily trigger a pointless attempt to spill/stream.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+		return;
+
+	sz = ReorderBufferChangeSize(change);
+
+	if (addition)
+	{
+		change->txn->size += sz;
+		rb->size += sz;
+	}
+	else
+	{
+		Assert((rb->size >= sz) && (change->txn->size >= sz));
+		change->txn->size -= sz;
+		rb->size -= sz;
+	}
+}
 
 /*
  * Add new (relfilenode, tid) -> (cmin, cmax) mappings.
+ *
+ * We do not include this change type in memory accounting, because we
+ * keep CIDs in a separate list and do not evict them when reaching
+ * the memory limit.
  */
 void
 ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
@@ -2230,20 +2323,84 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 }
 
 /*
- * Check whether the transaction tx should spill its data to disk.
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ *
+ * XXX With many subtransactions this might be quite slow, because we'll have
+ * to walk through all of them. There are some options for how we could improve
+ * that: (a) maintaining some secondary structure with transactions sorted by
+ * the amount of changes, (b) not looking for the single largest transaction,
+ * but e.g. for a transaction using at least some fraction of the memory limit,
+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
+ * of the memory limit (e.g. 50%).
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)
+{
+	HASH_SEQ_STATUS hash_seq;
+	ReorderBufferTXNByIdEnt	*ent;
+	ReorderBufferTXN *largest = NULL;
+
+	hash_seq_init(&hash_seq, rb->by_txn);
+	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	{
+		ReorderBufferTXN *txn = ent->txn;
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
+ * Check whether the logical_decoding_work_mem limit was reached, and if yes
+ * pick the transaction to evict and spill the changes to disk.
+ *
+ * XXX At this point we select just a single (largest) transaction, but
+ * we might also adapt a more elaborate eviction strategy - for example
+ * evicting enough transactions to free certain fraction (e.g. 50%) of
+ * the memory limit.
  */
 static void
-ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
+	ReorderBufferTXN *txn;
+
+	/* bail out if we haven't exceeded the memory limit */
+	if (rb->size < logical_decoding_work_mem * 1024L)
+		return;
+
 	/*
-	 * TODO: improve accounting so we cheaply can take subtransactions into
-	 * account here.
+	 * Pick the largest transaction (or subtransaction) and evict it from
+	 * memory by serializing it to disk.
 	 */
-	if (txn->nentries_mem >= max_changes_in_memory)
-	{
-		ReorderBufferSerializeTXN(rb, txn);
-		Assert(txn->nentries_mem == 0);
-	}
+	txn = ReorderBufferLargestTXN(rb);
+
+	ReorderBufferSerializeTXN(rb, txn);
+
+	/*
+	 * After eviction, the transaction should have no entries in memory, and
+	 * should use 0 bytes for changes.
+	 */
+	Assert(txn->size == 0);
+	Assert(txn->nentries_mem == 0);
+
+	/*
+	 * And furthermore, evicting the transaction should get us below the
+	 * memory limit again - it is not possible that we're still exceeding the
+	 * memory limit after evicting the transaction.
+	 *
+	 * This follows from the simple fact that the selected transaction is at
+	 * least as large as the most recent change (which caused us to go over
+	 * the memory limit). So by evicting it we're definitely back below the
+	 * memory limit.
+	 */
+	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
 /*
@@ -2513,6 +2670,84 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 }
 
 /*
+ * Size of a change in memory.
+ */
+static Size
+ReorderBufferChangeSize(ReorderBufferChange *change)
+{
+	Size		sz = sizeof(ReorderBufferChange);
+
+	switch (change->action)
+	{
+			/* fall through these, they're all similar enough */
+		case REORDER_BUFFER_CHANGE_INSERT:
+		case REORDER_BUFFER_CHANGE_UPDATE:
+		case REORDER_BUFFER_CHANGE_DELETE:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+			{
+				ReorderBufferTupleBuf *oldtup,
+						   *newtup;
+				Size		oldlen = 0;
+				Size		newlen = 0;
+
+				oldtup = change->data.tp.oldtuple;
+				newtup = change->data.tp.newtuple;
+
+				if (oldtup)
+				{
+					sz += sizeof(HeapTupleData);
+					oldlen = oldtup->tuple.t_len;
+					sz += oldlen;
+				}
+
+				if (newtup)
+				{
+					sz += sizeof(HeapTupleData);
+					newlen = newtup->tuple.t_len;
+					sz += newlen;
+				}
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_MESSAGE:
+			{
+				Size		prefix_size = strlen(change->data.msg.prefix) + 1;
+
+				sz += prefix_size + change->data.msg.message_size +
+					sizeof(Size) + sizeof(Size);
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+			{
+				Snapshot	snap;
+
+				snap = change->data.snapshot;
+
+				sz += sizeof(SnapshotData) +
+					sizeof(TransactionId) * snap->xcnt +
+					sizeof(TransactionId) * snap->subxcnt;
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			{
+				sz += sizeof(Oid) * change->data.truncate.nrelids;
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+			/* ReorderBufferChange contains everything important */
+			break;
+	}
+
+	return sz;
+}
+
+
+/*
  * Restore a number of changes spilled to disk back into memory.
  */
 static Size
@@ -2784,6 +3019,16 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries_mem++;
+
+	/*
+	 * Update memory accounting for the restored change.  We need to do this
+	 * although we don't check the memory limit when restoring the changes in
+	 * this branch (we only do that when initially queueing the changes after
+	 * decoding), because we will release the changes later, and that will
+	 * update the accounting too (subtracting the size from the counters).
+	 * And we don't want to underflow there.
+	 */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
 }
 
 /*
@@ -3003,6 +3248,19 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
  *
  * We cannot replace unchanged toast tuples though, so those will still point
  * to on-disk toast data.
+ *
+ * While updating the existing change with detoasted tuple data, we need to
+ * update the memory accounting info, because the change size will differ.
+ * Otherwise the accounting may get out of sync, triggering serialization
+ * at unexpected times.
+ *
+ * We simply subtract size of the change before rejiggering the tuple, and
+ * then adding the new size. This makes it look like the change was removed
+ * and then added back, except it only tweaks the accounting info.
+ *
+ * In particular it can't trigger serialization, which would be pointless
+ * anyway as it happens during commit processing right before handing
+ * the change to the output plugin.
  */
 static void
 ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	if (txn->toast_hash == NULL)
 		return;
 
+	/*
+	 * We're going modify the size of the change, so to make sure the
+	 * accounting is correct we'll make it look like we're removing the
+	 * change now (with the old size), and then re-add it at the end.
+	 */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
 	oldcontext = MemoryContextSwitchTo(rb->context);
 
 	/* we should only have toast tuples in an INSERT or UPDATE */
@@ -3172,6 +3437,9 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	pfree(isnull);
 
 	MemoryContextSwitchTo(oldcontext);
+
+	/* now add the change back, with the correct size */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
 }
 
 /*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 11e6331..f737afb 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1725,6 +1725,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c08757..317c5d4 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -21,6 +21,7 @@
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
 
+#include "utils/guc.h"
 #include "utils/inval.h"
 #include "utils/int8.h"
 #include "utils/memutils.h"
@@ -90,11 +91,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -140,6 +142,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -174,7 +199,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2178e1c..5d7e687 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -65,6 +65,7 @@
 #include "postmaster/postmaster.h"
 #include "postmaster/syslogger.h"
 #include "postmaster/walwriter.h"
+#include "replication/logical.h"
 #include "replication/logicallauncher.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
@@ -191,6 +192,7 @@ static bool check_maxconnections(int *newval, void **extra, GucSource source);
 static bool check_max_worker_processes(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource source);
 static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
+static bool check_logical_decoding_work_mem(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
@@ -2251,6 +2253,18 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
+			gettext_noop("Sets the maximum memory to be used for logical decoding."),
+			gettext_noop("This much memory can be used by each internal "
+						 "reorder buffer before spilling to disk or streaming."),
+			GUC_UNIT_KB
+		},
+		&logical_decoding_work_mem,
+		-1, -1, MAX_KILOBYTES,
+		check_logical_decoding_work_mem, NULL, NULL
+	},
+
 	/*
 	 * We use the hopefully-safely-small value of 100kB as the compiled-in
 	 * default for max_stack_depth.  InitializeGUCOptions will increase it if
@@ -11286,6 +11300,28 @@ check_max_wal_senders(int *newval, void **extra, GucSource source)
 }
 
 static bool
+check_logical_decoding_work_mem(int *newval, void **extra, GucSource source)
+{
+	/*
+	 * -1 indicates fallback.
+	 *
+	 * If we haven't yet changed the boot_val default of -1, just let it be.
+	 * logical decoding will look to maintenance_work_mem instead.
+	 */
+	if (*newval == -1)
+		return true;
+
+	/*
+	 * We clamp manually-set values to at least 64kB. The maintenance_work_mem
+	 * uses a higher minimum value (1MB), so this is OK.
+	 */
+	if (*newval < 64)
+		*newval = 64;
+
+	return true;
+}
+
+static bool
 check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
 {
 	/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0fc23e3..00a22b8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -130,6 +130,7 @@
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
+#logical_decoding_work_mem = 64MB	# min 1MB, or -1 to use maintenance_work_mem
 #max_stack_depth = 2MB			# min 100kB
 #shared_memory_type = mmap		# the default is the first option
 					# supported by the operating system:
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3cb13d8..10ea113 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4c06a78..4dcef80 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -17,6 +17,8 @@
 #include "utils/snapshot.h"
 #include "utils/timestamp.h"
 
+extern PGDLLIMPORT	int	logical_decoding_work_mem;
+
 /* an individual tuple, stored in one chunk of memory */
 typedef struct ReorderBufferTupleBuf
 {
@@ -63,6 +65,9 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
+/* forward declaration */
+struct ReorderBufferTXN;
+
 /*
  * a single 'change', can be an insert (with one tuple), an update (old, new),
  * or a delete (old).
@@ -77,6 +82,9 @@ typedef struct ReorderBufferChange
 	/* The type of change. */
 	enum ReorderBufferChangeType action;
 
+	/* Transaction this change belongs to. */
+	struct ReorderBufferTXN *txn;
+
 	RepOriginId origin_id;
 
 	/*
@@ -286,6 +294,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * Size of this transaction (changes currently in memory, in bytes).
+	 */
+	Size		size;
+
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -386,6 +399,9 @@ struct ReorderBuffer
 	/* buffer for disk<->memory conversions */
 	char	   *outbuf;
 	Size		outbufsize;
+
+	/* memory accounting */
+	Size		size;
 };
 
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e12a934..4e68a69 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -162,6 +162,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
1.8.3.1

bugs_and_review_comments_fix.patch
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e9d57b4..5c69359 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2200,6 +2200,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->data.tuplecid.cmax = cmax;
 	change->data.tuplecid.combocid = combocid;
 	change->lsn = lsn;
+	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
@@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* update the statistics */
 	rb->spillCount += 1;
-	rb->spillTxns += txn->serialized ? 1 : 0;
+	rb->spillTxns += txn->serialized ? 0 : 1;
 	rb->spillBytes += size;
 
 	Assert(spilled == txn->nentries_mem);
@@ -3292,7 +3293,7 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		return;
 
 	/*
-	 * We're going modify the size of the change, so to make sure the
+	 * We're going to modify the size of the change, so to make sure the
 	 * accounting is correct we'll make it look like we're removing the
 	 * change now (with the old size), and then re-add it at the end.
 	 */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5d7e687..c7252cf 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -192,7 +192,6 @@ static bool check_maxconnections(int *newval, void **extra, GucSource source);
 static bool check_max_worker_processes(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource source);
 static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
-static bool check_logical_decoding_work_mem(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
@@ -2261,8 +2260,8 @@ static struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_KB
 		},
 		&logical_decoding_work_mem,
-		-1, -1, MAX_KILOBYTES,
-		check_logical_decoding_work_mem, NULL, NULL
+		65536, 64, MAX_KILOBYTES,
+		NULL, NULL, NULL
 	},
 
 	/*
@@ -11300,28 +11299,6 @@ check_max_wal_senders(int *newval, void **extra, GucSource source)
 }
 
 static bool
-check_logical_decoding_work_mem(int *newval, void **extra, GucSource source)
-{
-	/*
-	 * -1 indicates fallback.
-	 *
-	 * If we haven't yet changed the boot_val default of -1, just let it be.
-	 * logical decoding will look to maintenance_work_mem instead.
-	 */
-	if (*newval == -1)
-		return true;
-
-	/*
-	 * We clamp manually-set values to at least 64kB. The maintenance_work_mem
-	 * uses a higher minimum value (1MB), so this is OK.
-	 */
-	if (*newval < 64)
-		*newval = 64;
-
-	return true;
-}
-
-static bool
 check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
 {
 	/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 00a22b8..04529aa 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -130,7 +130,7 @@
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
-#logical_decoding_work_mem = 64MB	# min 1MB, or -1 to use maintenance_work_mem
+i#logical_decoding_work_mem = 64MB	# min 64kB
 #max_stack_depth = 2MB			# min 100kB
 #shared_memory_type = mmap		# the default is the first option
 					# supported by the operating system:
#102Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#101)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Sure, I wasn't really proposing to add all stats from that patch,
including those related to streaming. We need to extract just those
related to spilling. And yes, it needs to be moved right after 0001.

I have extracted the spilling-related code into a separate patch on top
of 0001. I have also fixed some bugs and review comments and attached
the fixes as a separate patch. Later I can merge it into the main patch
if you agree with the changes.

A few comments
-------------------------
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
1.
+ {
+ {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
+ gettext_noop("Sets the maximum memory to be used for logical decoding."),
+ gettext_noop("This much memory can be used by each internal "
+ "reorder buffer before spilling to disk or streaming."),
+ GUC_UNIT_KB
+ },

I think we can remove 'or streaming' from the above sentence for now. We
can add it back with a later patch, where streaming will be allowed.

2.
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>

It is not clear why we need this parameter, at least with this patch.
I have raised this multiple times [1][2].

bugs_and_review_comments_fix
1.
},
  &logical_decoding_work_mem,
- -1, -1, MAX_KILOBYTES,
- check_logical_decoding_work_mem, NULL, NULL
+ 65536, 64, MAX_KILOBYTES,
+ NULL, NULL, NULL

I think the default value should be 1MB similar to
maintenance_work_mem. The same was true before this change.

2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
maintenance_work_mem
+i#logical_decoding_work_mem = 64MB # min 64kB

It seems the 'i' is a leftover character in the above change. Also,
change the default value considering the previous point.

3.
@@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

  /* update the statistics */
  rb->spillCount += 1;
- rb->spillTxns += txn->serialized ? 1 : 0;
+ rb->spillTxns += txn->serialized ? 0 : 1;
  rb->spillBytes += size;

Why is this change required? Shouldn't we increase the spillTxns
count only when the txn is serialized?

0002-Track-statistics-for-spilling
1.
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.
+      </entry>
+    </row>

The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem

2.
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times transactions were spilled to disk. Transactions
+      may get spilled repeatedly, and this counter gets incremented on every
+      such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded transaction data spilled to disk.
+      </entry>
+    </row>

In all the above cases, the explanation text starts immediately after
the <entry> tag, but the general coding practice is to start from the
next line; see the explanation of nearby parameters.

It seems these parameters are added in pg-stat-wal-receiver-view in
the docs, but in the code they are present as part of pg_stat_replication.
It seems the docs need to be updated. Am I missing something?

3.
ReorderBufferSerializeTXN()
{
..
/* update the statistics */
rb->spillCount += 1;
rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;

Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
txn->serialized = true;
..
}

I am not able to understand the above code. We are setting the
serialized parameter a few lines after we check it and increment the
spillTxns count. Can you please explain it?

Also, isn't the spillTxns count a bit confusing, because in some cases
it will include subtransactions and in other cases (where the largest
picked transaction is a subtransaction) it won't?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#103Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#102)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have replied to some of your questions inline. I will work on the
remaining comments and post an updated patch.

Sure, I wasn't really proposing to add all stats from that patch,
including those related to streaming. We need to extract just those
related to spilling. And yes, it needs to be moved right after 0001.

I have extracted the spilling-related code into a separate patch on top
of 0001. I have also fixed some bugs and review comments and attached
the fixes as a separate patch. Later I can merge it into the main patch
if you agree with the changes.

A few comments
-------------------------
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
1.
+ {
+ {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
+ gettext_noop("Sets the maximum memory to be used for logical decoding."),
+ gettext_noop("This much memory can be used by each internal "
+ "reorder buffer before spilling to disk or streaming."),
+ GUC_UNIT_KB
+ },

I think we can remove 'or streaming' from the above sentence for now. We
can add it back with a later patch, where streaming will be allowed.

2.
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
</para>
</listitem>
</varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>

It is not clear why we need this parameter, at least with this patch.
I have raised this multiple times [1][2].

bugs_and_review_comments_fix
1.
},
&logical_decoding_work_mem,
- -1, -1, MAX_KILOBYTES,
- check_logical_decoding_work_mem, NULL, NULL
+ 65536, 64, MAX_KILOBYTES,
+ NULL, NULL, NULL

I think the default value should be 1MB similar to
maintenance_work_mem. The same was true before this change.

2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
maintenance_work_mem
+i#logical_decoding_work_mem = 64MB # min 64kB

It seems the 'i' is a leftover character in the above change. Also,
change the default value considering the previous point.

3.
@@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

/* update the statistics */
rb->spillCount += 1;
- rb->spillTxns += txn->serialized ? 1 : 0;
+ rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;

Why is this change required? Shouldn't we increase the spillTxns
count only when the txn is serialized?

Prior to this change it was increasing rb->spillTxns every time we
tried to serialize the changes of the transaction. Now we only increase
it the first time, when the transaction is not yet serialized.

0002-Track-statistics-for-spilling
1.
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.
+      </entry>
+    </row>

The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem

2.
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times transactions were spilled to disk. Transactions
+      may get spilled repeatedly, and this counter gets incremented on every
+      such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded transaction data spilled to disk.
+      </entry>
+    </row>

In all the above cases, the explanation text starts immediately after
the <entry> tag, but the general coding practice is to start from the
next line; see the explanation of nearby parameters.

It seems these parameters are added in pg-stat-wal-receiver-view in
the docs, but in the code they are present as part of pg_stat_replication.
It seems the docs need to be updated. Am I missing something?

3.
ReorderBufferSerializeTXN()
{
..
/* update the statistics */
rb->spillCount += 1;
rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;

Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
txn->serialized = true;
..
}

I am not able to understand the above code. We are setting the
serialized parameter a few lines after we check it and increment the
spillTxns count. Can you please explain it?

Basically, the first time we attempt to serialize a transaction,
txn->serialized will be false; at that point we increment rb->spillTxns
and then set txn->serialized to true. From then onwards, if we try to
serialize the same transaction we do not increment rb->spillTxns, so
that we count each transaction only once.
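
To make that concrete, here is a minimal sketch of the intended logic
(an illustration only, not the exact patch code):

/* count the transaction only the first time it is serialized */
if (!txn->serialized)
	rb->spillTxns += 1;
rb->spillCount += 1;		/* incremented on every serialization */
rb->spillBytes += size;		/* bytes written by this serialization */
txn->serialized = true;

So a transaction that is spilled three times adds 3 to spillCount but
only 1 to spillTxns.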

Also, isn't the spillTxns count a bit confusing, because in some cases
it will include subtransactions and in other cases (where the largest
picked transaction is a subtransaction) it won't?

I did not completely understand your comment. Basically, for every
transaction that we serialize, we increase the count the first time,
right? Whether it is the main transaction or a sub-transaction.
Am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#104Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#103)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Oct 21, 2019 at 10:48 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3.
@@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

/* update the statistics */
rb->spillCount += 1;
- rb->spillTxns += txn->serialized ? 1 : 0;
+ rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;

Why is this change required? Shouldn't we increase the spillTxns
count only when the txn is serialized?

Prior to this change it was increasing rb->spillTxns every time we
tried to serialize the changes of the transaction. Now we only increase
it the first time, when the transaction is not yet serialized.

3.
ReorderBufferSerializeTXN()
{
..
/* update the statistics */
rb->spillCount += 1;
rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;

Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
txn->serialized = true;
..
}

I am not able to understand the above code. We are setting the
serialized parameter a few lines after we check it and increment the
spillTxns count. Can you please explain it?

Basically, the first time we attempt to serialize a transaction,
txn->serialized will be false; at that point we increment rb->spillTxns
and then set txn->serialized to true. From then onwards, if we try to
serialize the same transaction we do not increment rb->spillTxns, so
that we count each transaction only once.

Your explanation for both the above comments makes sense to me. Can
you please add some comments along these lines because it is not
apparent why one wants to increase the spillTxns counter when
txn->serialized is false?

Also, isn't the spillTxns count a bit confusing, because in some cases
it will include subtransactions and in other cases (where the largest
picked transaction is a subtransaction) it won't?

I did not completely understand your comment. Basically, for every
transaction that we serialize, we increase the count the first time,
right? Whether it is the main transaction or a sub-transaction.

It was not clear to me earlier whether we always increase the
spillTxns counter for subtransactions or not. But now, looking at the
code carefully, it is clear that it is getting increased in every
case. In short, you don't need to do anything for this comment.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#105Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#104)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Oct 21, 2019 at 2:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Oct 21, 2019 at 10:48 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3.
@@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

/* update the statistics */
rb->spillCount += 1;
- rb->spillTxns += txn->serialized ? 1 : 0;
+ rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;

Why is this change required? Shouldn't we increase the spillTxns
count only when the txn is serialized?

Prior to this change it was increasing rb->spillTxns every time we
tried to serialize the changes of the transaction. Now we only increase
it the first time, when the transaction is not yet serialized.

3.
ReorderBufferSerializeTXN()
{
..
/* update the statistics */
rb->spillCount += 1;
rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;

Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
txn->serialized = true;
..
}

I am not able to understand the above code. We are setting the
serialized parameter a few lines after we check it and increment the
spillTxns count. Can you please explain it?

Basically, the first time we attempt to serialize a transaction,
txn->serialized will be false; at that point we increment rb->spillTxns
and then set txn->serialized to true. From then onwards, if we try to
serialize the same transaction we do not increment rb->spillTxns, so
that we count each transaction only once.

Your explanation for both the above comments makes sense to me. Can
you please add some comments along these lines because it is not
apparent why one wants to increase the spillTxns counter when
txn->serialized is false?

Ok, I will add comments in the next patch.

Also, isn't the spillTxns count a bit confusing, because in some cases
it will include subtransactions and in other cases (where the largest
picked transaction is a subtransaction) it won't?

I did not completely understand your comment. Basically, for every
transaction that we serialize, we increase the count the first time,
right? Whether it is the main transaction or a sub-transaction.

It was not clear to me earlier whether we always increase the
spillTxns counter for subtransactions or not. But now, looking at the
code carefully, it is clear that it is getting increased in every
case. In short, you don't need to do anything for this comment.

ok

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#106Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#102)
3 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Sure, I wasn't really proposing to add all stats from that patch,
including those related to streaming. We need to extract just those
related to spilling. And yes, it needs to be moved right after 0001.

I have extracted the spilling-related code into a separate patch on top
of 0001. I have also fixed some bugs and review comments and attached
the fixes as a separate patch. Later I can merge it into the main patch
if you agree with the changes.

A few comments
-------------------------
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
1.
+ {
+ {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
+ gettext_noop("Sets the maximum memory to be used for logical decoding."),
+ gettext_noop("This much memory can be used by each internal "
+ "reorder buffer before spilling to disk or streaming."),
+ GUC_UNIT_KB
+ },

I think we can remove 'or streaming' from the above sentence for now. We
can add it back with a later patch, where streaming will be allowed.

Done

2.
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
</para>
</listitem>
</varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>

It is not clear why we need this parameter, at least with this patch.
I have raised this multiple times [1][2].

I have moved it out as a separate patch (0003), so that if we decide
we need it for streaming transactions then we can keep it.
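
With the 0003 patch, the limit could then be set per subscription,
roughly like this (just a sketch - the subscription, publication, and
connection string are made-up names, and work_mem takes kilobytes):

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (work_mem = 65536);	-- 64MB; falls back to logical_decoding_work_mem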

bugs_and_review_comments_fix
1.
},
&logical_decoding_work_mem,
- -1, -1, MAX_KILOBYTES,
- check_logical_decoding_work_mem, NULL, NULL
+ 65536, 64, MAX_KILOBYTES,
+ NULL, NULL, NULL

I think the default value should be 1MB similar to
maintenance_work_mem. The same was true before this change.

The default value for maintenance_work_mem is also 64MB. Did you mean the min value?
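
For reference, the fields in question in that config_int entry are the
boot (default), min, and max values, all in kilobytes because of
GUC_UNIT_KB; annotated below (annotations mine, values from the fix
patch):

&logical_decoding_work_mem,
65536, 64, MAX_KILOBYTES,	/* default 64MB, min 64kB */
NULL, NULL, NULL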

2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
maintenance_work_mem
+i#logical_decoding_work_mem = 64MB # min 64kB

It seems the 'i' is a leftover character in the above change. Also,
change the default value considering the previous point.

oops, fixed.

3.
@@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

/* update the statistics */
rb->spillCount += 1;
- rb->spillTxns += txn->serialized ? 1 : 0;
+ rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;

Why is this change required? Shouldn't we increase the spillTxns
count only when the txn is serialized?

Already agreed in the previous mail, so I have added comments.

0002-Track-statistics-for-spilling
1.
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.
+      </entry>
+    </row>

The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem

done

2.
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times transactions were spilled to disk. Transactions
+      may get spilled repeatedly, and this counter gets incremented on every
+      such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded transaction data spilled to disk.
+      </entry>
+    </row>

In all the above cases, the explanation text starts immediately after
the <entry> tag, but the general coding practice is to start from the
next line; see the explanation of nearby parameters.

It seems the existing style is mixed; for example, you can see
<entry>Timeline number of last write-ahead log location received and
flushed to disk, the initial value of this field being the timeline
number of the first log location used when WAL receiver is started
</entry>

It seems these parameters are added in pg-stat-wal-receiver-view in
the docs, but in the code they are present as part of pg_stat_replication.
It seems the docs need to be updated. Am I missing something?

Fixed

3.
ReorderBufferSerializeTXN()
{
..
/* update the statistics */
rb->spillCount += 1;
rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;

Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
txn->serialized = true;
..
}

I am not able to understand the above code. We are setting the
serialized parameter a few lines after we check it and increment the
spillTxns count. Can you please explain it?

Also, isn't the spillTxns count a bit confusing, because in some cases
it will include subtransactions and in other cases (where the largest
picked transaction is a subtransaction) it won't?

Already discussed in the last mail.

I have merged the bugs_and_review_comments_fix.patch changes into 0001 and 0002.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.patch
From e51774752ee74ea51509e469e236f6203e8efc01 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 21 Oct 2019 16:59:17 +0530
Subject: [PATCH 1/3] Add logical_decoding_work_mem to limit ReorderBuffer
 memory usage

Instead of deciding to serialize a transaction merely based on the
number of changes in that xact (toplevel or subxact), this makes
the decision based on the amount of memory consumed by the changes.

The memory limit is defined by a new logical_decoding_work_mem GUC,
so for example we can do this

    SET logical_decoding_work_mem = '128kB'

to trigger very aggressive streaming. The minimum value is 64kB.

When adding a change to a transaction, we account for the size in
two places. Firstly, in the ReorderBuffer, which is then used to
decide if we reached the total memory limit. And secondly in the
transaction the change belongs to, so that we can pick the largest
transaction to evict (and serialize to disk).

We still use max_changes_in_memory when loading changes serialized
to disk. The trouble is we can't use the memory limit directly as
there might be multiple subxacts serialized; we need to read all of
them, but we don't know how many there are (and which subxact to
read first).

We do not serialize the ReorderBufferTXN entries, so if there is a
transaction with many subxacts, most memory may be in this type of
object. Those records are not included in the memory accounting.

We also do not account for INTERNAL_TUPLECID changes, which are
kept in a separate list and not evicted from memory. Transactions
with many CTID changes may consume significant amounts of memory,
but we can't really do much about that.

The current eviction algorithm is very simple - the transaction is
picked merely by size, while it might be useful to also consider age
(LSN) of the changes for example. With the new Generational memory
allocator, evicting the oldest changes would make it more likely
the memory gets actually pfreed.

The logical_decoding_work_mem may be set either in postgresql.conf,
in which case it serves as the default for all publishers on that
instance, or when creating the subscription, using a work_mem
parameter in the WITH clause (which specifies a number of kilobytes).
---
 doc/src/sgml/config.sgml                        |  21 ++
 src/backend/replication/logical/reorderbuffer.c | 293 +++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                    |  13 ++
 src/backend/utils/misc/postgresql.conf.sample   |   1 +
 src/include/replication/reorderbuffer.h         |  16 ++
 5 files changed, 332 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 886632f..291b343 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1716,6 +1716,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 62e5424..8ed8bd8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -49,6 +49,34 @@
  *	  GenerationContext for the variable-length transaction data (allocated
  *	  and freed in groups with similar lifespan).
  *
+ *	  To limit the amount of memory used by decoded changes, we track memory
+ *	  used at the reorder buffer level (i.e. total amount of memory), and for
+ *	  each toplevel transaction. When the total amount of used memory exceeds
+ *	  the limit, the toplevel transaction consuming the most memory is then
+ *	  serialized to disk.
+ *
+ *	  Only decoded changes are evicted from memory (spilled to disk), not the
+ *	  transaction records. The number of toplevel transactions is limited,
+ *	  but a transaction with many subtransactions may still consume significant
+ *	  amounts of memory. The transaction records are fairly small, though, and
+ *	  are not included in the memory limit.
+ *
+ *	  The current eviction algorithm is very simple - the transaction is
+ *	  picked merely by size, while it might be useful to also consider age
+ *	  (LSN) of the changes for example. With the new Generational memory
+ *	  allocator, evicting the oldest changes would make it more likely the
+ *	  memory gets actually freed.
+ *
+ *	  We still rely on max_changes_in_memory when loading serialized changes
+ *	  back into memory. At that point we can't use the memory limit directly
+ *	  as we load the subxacts independently. One option to deal with this
+ *	  would be to count the subxacts, and allow each to allocate 1/N of the
+ *	  memory limit. That however does not seem very appealing, because with
+ *	  many subtransactions it may easily cause thrashing (short cycles of
+ *	  deserializing and applying very few changes). We probably should give
+ *	  a bit more memory to the oldest subtransactions, because it's likely
+ *	  the source for the next sequence of changes.
+ *
  * -------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -154,7 +182,8 @@ typedef struct ReorderBufferDiskChange
  * resource management here, but it's not entirely clear what that would look
  * like.
  */
-static const Size max_changes_in_memory = 4096;
+int			logical_decoding_work_mem;
+static const Size max_changes_in_memory = 4096; /* XXX for restore only */
 
 /* ---------------------------------------
  * primary reorderbuffer support routines
@@ -189,7 +218,7 @@ static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTX
  * Disk serialization support functions
  * ---------------------------------------
  */
-static void ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferCheckMemoryLimit(ReorderBuffer *rb);
 static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 										 int fd, ReorderBufferChange *change);
@@ -217,6 +246,14 @@ static void ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
 										  Relation relation, ReorderBufferChange *change);
 
+/*
+ * ---------------------------------------
+ * memory accounting
+ * ---------------------------------------
+ */
+static Size ReorderBufferChangeSize(ReorderBufferChange *change);
+static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
+								ReorderBufferChange *change, bool addition);
 
 /*
  * Allocate a new ReorderBuffer and clean out any old serialized state from
@@ -269,6 +306,7 @@ ReorderBufferAllocate(void)
 
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
+	buffer->size = 0;
 
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
@@ -374,6 +412,9 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 void
 ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 {
+	/* update memory accounting info */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
 	/* free contained data */
 	switch (change->action)
 	{
@@ -585,12 +626,18 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	change->lsn = lsn;
+	change->txn = txn;
+
 	Assert(InvalidXLogRecPtr != lsn);
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries++;
 	txn->nentries_mem++;
 
-	ReorderBufferCheckSerializeTXN(rb, txn);
+	/* update memory accounting information */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
+
+	/* check the memory limits and evict something if needed */
+	ReorderBufferCheckMemoryLimit(rb);
 }
 
 /*
@@ -1217,6 +1264,9 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
 		ReorderBufferReturnChange(rb, change);
 	}
 
@@ -1229,7 +1279,11 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferChange *change;
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
 		ReorderBufferReturnChange(rb, change);
 	}
 
@@ -2082,9 +2136,48 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferQueueChange(rb, xid, lsn, change);
 }
 
+/*
+ * Update the memory accounting info. We track memory used by the whole
+ * reorder buffer and the transaction containing the change.
+ */
+static void
+ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
+								ReorderBufferChange *change,
+								bool addition)
+{
+	Size		sz;
+
+	Assert(change->txn);
+
+	/*
+	 * Ignore tuple CID changes, because those are not evicted when
+	 * reaching the memory limit. If we counted them, we might easily
+	 * trigger pointless attempts to spill/stream.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+		return;
+
+	sz = ReorderBufferChangeSize(change);
+
+	if (addition)
+	{
+		change->txn->size += sz;
+		rb->size += sz;
+	}
+	else
+	{
+		Assert((rb->size >= sz) && (change->txn->size >= sz));
+		change->txn->size -= sz;
+		rb->size -= sz;
+	}
+}
 
 /*
  * Add new (relfilenode, tid) -> (cmin, cmax) mappings.
+ *
+ * We do not include this change type in memory accounting, because we
+ * keep CIDs in a separate list and do not evict them when reaching
+ * the memory limit.
  */
 void
 ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
@@ -2103,6 +2196,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->data.tuplecid.cmax = cmax;
 	change->data.tuplecid.combocid = combocid;
 	change->lsn = lsn;
+	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
@@ -2230,20 +2324,84 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 }
 
 /*
- * Check whether the transaction tx should spill its data to disk.
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ *
+ * XXX With many subtransactions this might be quite slow, because we'll have
+ * to walk through all of them. There are some options for improving that:
+ * (a) maintaining a secondary structure with transactions sorted by the
+ * amount of changes, (b) not looking for the single largest transaction,
+ * but e.g. for a transaction using at least some fraction of the memory
+ * limit, and (c) evicting multiple transactions at once, e.g. to free a
+ * given portion of the memory limit (e.g. 50%).
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)
+{
+	HASH_SEQ_STATUS hash_seq;
+	ReorderBufferTXNByIdEnt	*ent;
+	ReorderBufferTXN *largest = NULL;
+
+	hash_seq_init(&hash_seq, rb->by_txn);
+	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	{
+		ReorderBufferTXN *txn = ent->txn;
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
+ * Check whether the logical_decoding_work_mem limit was reached, and if yes
+ * pick the transaction to evict and spill the changes to disk.
+ *
+ * XXX At this point we select just a single (largest) transaction, but
+ * we might also adopt a more elaborate eviction strategy - for example
+ * evicting enough transactions to free a certain fraction (e.g. 50%) of
+ * the memory limit.
  */
 static void
-ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
+	ReorderBufferTXN *txn;
+
+	/* bail out if we haven't exceeded the memory limit */
+	if (rb->size < logical_decoding_work_mem * 1024L)
+		return;
+
 	/*
-	 * TODO: improve accounting so we cheaply can take subtransactions into
-	 * account here.
+	 * Pick the largest transaction (or subtransaction) and evict it from
+	 * memory by serializing it to disk.
 	 */
-	if (txn->nentries_mem >= max_changes_in_memory)
-	{
-		ReorderBufferSerializeTXN(rb, txn);
-		Assert(txn->nentries_mem == 0);
-	}
+	txn = ReorderBufferLargestTXN(rb);
+
+	ReorderBufferSerializeTXN(rb, txn);
+
+	/*
+	 * After eviction, the transaction should have no entries in memory, and
+	 * should use 0 bytes for changes.
+	 */
+	Assert(txn->size == 0);
+	Assert(txn->nentries_mem == 0);
+
+	/*
+	 * And furthermore, evicting the transaction should get us below the
+	 * memory limit again - it is not possible that we're still exceeding the
+	 * memory limit after evicting the transaction.
+	 *
+	 * This follows from the simple fact that the selected transaction is at
+	 * least as large as the most recent change (which caused us to go over
+	 * the memory limit). So by evicting it we're definitely back below the
+	 * memory limit.
+	 */
+	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
 /*
@@ -2513,6 +2671,84 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 }
 
 /*
+ * Size of a change in memory.
+ */
+static Size
+ReorderBufferChangeSize(ReorderBufferChange *change)
+{
+	Size		sz = sizeof(ReorderBufferChange);
+
+	switch (change->action)
+	{
+			/* fall through these, they're all similar enough */
+		case REORDER_BUFFER_CHANGE_INSERT:
+		case REORDER_BUFFER_CHANGE_UPDATE:
+		case REORDER_BUFFER_CHANGE_DELETE:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+			{
+				ReorderBufferTupleBuf *oldtup,
+						   *newtup;
+				Size		oldlen = 0;
+				Size		newlen = 0;
+
+				oldtup = change->data.tp.oldtuple;
+				newtup = change->data.tp.newtuple;
+
+				if (oldtup)
+				{
+					sz += sizeof(HeapTupleData);
+					oldlen = oldtup->tuple.t_len;
+					sz += oldlen;
+				}
+
+				if (newtup)
+				{
+					sz += sizeof(HeapTupleData);
+					newlen = newtup->tuple.t_len;
+					sz += newlen;
+				}
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_MESSAGE:
+			{
+				Size		prefix_size = strlen(change->data.msg.prefix) + 1;
+
+				sz += prefix_size + change->data.msg.message_size +
+					sizeof(Size) + sizeof(Size);
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+			{
+				Snapshot	snap;
+
+				snap = change->data.snapshot;
+
+				sz += sizeof(SnapshotData) +
+					sizeof(TransactionId) * snap->xcnt +
+					sizeof(TransactionId) * snap->subxcnt;
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			{
+				sz += sizeof(Oid) * change->data.truncate.nrelids;
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+			/* ReorderBufferChange contains everything important */
+			break;
+	}
+
+	return sz;
+}
+
+
+/*
  * Restore a number of changes spilled to disk back into memory.
  */
 static Size
@@ -2784,6 +3020,16 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries_mem++;
+
+	/*
+	 * Update memory accounting for the restored change.  We need to do this
+	 * although we don't check the memory limit when restoring the changes in
+	 * this branch (we only do that when initially queueing the changes after
+	 * decoding), because we will release the changes later, and that will
+	 * update the accounting too (subtracting the size from the counters).
+	 * And we don't want to underflow there.
+	 */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
 }
 
 /*
@@ -3003,6 +3249,19 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
  *
  * We cannot replace unchanged toast tuples though, so those will still point
  * to on-disk toast data.
+ *
+ * While updating the existing change with detoasted tuple data, we need to
+ * update the memory accounting info, because the change size will differ.
+ * Otherwise the accounting may get out of sync, triggering serialization
+ * at unexpected times.
+ *
+ * We simply subtract the size of the change before rejiggering the tuple,
+ * and then add the new size back. This makes it look like the change was
+ * removed and then added back, except it only tweaks the accounting info.
+ *
+ * In particular, this can't trigger serialization, which would be pointless
+ * anyway, as the replacement happens during commit processing right before
+ * handing the change to the output plugin.
  */
 static void
 ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3023,6 +3282,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	if (txn->toast_hash == NULL)
 		return;
 
+	/*
+	 * We're going to modify the size of the change, so to make sure the
+	 * accounting is correct we'll make it look like we're removing the
+	 * change now (with the old size), and then re-add it at the end.
+	 */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
 	oldcontext = MemoryContextSwitchTo(rb->context);
 
 	/* we should only have toast tuples in an INSERT or UPDATE */
@@ -3172,6 +3438,9 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	pfree(isnull);
 
 	MemoryContextSwitchTo(oldcontext);
+
+	/* now add the change back, with the correct size */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
 }
 
 /*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 31a5ef0..49ba9cc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -65,6 +65,7 @@
 #include "postmaster/postmaster.h"
 #include "postmaster/syslogger.h"
 #include "postmaster/walwriter.h"
+#include "replication/logical.h"
 #include "replication/logicallauncher.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
@@ -2251,6 +2252,18 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
+			gettext_noop("Sets the maximum memory to be used for logical decoding."),
+			gettext_noop("This much memory can be used by each internal "
+						 "reorder buffer before spilling to disk."),
+			GUC_UNIT_KB
+		},
+		&logical_decoding_work_mem,
+		65536, 64, MAX_KILOBYTES,
+		NULL, NULL, NULL
+	},
+
 	/*
 	 * We use the hopefully-safely-small value of 100kB as the compiled-in
 	 * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0fc23e3..129b3ab 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -130,6 +130,7 @@
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
+#logical_decoding_work_mem = 64MB	# min 64kB
 #max_stack_depth = 2MB			# min 100kB
 #shared_memory_type = mmap		# the default is the first option
 					# supported by the operating system:
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4c06a78..4dcef80 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -17,6 +17,8 @@
 #include "utils/snapshot.h"
 #include "utils/timestamp.h"
 
+extern PGDLLIMPORT	int	logical_decoding_work_mem;
+
 /* an individual tuple, stored in one chunk of memory */
 typedef struct ReorderBufferTupleBuf
 {
@@ -63,6 +65,9 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
+/* forward declaration */
+struct ReorderBufferTXN;
+
 /*
  * a single 'change', can be an insert (with one tuple), an update (old, new),
  * or a delete (old).
@@ -77,6 +82,9 @@ typedef struct ReorderBufferChange
 	/* The type of change. */
 	enum ReorderBufferChangeType action;
 
+	/* Transaction this change belongs to. */
+	struct ReorderBufferTXN *txn;
+
 	RepOriginId origin_id;
 
 	/*
@@ -286,6 +294,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * Size of this transaction (changes currently in memory, in bytes).
+	 */
+	Size		size;
+
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -386,6 +399,9 @@ struct ReorderBuffer
 	/* buffer for disk<->memory conversions */
 	char	   *outbuf;
 	Size		outbufsize;
+
+	/* memory accounting */
+	Size		size;
 };
 
 
-- 
1.8.3.1
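
A quick way to exercise the new GUC from SQL is through the logical
decoding functions; a minimal sketch (the slot name here is invented,
and memory units work because the GUC is declared with GUC_UNIT_KB):

    SHOW logical_decoding_work_mem;           -- compiled-in default: 64MB
    SET logical_decoding_work_mem = '256MB';  -- per-session, as it is PGC_USERSET
    SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);

With the higher limit, larger transactions can be decoded without
spilling to disk.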

0002-Track-statistics-for-spilling.patchapplication/octet-stream; name=0002-Track-statistics-for-spilling.patchDownload
From e10118b6d211915f040a70b18eb58a4cc32e7051 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 11 Oct 2019 09:07:41 +0530
Subject: [PATCH 2/3] Track statistics for spilling

---
 doc/src/sgml/monitoring.sgml                    | 23 ++++++++++++++
 src/backend/catalog/system_views.sql            |  5 ++-
 src/backend/replication/logical/reorderbuffer.c | 12 +++++++
 src/backend/replication/walsender.c             | 42 +++++++++++++++++++++++--
 src/include/catalog/pg_proc.dat                 |  6 ++--
 src/include/replication/reorderbuffer.h         | 11 +++++++
 src/include/replication/walsender_private.h     |  5 +++
 src/test/regress/expected/rules.out             |  7 +++--
 8 files changed, 103 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 828e908..18be607 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1971,6 +1971,29 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
      <entry><type>timestamp with time zone</type></entry>
      <entry>Send time of last reply message received from standby server</entry>
     </row>
+    <row>
+     <entry><structfield>spill_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded transaction data spilled to disk.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_decoding_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times transactions were spilled to disk. Transactions
+      may get spilled repeatedly, and this counter gets incremented on every
+      such invocation.
+      </entry>
+    </row>
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 9fe4a47..2ee2a06 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -776,7 +776,10 @@ CREATE VIEW pg_stat_replication AS
             W.replay_lag,
             W.sync_priority,
             W.sync_state,
-            W.reply_time
+            W.reply_time,
+            W.spill_txns,
+            W.spill_count,
+            W.spill_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8ed8bd8..1fa0261 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -308,6 +308,10 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	buffer->spillCount = 0;
+	buffer->spillTxns = 0;
+	buffer->spillBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -2415,6 +2419,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	int			fd = -1;
 	XLogSegNo	curOpenSegNo = 0;
 	Size		spilled = 0;
+	Size		size = txn->size;
 
 	elog(DEBUG2, "spill %u changes in XID %u to disk",
 		 (uint32) txn->nentries_mem, txn->xid);
@@ -2473,6 +2478,13 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		spilled++;
 	}
 
+	/* update the statistics */
+	rb->spillCount += 1;
+	rb->spillBytes += size;
+
+	/* Don't count transactions that were already serialized. */
+	rb->spillTxns += txn->serialized ? 0 : 1;
+
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index b0ebe50..af40d01 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -248,6 +248,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
+static void UpdateSpillStats(LogicalDecodingContext *ctx);
 static void XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count);
 
 
@@ -1261,7 +1262,8 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
 /*
  * LogicalDecodingContext 'update_progress' callback.
  *
- * Write the current position to the lag tracker (see XLogSendPhysical).
+ * Write the current position to the lag tracker (see XLogSendPhysical),
+ * and update the spill statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1280,6 +1282,11 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 
 	LagTrackerWrite(lsn, now);
 	sendTime = now;
+
+	/*
+	 * Update statistics about transactions that spilled to disk.
+	 */
+	UpdateSpillStats(ctx);
 }
 
 /*
@@ -2318,6 +2325,9 @@ InitWalSenderSlot(void)
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			walsnd->replyTime = 0;
+			walsnd->spillTxns = 0;
+			walsnd->spillCount = 0;
+			walsnd->spillBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3219,7 +3229,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	12
+#define PG_STAT_GET_WAL_SENDERS_COLS	15
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3274,6 +3284,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int			pid;
 		WalSndState state;
 		TimestampTz replyTime;
+		int64		spillTxns;
+		int64		spillCount;
+		int64		spillBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3294,6 +3307,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		replyTime = walsnd->replyTime;
+		spillTxns = walsnd->spillTxns;
+		spillCount = walsnd->spillCount;
+		spillBytes = walsnd->spillBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3375,6 +3391,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[11] = true;
 			else
 				values[11] = TimestampTzGetDatum(replyTime);
+
+			/* spill to disk */
+			values[12] = Int64GetDatum(spillTxns);
+			values[13] = Int64GetDatum(spillCount);
+			values[14] = Int64GetDatum(spillBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3611,3 +3632,20 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+static void
+UpdateSpillStats(LogicalDecodingContext *ctx)
+{
+	ReorderBuffer *rb = ctx->reorder;
+
+	SpinLockAcquire(&MyWalSnd->mutex);
+
+	MyWalSnd->spillTxns = rb->spillTxns;
+	MyWalSnd->spillCount = rb->spillCount;
+	MyWalSnd->spillBytes = rb->spillBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %ld %ld %ld",
+		 rb, rb->spillTxns, rb->spillCount, rb->spillBytes);
+
+	SpinLockRelease(&MyWalSnd->mutex);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 58ea5b9..fa0a2a1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5166,9 +5166,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4dcef80..ba7f9f0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -402,6 +402,17 @@ struct ReorderBuffer
 
 	/* memory accounting */
 	Size		size;
+
+	/*
+	 * Statistics about transactions spilled to disk.
+	 *
+	 * A single transaction may be spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 */
+	int64	spillCount;		/* spill-to-disk invocation counter */
+	int64	spillTxns;		/* number of transactions spilled to disk  */
+	int64	spillBytes;		/* amount of data spilled to disk */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0dd6d1c..a6b3205 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -80,6 +80,11 @@ typedef struct WalSnd
 	 * Timestamp of the last message received from standby.
 	 */
 	TimestampTz replyTime;
+
+	/* Statistics for transactions spilled to disk. */
+	int64		spillTxns;
+	int64		spillCount;
+	int64		spillBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 210e9cd..750bdc4 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1951,9 +1951,12 @@ pg_stat_replication| SELECT s.pid,
     w.replay_lag,
     w.sync_priority,
     w.sync_state,
-    w.reply_time
+    w.reply_time,
+    w.spill_txns,
+    w.spill_count,
+    w.spill_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON   	((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1
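
With this patch applied, the new counters become visible in
pg_stat_replication; a simple sanity check might look like this (a
sketch, assuming at least one walsender is active):

    SELECT application_name, spill_txns, spill_count,
           pg_size_pretty(spill_bytes) AS spill_size
      FROM pg_stat_replication;

Note that the counters are accumulated in the ReorderBuffer and only
copied into the shared WalSnd slot from the update_progress callback,
so the view can lag slightly behind the actual spilling.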

0003-Support-logical_decoding_work_mem-set-from-create-su.patchapplication/octet-stream; name=0003-Support-logical_decoding_work_mem-set-from-create-su.patchDownload
From bab5dbc8915d074604614fcf3b8486abb04d3a2d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 21 Oct 2019 17:45:47 +0530
Subject: [PATCH 3/3] Support logical_decoding_work_mem set from create
 subscription command

---
 doc/src/sgml/config.sgml                           | 21 +++++++++++
 doc/src/sgml/ref/create_subscription.sgml          | 12 ++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/commands/subscriptioncmds.c            | 44 +++++++++++++++++++---
 .../libpqwalreceiver/libpqwalreceiver.c            |  3 ++
 src/backend/replication/logical/worker.c           |  1 +
 src/backend/replication/pgoutput/pgoutput.c        | 30 ++++++++++++++-
 src/include/catalog/pg_subscription.h              |  3 ++
 src/include/replication/walreceiver.h              |  1 +
 9 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 291b343..6b700c6 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1737,6 +1737,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..91790b0 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index afee283..7e3ba8e 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -71,6 +71,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 2e67a58..d85e831 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -66,7 +66,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -97,6 +98,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -182,6 +185,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -325,6 +338,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -341,7 +356,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -419,6 +434,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -682,10 +703,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -710,6 +734,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -721,7 +752,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -759,7 +791,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -796,7 +828,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 6eba08a..65b3266 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ff62303..14c0ce8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1726,6 +1726,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c08757..317c5d4 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -21,6 +21,7 @@
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
 
+#include "utils/guc.h"
 #include "utils/inval.h"
 #include "utils/int8.h"
 #include "utils/memutils.h"
@@ -90,11 +91,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -140,6 +142,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -174,7 +199,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3cb13d8..10ea113 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e12a934..4e68a69 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -162,6 +162,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
1.8.3.1
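
To illustrate the subscriber-side option added by this patch, here is a
hypothetical usage sketch (subscription and publication names invented;
the value is in kB, mirroring logical_decoding_work_mem):

    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION mypub
        WITH (work_mem = 262144);  -- 256MB decoding limit on the publisher

    ALTER SUBSCRIPTION mysub SET (work_mem = 524288);  -- raise it to 512MB

Per the patch, the value is forwarded to the walsender in the
START_REPLICATION options, where pgoutput uses it to override
logical_decoding_work_mem for that connection.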

#107Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#94)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have attempted to test the performance of (Stream + Spill) vs
(Stream + BGW pool) and I can see a gain similar to what Alexey had
shown [1].

In addition to this, I have rebased the latest patchset [2] without
the two-phase logical decoding patch set.

Test results:
I have repeated the same test as Alexey [1] for 1kk and 3kk rows, and
here are my results:
Stream + Spill
N      time on master (sec)    total xact time (sec)
1kk    6                       21
3kk    18                      55

Stream + BGW pool
N      time on master (sec)    total xact time (sec)
1kk    6                       13
3kk    19                      35

I think the test results for the master are missing. Also, how about
running these tests over a network (meaning the master and subscriber
are not on the same machine)? In general, yours and Alexey's test
results show that there is merit in having workers apply such
transactions. OTOH, as noted above [1], we are also worried about the
performance of rollbacks if we follow that approach. I am not sure how
much we need to worry about rollbacks if commits are faster, but can
we think of recording the changes in memory and only writing them to a
file once they exceed a certain threshold? I think that might help
save I/O in many cases. I am not very sure how much additional workers
can help if we do that, but they might still help. I think we need to
do some tests and experiments to figure out which approach is best.
What do you think?

Tomas, Alexey, do you have any thoughts on this matter? I think it is
important that we figure out how to proceed with this patch.

[1]: /messages/by-id/b25ce80e-f536-78c8-d5c8-a5df3e230785@postgrespro.ru

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#108Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#107)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have attempted to test the performance of (Stream + Spill) vs
(Stream + BGW pool) and I can see a gain similar to what Alexey had
shown [1].

In addition to this, I have rebased the latest patchset [2] without
the two-phase logical decoding patch set.

Test results:
I have repeated the same test as Alexey [1] for 1kk and 3kk rows, and
here are my results:

Stream + Spill
N      time on master (sec)    total xact time (sec)
1kk    6                       21
3kk    18                      55

Stream + BGW pool
N      time on master (sec)    total xact time (sec)
1kk    6                       13
3kk    19                      35

I think the test results for the master are missing.

Yeah. At that time, I was planning to compare spill vs. bgworker.

Also, how about running these tests over a network (meaning the master
and subscriber are not on the same machine)?

Yeah, we should do that; it will show the merit of streaming
in-progress transactions.

In general, yours and Alexey's test results show that there is merit
in having workers apply such transactions. OTOH, as noted above [1],
we are also worried about the performance of rollbacks if we follow
that approach. I am not sure how much we need to worry about rollbacks
if commits are faster, but can we think of recording the changes in
memory and only writing them to a file once they exceed a certain
threshold? I think that might help save I/O in many cases. I am not
very sure how much additional workers can help if we do that, but they
might still help. I think we need to do some tests and experiments to
figure out which approach is best. What do you think?

I agree with that point. I think we might need to make some small
changes and run tests to see what the best method is for handling the
streamed changes at the subscriber end.

Tomas, Alexey, do you have any thoughts on this matter? I think it is
important that we figure out how to proceed with this patch.

[1]: /messages/by-id/b25ce80e-f536-78c8-d5c8-a5df3e230785@postgrespro.ru

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#109Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dilip Kumar (#106)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Oct 22, 2019 at 10:30:16AM +0530, Dilip Kumar wrote:

On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Sure, I wasn't really proposing to add all stats from that patch,
including those related to streaming. We need to extract just those
related to spilling. And yes, it needs to be moved right after 0001.

I have extracted the spilling-related code into a separate patch on
top of 0001. I have also fixed some bugs and review comments and
attached it as a separate patch. Later I can merge it into the main
patch if you agree with the changes.

Few comments
-------------------------
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
1.
+ {
+ {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
+ gettext_noop("Sets the maximum memory to be used for logical decoding."),
+ gettext_noop("This much memory can be used by each internal "
+ "reorder buffer before spilling to disk or streaming."),
+ GUC_UNIT_KB
+ },

I think we can remove 'or streaming' from the above sentence for now.
We can add it back later with the patch where streaming is allowed.

Done

2.
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
</para>
</listitem>
</varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>

It is not clear why we need this parameter, at least with this patch.
I have raised this multiple times [1][2].

I have moved it out as a separate patch (0003), so that if we decide
we need it for streaming transactions, we can keep it.

I'm OK with moving it to a separate patch. That being said, I think
the ability to control memory usage for individual subscriptions is
very useful. Saying "we don't need such a parameter" is essentially
equivalent to saying "one size fits all," and I think we know that's
not true.

Imagine a system with multiple subscriptions, some of them mostly
replicating OLTP changes, but one or two replicating tables that are
updated in batches. What we'd want is to allow a higher limit for the
batch subscriptions, but a much lower limit for the OLTP ones (which
should never hit it in practice).

With a single global GUC, you'll either have a high value - risking
OOM when the OLTP subscriptions happen to decode a batch update - or a
low value penalizing the batch subscriptions.
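
As a hypothetical sketch of that setup, using the per-subscription
work_mem option from patch 0003 (subscription names invented, values
in kB):

    ALTER SUBSCRIPTION oltp_sub  SET (work_mem = 1024);     -- 1MB, OLTP only
    ALTER SUBSCRIPTION batch_sub SET (work_mem = 1048576);  -- 1GB, batch loads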

It's not strictly necessary (and we already have such a limit), so I'm OK
with treating it as an enhancement for the future.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#110Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dilip Kumar (#108)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Oct 22, 2019 at 11:01:48AM +0530, Dilip Kumar wrote:

On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have attempted to test the performance of (Stream + Spill) vs
(Stream + BGW pool) and I can see a gain similar to what Alexey had
shown [1].

In addition to this, I have rebased the latest patchset [2] without
the two-phase logical decoding patch set.

Test results:
I have repeated the same test as Alexey [1] for 1kk and 3kk rows, and
here are my results:

Stream + Spill
N      time on master (sec)    total xact time (sec)
1kk    6                       21
3kk    18                      55

Stream + BGW pool
N      time on master (sec)    total xact time (sec)
1kk    6                       13
3kk    19                      35

I think the test results for the master are missing.

Yeah. At that time, I was planning to compare spill vs. bgworker.

Also, how about running these tests over a network (meaning the master
and subscriber are not on the same machine)?

Yeah, we should do that; it will show the merit of streaming
in-progress transactions.

While I agree it's an interesting feature, I think we need to stop
adding more stuff to this patch series - it's already complex enough,
and adding even more (unnecessary) stuff is a distraction that will
make it harder to get anything committed. Typical "scope creep".

I think the current behavior (spill to file) is sufficient for v0 and
can be improved later - that's fine. I don't think we need to bother
with comparisons to master very much, because while it might be a bit
slower in some cases, you can always disable streaming (so if there's a
regression for your workload, you can undo that).

In general, yours and Alexey's test results show that there is merit
in having workers apply such transactions. OTOH, as noted above [1],
we are also worried about the performance of rollbacks if we follow
that approach. I am not sure how much we need to worry about rollbacks
if commits are faster, but can we think of recording the changes in
memory and only writing them to a file once they exceed a certain
threshold? I think that might help save I/O in many cases. I am not
very sure how much additional workers can help if we do that, but they
might still help. I think we need to do some tests and experiments to
figure out which approach is best. What do you think?

I agree with that point. I think we might need to make some small
changes and run tests to see what the best method is for handling the
streamed changes at the subscriber end.

Tomas, Alexey, do you have any thoughts on this matter? I think it is
important that we figure out how to proceed with this patch.

[1]: /messages/by-id/b25ce80e-f536-78c8-d5c8-a5df3e230785@postgrespro.ru

I think the patch should do the simplest thing possible, i.e. what it
does today. Otherwise we'll never get it committed.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#111Alexey Kondratov
a.kondratov@postgrespro.ru
In reply to: Tomas Vondra (#110)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 22.10.2019 20:22, Tomas Vondra wrote:

On Tue, Oct 22, 2019 at 11:01:48AM +0530, Dilip Kumar wrote:

On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila
<amit.kapila16@gmail.com> wrote:
In general, yours and Alexey's test results show that there is merit
in having workers apply such transactions. OTOH, as noted above [1],
we are also worried about the performance of rollbacks if we follow
that approach. I am not sure how much we need to worry about rollbacks
if commits are faster, but can we think of recording the changes in
memory and only writing them to a file once they exceed a certain
threshold? I think that might help save I/O in many cases. I am not
very sure how much additional workers can help if we do that, but they
might still help. I think we need to do some tests and experiments to
figure out which approach is best. What do you think?

I agree with that point. I think we might need to make some small
changes and run tests to see what the best method is for handling the
streamed changes at the subscriber end.

Tomas, Alexey, do you have any thoughts on this matter?  I think it is
important that we figure out how to proceed with this patch.

[1]: /messages/by-id/b25ce80e-f536-78c8-d5c8-a5df3e230785@postgrespro.ru

I think the patch should do the simplest thing possible, i.e. what it
does today. Otherwise we'll never get it committed.

I have to agree with Tomas that keeping things as simple as possible
should be the main priority right now. Otherwise, the entire patch set
will pass another release cycle without being committed even partially.
At the same time, it resolves an important problem from my perspective:
it moves I/O overhead from primary to replica by streaming large
transactions, which is a nice-to-have feature I guess.

Later it would be possible to replace the logical apply worker with a
bgworker pool in a separate patch, if we decide that it is a viable
solution. Anyway, regarding Amit's questions:

- I doubt that maintaining a separate buffer on the apply side before
spilling to disk would help enough. We already have ReorderBuffer with
the logical_work_mem limit, and if we exceeded that limit on the sender
side, then most probably we will exceed it on the applier side as well,
except when this new buffer is significantly larger than
logical_work_mem, to keep multiple open xacts.

- I still think that we should optimize the database for commits, not
rollbacks. The bgworker pool is dramatically slower for a rollbacks-only
load, though at least twice as fast for commits-only. I do not know how
it will perform with a real-life load, but this drawback may be
inappropriate for a general-purpose database like Postgres.

- Tomas' implementation of streaming with spilling does not have this
bias between commits/aborts. However, it has a noticeable performance
drop (~5x slower compared with master [1]) for a large transaction
consisting of many small rows, although it is not an order of
magnitude slower.

Another thing is that about a year ago I found some problems with
MVCC/visibility and fixed them somehow [1]. If I get it correctly,
Tomas adapted some of those fixes into his patch set, but I think that
this part should be reviewed carefully again. I would be glad to check
it, but now I am a little bit confused by all the patch set variants
in the thread. Which is the latest one? Is it still dependent on 2PC
decoding?

[1]: /messages/by-id/40c38758-04b5-74f4-c963-cf300f9e5dff@postgrespro.ru

Thanks for moving this patch forward!

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

#112Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#109)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Oct 22, 2019 at 10:42 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Oct 22, 2019 at 10:30:16AM +0530, Dilip Kumar wrote:

I have moved it out as a separate patch (0003), so that if we decide
we need it for streaming transactions, we can keep it.

I'm OK with moving it to a separate patch. That being said, I think
the ability to control memory usage for individual subscriptions is
very useful. Saying "we don't need such a parameter" is essentially
equivalent to saying "one size fits all," and I think we know that's
not true.

Imagine a system with multiple subscriptions, some of them mostly
replicating OLTP changes, but one or two replicating tables that are
updated in batches. What we'd want is to allow a higher limit for the
batch subscriptions, but a much lower limit for the OLTP ones (which
should never hit it in practice).

This point is not clear to me. The changes are recorded in the
ReorderBuffer, which doesn't do any filtering, i.e. it will contain
all the changes irrespective of the subscriber. How will having
different limits make a difference?

With a single global GUC, you'll either have a high value - risking
OOM when the OLTP subscriptions happen to decode a batch update - or a
low value penalizing the batch subscriptions.

It's not strictly necessary (and we already have such a limit), so I'm
OK with treating it as an enhancement for the future.

I am fine too if its usage is clear. I might be missing something here.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#113Amit Kapila
amit.kapila16@gmail.com
In reply to: Alexey Kondratov (#111)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Oct 23, 2019 at 12:32 AM Alexey Kondratov
<a.kondratov@postgrespro.ru> wrote:

On 22.10.2019 20:22, Tomas Vondra wrote:

I think the patch should do the simplest thing possible, i.e. what it
does today. Otherwise we'll never get it committed.

I have to agree with Tomas that keeping things as simple as possible
should be the main priority right now. Otherwise, the entire patch set
will pass another release cycle without being committed even partially.
At the same time, it resolves an important problem from my perspective:
it moves I/O overhead from primary to replica by streaming large
transactions, which is a nice-to-have feature I guess.

Later it would be possible to replace the logical apply worker with a
bgworker pool in a separate patch, if we decide that it is a viable
solution. Anyway, regarding Amit's questions:

- I doubt that maintaining a separate buffer on the apply side before
spilling to disk would help enough. We already have ReorderBuffer with
the logical_work_mem limit, and if we exceeded that limit on the sender
side, then most probably we will exceed it on the applier side as well,

I think on the sender side the limit applies to unfiltered changes
(i.e. the ReorderBuffer, which has all the changes), whereas on the
receiver side we will only have the requested changes, which can make
a difference.

except when this new buffer is significantly larger than
logical_work_mem, to keep multiple open xacts.

I am not sure, but I think we can have different controlling
parameters on the subscriber side.

- I still think that we should optimize the database for commits, not
rollbacks. The bgworker pool is dramatically slower for a rollbacks-only
load, though at least twice as fast for commits-only. I do not know how
it will perform with a real-life load, but this drawback may be
inappropriate for a general-purpose database like Postgres.

- Tomas' implementation of streaming with spilling does not have this
bias between commits/aborts. However, it has a noticeable performance
drop (~5x slower compared with master [1]) for a large transaction
consisting of many small rows, although it is not an order of
magnitude slower.

Did you ever identify the reason why it was slower in that case? I
can see the numbers shared by you and Dilip, which show that the
BGWorker pool is a really good idea and will work great for a
commit-mostly workload, whereas the numbers without it are not very
encouraging; maybe we have not benchmarked enough. This is the reason
I am trying to see if we can do something to get benefits similar
to what is shown by your idea.

I am not against doing something simple for the first version and then
enhancing it later, but it won't be good if we commit it with a
regression in some typical cases and depend on the user to enable it
only when favorable to their case. Also, sometimes it becomes difficult
to generate enthusiasm to enhance a feature once the main patch is
committed. I am not saying that always happens, or will happen in this
case. It is better if we put in some energy and get things as good as
possible in the first go. I am as interested as you, Tomas, and others
are; otherwise, I wouldn't have spent a lot of time disentangling this
from the 2PC patch, which seems to have stalled due to lack of
interest.

Another thing is that about a year ago I found some problems with
MVCC/visibility and fixed them somehow [1]. If I understand correctly,
Tomas adapted some of those fixes into his patch set, but I think that
this part should be reviewed carefully again.

Agreed, I have read your emails and could see that you have done very
good work on this project along with Tomas. But unfortunately, it
didn't get committed. At this stage, we are working on just the first
part of the patch, which is to allow the data to spill once it crosses
logical_decoding_work_mem on the master side. I think there will be
more problems to discuss and solve once that is done.

I would be glad to check
it, but now I am a little bit confused by all the patch set variants
in the thread. Which is the latest one? Is it still dependent on 2PC decoding?

I think the latest patches posted by Dilip are not dependent on
2PC decoding, but I haven't studied them yet. You can find them at
[1] and [2]. As per discussion in this thread, we are also trying to
see if we can get some part of the patch series committed first; the
latest patches corresponding to that are posted at [3].

[1]: /messages/by-id/CAFiTN-vHoksqvV4BZ0479NhugGe4QHq_ezngNdDd-YRQ_2cwug@mail.gmail.com
[2]: /messages/by-id/CAFiTN-vT+42xRbkw=hBnp44XkAyZaKZVA5hcvAMsYth3rk7vhg@mail.gmail.com
[3]: /messages/by-id/CAFiTN-vkFB0RBEjVkLWhdgTYShSrSu3kCYObMghgXEwKA1FXRA@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#114Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#106)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002.

I was wondering whether we have checked the code coverage after this
patch? Previously, the existing tests seem to be covering most parts
of the function ReorderBufferSerializeTXN [1]. After this patch, the
timing to call ReorderBufferSerializeTXN will change, so that might
impact the testing of the same. If it is already covered, then I
would like to either add a new test or extend existing test with the
help of new spill counters. If it is not getting covered, then we
need to think of extending the existing test or write a new test to
cover the function ReorderBufferSerializeTXN.

[1]: https://coverage.postgresql.org/src/backend/replication/logical/reorderbuffer.c.gcov.html

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#115vignesh C
vignesh21@gmail.com
In reply to: Tomas Vondra (#110)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I think the patch should do the simplest thing possible, i.e. what it
does today. Otherwise we'll never get it committed.

I found a couple of crashes while reviewing and testing flushing of
open transaction data:
Issue 1:
#0 0x00007f22c5722337 in raise () from /lib64/libc.so.6
#1 0x00007f22c5723a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec5390 in ExceptionalCondition
(conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
"FailedAssertion",
fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3 0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
off=64) at ../../../../src/include/lib/ilist.h:458
#4 0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
oldestRunningXid=3834) at reorderbuffer.c:1966
#5 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
buf=0x7ffcbc26dc50) at decode.c:332
#6 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x19af990,
record=0x19afc50) at decode.c:121
#7 0x0000000000b7109e in XLogSendLogical () at walsender.c:2845
#8 0x0000000000b6f5e4 in WalSndLoop (send_data=0xb70f77
<XLogSendLogical>) at walsender.c:2199
#9 0x0000000000b6c7e1 in StartLogicalReplication (cmd=0x1983168) at
walsender.c:1128
#10 0x0000000000b6da6f in exec_replication_command
(cmd_string=0x18f70a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
(proto_version '1', publication_names '\"pub1\"')")
at walsender.c:1545

Issue 2:
#0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
#1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec4e1d in ExceptionalCondition
(conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
errorType=0x10ea284 "FailedAssertion",
fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:3052
#4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:1318
#5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2b9d778) at reorderbuffer.c:1257
#6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
oldestRunningXid=3835) at reorderbuffer.c:1973
#7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
buf=0x7ffcbc74cc00) at decode.c:332
#8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
record=0x2b67990) at decode.c:121
#9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845

These failures come randomly.
I'm not able to reproduce this issue with a simple test case.
I have attached the test case which I used for testing.
I will keep trying to find a scenario that reproduces it consistently.
Posting it so that it can help someone identify the problem in parallel
through code review by experts.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

mix_data_test.c (text/x-c-code)
#116Dilip Kumar
dilipbalaut@gmail.com
In reply to: vignesh C (#115)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote:

I have noticed one more problem in the logic of setting the logical
decoding work mem from the CREATE SUBSCRIPTION command. If the
subscription command doesn't specify work_mem, a garbage value is sent
to the walsender, and the walsender overwrites its value with that
garbage value. After investigating a bit I have found the reason for
the same.

@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
appendStringInfo(&cmd, "proto_version '%u'",
options->proto.logical.proto_version);

+ appendStringInfo(&cmd, ", work_mem '%d'",
+ options->proto.logical.work_mem);

I think the problem is we are unconditionally sending the work_mem as
part of the CREATE REPLICATION SLOT command, without checking whether
it's valid or not.

--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -71,6 +71,7 @@ GetSubscription(Oid subid, bool missing_ok)
  sub->name = pstrdup(NameStr(subform->subname));
  sub->owner = subform->subowner;
  sub->enabled = subform->subenabled;
+ sub->workmem = subform->subworkmem;

Another problem is that there is no handling if the subform->subworkmem is NULL.
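
As an illustrative sketch of how both problems might be handled (not
from any posted patch; the -1 sentinel for "not specified" and the
Anum_pg_subscription_subworkmem attribute number are assumptions about
the patch):

/* In libpqrcv_startstreaming(): only send work_mem when the
 * subscription actually specified it (assuming -1 means "not set"). */
if (options->proto.logical.work_mem != -1)
	appendStringInfo(&cmd, ", work_mem '%d'",
					 options->proto.logical.work_mem);

/* In GetSubscription(): fetch subworkmem with SysCacheGetAttr() so a
 * NULL value is detected instead of being read as garbage. */
{
	Datum		datum;
	bool		isnull;

	datum = SysCacheGetAttr(SUBSCRIPTIONOID, tup,
							Anum_pg_subscription_subworkmem, &isnull);
	sub->workmem = isnull ? -1 : DatumGetInt32(datum);
}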

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#117Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Dilip Kumar (#116)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hello hackers,

I've done some performance testing of this feature. Following is my
test case (taken from an earlier thread):

postgres=# CREATE TABLE large_test (num1 bigint, num2 double
precision, num3 double precision);
postgres=# \timing on
postgres=# EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1,
num2, num3) SELECT round(random()*10), random(), random()*142 FROM
generate_series(1, 1000000) s(i);

I've kept the publisher and subscriber on two different systems.

HEAD:
With 1000000 tuples,
Execution Time: 2576.821 ms, Time: 9632.158 ms (00:09.632), Spill count: 245
With 10000000 tuples (10 times more),
Execution Time: 30359.509 ms, Time: 95261.024 ms (01:35.261), Spill count: 2442

With the memory accounting patch, following are the performance results:
With 100000 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2414.371 ms, Time:
9648.223 ms (00:09.648), Spill count: 2315
logical_decoding_work_mem=64MB, Execution Time: 2477.830 ms, Time:
9895.161 ms (00:09.895), Spill count 3
With 1000000 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 38259.227 ms, Time:
105761.978 ms (01:45.762), Spill count: 23149
logical_decoding_work_mem=64MB, Execution Time: 24624.639 ms, Time:
89985.342 ms (01:29.985), Spill count: 23

With logical decoding of in-progress transactions patch and with
streaming on, following are the performance results:
With 100000 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2674.034 ms, Time:
20779.601 ms (00:20.780)
logical_decoding_work_mem=64MB, Execution Time: 2062.404 ms, Time:
9559.953 ms (00:09.560)
With 1000000 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 26949.588 ms, Time:
196261.892 ms (03:16.262)
logical_decoding_work_mem=64MB, Execution Time: 27084.403 ms, Time:
90079.286 ms (01:30.079)
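
As an illustrative sketch (assuming only the spill_* columns added to
pg_stat_replication by the spill-statistics patch), a small standalone
libpq program to watch these counters while such a benchmark runs could
look like this:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;
	int			i;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	res = PQexec(conn,
				 "SELECT application_name, spill_txns, spill_count, "
				 "spill_bytes FROM pg_stat_replication");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
	{
		fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
		PQclear(res);
		PQfinish(conn);
		return 1;
	}

	/* print one line per walsender */
	for (i = 0; i < PQntuples(res); i++)
		printf("%s: spill_txns=%s spill_count=%s spill_bytes=%s\n",
			   PQgetvalue(res, i, 0), PQgetvalue(res, i, 1),
			   PQgetvalue(res, i, 2), PQgetvalue(res, i, 3));

	PQclear(res);
	PQfinish(conn);
	return 0;
}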
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#118Dilip Kumar
dilipbalaut@gmail.com
In reply to: Kuntal Ghosh (#117)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Nov 4, 2019 at 2:43 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

Hello hackers,

I've done some performance testing of this feature. Following is my
test case (taken from an earlier thread):

postgres=# CREATE TABLE large_test (num1 bigint, num2 double
precision, num3 double precision);
postgres=# \timing on
postgres=# EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1,
num2, num3) SELECT round(random()*10), random(), random()*142 FROM
generate_series(1, 1000000) s(i);

I've kept the publisher and subscriber on two different systems.

HEAD:
With 1000000 tuples,
Execution Time: 2576.821 ms, Time: 9632.158 ms (00:09.632), Spill count: 245
With 10000000 tuples (10 times more),
Execution Time: 30359.509 ms, Time: 95261.024 ms (01:35.261), Spill count: 2442

With the memory accounting patch, following are the performance results:
With 100000 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2414.371 ms, Time:
9648.223 ms (00:09.648), Spill count: 2315
logical_decoding_work_mem=64MB, Execution Time: 2477.830 ms, Time:
9895.161 ms (00:09.895), Spill count 3
With 1000000 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 38259.227 ms, Time:
105761.978 ms (01:45.762), Spill count: 23149
logical_decoding_work_mem=64MB, Execution Time: 24624.639 ms, Time:
89985.342 ms (01:29.985), Spill count: 23

With logical decoding of in-progress transactions patch and with
streaming on, following are the performance results:
With 100000 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2674.034 ms, Time:
20779.601 ms (00:20.780)
logical_decoding_work_mem=64MB, Execution Time: 2062.404 ms, Time:
9559.953 ms (00:09.560)
With 1000000 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 26949.588 ms, Time:
196261.892 ms (03:16.262)
logical_decoding_work_mem=64MB, Execution Time: 27084.403 ms, Time:
90079.286 ms (01:30.079)

So your results show that with "streaming on", performance is
degrading? By any chance did you try to see where the bottleneck is?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#119Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Dilip Kumar (#118)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

So your result shows that with "streaming on", performance is
degrading? By any chance did you try to see where is the bottleneck?

Right. But, as we increase the logical_decoding_work_mem, the
performance improves. I've not analyzed the bottleneck yet. I'm
looking into the same.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#120vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#114)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Oct 24, 2019 at 7:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002.

I was wondering whether we have checked the code coverage after this
patch? Previously, the existing tests seem to be covering most parts
of the function ReorderBufferSerializeTXN [1]. After this patch, the
timing to call ReorderBufferSerializeTXN will change, so that might
impact the testing of the same. If it is already covered, then I
would like to either add a new test or extend existing test with the
help of new spill counters. If it is not getting covered, then we
need to think of extending the existing test or write a new test to
cover the function ReorderBufferSerializeTXN.

I have run the tests with coverage and found that
ReorderBufferSerializeTXN is not being hit.
The reason it is not being hit is the following check in
ReorderBufferCheckMemoryLimit:
/* bail out if we haven't exceeded the memory limit */
if (rb->size < logical_decoding_work_mem * 1024L)
return;
Previously the tests from contrib/test_decoding could hit the
ReorderBufferSerializeTXN function.
I'm checking if we can modify the test or add a new test to hit the
ReorderBufferSerializeTXN function.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

#121Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#115)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I think the patch should do the simplest thing possible, i.e. what it
does today. Otherwise we'll never get it committed.

I found a couple of crashes while reviewing and testing flushing of
open transaction data:

Thanks for doing these tests. However, I don't think these issues are
in any way related to this patch. They seem to be base-code issues
manifested by this patch. See my analysis below.

Issue 1:
#0 0x00007f22c5722337 in raise () from /lib64/libc.so.6
#1 0x00007f22c5723a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec5390 in ExceptionalCondition
(conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
"FailedAssertion",
fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3 0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
off=64) at ../../../../src/include/lib/ilist.h:458
#4 0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
oldestRunningXid=3834) at reorderbuffer.c:1966
#5 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
buf=0x7ffcbc26dc50) at decode.c:332

This seems to be a problem in the base code, where we abort immediately
after serializing the changes; in that case, the changes list will be
empty. I think you can try to reproduce it via the debugger, or by
hacking the code so that it serializes after every change: if you then
abort after one change, it should hit this problem.

Issue 2:
#0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
#1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec4e1d in ExceptionalCondition
(conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
errorType=0x10ea284 "FailedAssertion",
fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:3052
#4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:1318
#5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2b9d778) at reorderbuffer.c:1257
#6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
oldestRunningXid=3835) at reorderbuffer.c:1973

This again seems to be a problem with the base code, as we don't update
final_lsn for subtransactions during ReorderBufferAbortOld. This can
also be reproduced by hacking the code or via the debugger, in a
similar way as explained for the previous problem, with the difference
that a subtransaction must be involved in this case.
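
To illustrate the kind of fix the missing final_lsn might need, here is
a sketch only (not a posted patch; where exactly this would go in
ReorderBufferAbortOld, and whether txn->final_lsn itself is always
valid at that point, are assumptions):

/* Before cleaning up a crashed transaction, give each of its
 * subtransactions a valid final_lsn, so that the assertion in
 * ReorderBufferRestoreCleanup() holds. */
dlist_iter	iter;

dlist_foreach(iter, &txn->subtxns)
{
	ReorderBufferTXN *subtxn;

	subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);

	if (subtxn->final_lsn == InvalidXLogRecPtr)
		subtxn->final_lsn = txn->final_lsn;
}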

#7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
buf=0x7ffcbc74cc00) at decode.c:332
#8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
record=0x2b67990) at decode.c:121
#9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845

These failures come randomly.
I'm not able to reproduce this issue with a simple test case.

Yeah, it appears to be difficult to reproduce unless you hack the code
to serialize every change, or use a debugger to forcefully flush the
changes every time.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#122Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#121)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Nov 4, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I think the patch should do the simplest thing possible, i.e. what it
does today. Otherwise we'll never get it committed.

I found a couple of crashes while reviewing and testing flushing of
open transaction data:

Thanks for doing these tests. However, I don't think these issues are
in any way related to this patch. They seem to be base-code issues
manifested by this patch. See my analysis below.

Issue 1:
#0 0x00007f22c5722337 in raise () from /lib64/libc.so.6
#1 0x00007f22c5723a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec5390 in ExceptionalCondition
(conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
"FailedAssertion",
fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3 0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
off=64) at ../../../../src/include/lib/ilist.h:458
#4 0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
oldestRunningXid=3834) at reorderbuffer.c:1966
#5 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
buf=0x7ffcbc26dc50) at decode.c:332

This seems to be a problem in the base code, where we abort immediately
after serializing the changes; in that case, the changes list will be
empty. I think you can try to reproduce it via the debugger, or by
hacking the code so that it serializes after every change: if you then
abort after one change, it should hit this problem.

I think you might need to kill the server after all changes are
serialized; otherwise a normal abort will hit ReorderBufferAbort,
which will remove your ReorderBufferTXN entry, and you will never hit
this case.
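
Putting these hints together, a reproduction recipe might look like
this (a minimal sketch against the 0001 patch, for testing only):

/* In ReorderBufferCheckMemoryLimit() (0001 patch), disable the early
 * exit so that every queued change is serialized immediately: */
#if 0							/* repro hack: spill after every change */
	/* bail out if we haven't exceeded the memory limit */
	if (rb->size < logical_decoding_work_mem * 1024L)
		return;
#endif

/* Then run a transaction (with a subtransaction, for Issue 2) on a
 * slot being decoded, and kill -9 the server before the abort record
 * is decoded, so that ReorderBufferAbort() never removes the
 * ReorderBufferTXN entry. After restart, ReorderBufferAbortOld()
 * should hit the assertions shown above. */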

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#123vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#120)
2 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Nov 4, 2019 at 3:46 PM vignesh C <vignesh21@gmail.com> wrote:

On Thu, Oct 24, 2019 at 7:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002.

I was wondering whether we have checked the code coverage after this
patch? Previously, the existing tests seem to be covering most parts
of the function ReorderBufferSerializeTXN [1]. After this patch, the
timing to call ReorderBufferSerializeTXN will change, so that might
impact the testing of the same. If it is already covered, then I
would like to either add a new test or extend existing test with the
help of new spill counters. If it is not getting covered, then we
need to think of extending the existing test or write a new test to
cover the function ReorderBufferSerializeTXN.

I have run the tests with coverage and found that
ReorderBufferSerializeTXN is not being hit.
The reason it is not being hit is the following check in
ReorderBufferCheckMemoryLimit:
/* bail out if we haven't exceeded the memory limit */
if (rb->size < logical_decoding_work_mem * 1024L)
return;
Previously the tests from contrib/test_decoding could hit the
ReorderBufferSerializeTXN function.
I'm checking if we can modify the test or add a new test to hit the
ReorderBufferSerializeTXN function.

I have made one change to the configuration file in the
contrib/test_decoding directory; with that, the coverage seems to be
fine. The coverage is almost the same as for the code before applying
the patch. I have attached the test change and the coverage report for
reference. The coverage report covers the core logical-work-memory
files, both for the base code and after applying the
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and
0002-Track-statistics-for-spilling patches.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Add-logical_decoding_work_mem-configuration-for-test.patch (text/x-patch)
From b86ddac054ad29ac8a48e1e49432f338ef8f947b Mon Sep 17 00:00:00 2001
From: vignesh <vignesh@localhost.localdomain>
Date: Wed, 6 Nov 2019 09:50:48 +0530
Subject: [PATCH] Add logical_decoding_work_mem configuration for test
 configuration.

Added logical_decoding_work_mem to test configuration, setting it
to minimum value so that all test paths are covered.
---
 contrib/test_decoding/logical.conf | 1 +
 1 file changed, 1 insertion(+)

diff --git a/contrib/test_decoding/logical.conf b/contrib/test_decoding/logical.conf
index 367f706..07c4d3d 100644
--- a/contrib/test_decoding/logical.conf
+++ b/contrib/test_decoding/logical.conf
@@ -1,2 +1,3 @@
 wal_level = logical
 max_replication_slots = 4
+logical_decoding_work_mem = 64kB
-- 
1.8.3.1

coverage.tar (application/x-tar)
#124vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#121)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Nov 4, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I think the patch should do the simplest thing possible, i.e. what it
does today. Otherwise we'll never get it committed.

I found a couple of crashes while reviewing and testing flushing of
open transaction data:

Thanks for doing these tests. However, I don't think these issues are
in any way related to this patch. They seem to be base-code issues
manifested by this patch. See my analysis below.

Issue 1:
#0 0x00007f22c5722337 in raise () from /lib64/libc.so.6
#1 0x00007f22c5723a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec5390 in ExceptionalCondition
(conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
"FailedAssertion",
fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3 0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
off=64) at ../../../../src/include/lib/ilist.h:458
#4 0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
oldestRunningXid=3834) at reorderbuffer.c:1966
#5 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
buf=0x7ffcbc26dc50) at decode.c:332

This seems to be a problem in the base code, where we abort immediately
after serializing the changes; in that case, the changes list will be
empty. I think you can try to reproduce it via the debugger, or by
hacking the code so that it serializes after every change: if you then
abort after one change, it should hit this problem.

Issue 2:
#0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
#1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec4e1d in ExceptionalCondition
(conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
errorType=0x10ea284 "FailedAssertion",
fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:3052
#4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:1318
#5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2b9d778) at reorderbuffer.c:1257
#6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
oldestRunningXid=3835) at reorderbuffer.c:1973

This again seems to be a problem with the base code, as we don't update
final_lsn for subtransactions during ReorderBufferAbortOld. This can
also be reproduced by hacking the code or via the debugger, in a
similar way as explained for the previous problem, with the difference
that a subtransaction must be involved in this case.

#7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
buf=0x7ffcbc74cc00) at decode.c:332
#8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
record=0x2b67990) at decode.c:121
#9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845

These failures come randomly.
I'm not able to reproduce this issue with a simple test case.

Yeah, it appears to be difficult to reproduce unless you hack the code
to serialize every change, or use a debugger to forcefully flush the
changes every time.

Thanks Amit for your analysis. I was able to reproduce the above issue
consistently by making some code changes and with the help of a
debugger. I made one change so that it flushes every time instead of
flushing after the buffer size exceeds logical_decoding_work_mem,
attached a debugger to one of the transactions, and called abort. When
the server restarts after the abort, this problem occurs consistently.
I could reproduce the issue with the base code as well, so it seems
this issue is not caused by the
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer patch and
exists in the base code. I will post the issue on -hackers with
details.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

#125Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#123)
2 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Nov 6, 2019 at 11:33 AM vignesh C <vignesh21@gmail.com> wrote:

I have made one change to the configuration file in the
contrib/test_decoding directory; with that, the coverage seems to be
fine. The coverage is almost the same as for the code before applying
the patch. I have attached the test change and the coverage report for
reference. The coverage report covers the core logical-work-memory
files, both for the base code and after applying the
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and
0002-Track-statistics-for-spilling patches.

Thanks, I have incorporated your test changes and modified the two
patches. Please see attached.

Changes:
---------------
1. In guc.c, we should include reorderbuffer.h, not logical.h, as
logical_decoding_work_mem is now defined in reorderbuffer.h.

2.
+ *   To limit the amount of memory used by decoded changes, we track memory
+ *   used at the reorder buffer level (i.e. total amount of memory), and for
+ *   each toplevel transaction. When the total amount of used memory exceeds
+ *   the limit, the toplevel transaction consuming the most memory is then
+ *   serialized to disk.

In the above comments, removed 'toplevel' as we track memory usage for
both toplevel and subtransactions.

3. There were still a few mentions of streaming which I have removed.

4. In the docs, the type for stats spill_* was integer whereas it
should be bigint.

5.
+UpdateSpillStats(LogicalDecodingContext *ctx)
+{
+ ReorderBuffer *rb = ctx->reorder;
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+
+ MyWalSnd->spillTxns = rb->spillTxns;
+ MyWalSnd->spillCount = rb->spillCount;
+ MyWalSnd->spillBytes = rb->spillBytes;
+
+ elog(WARNING, "UpdateSpillStats: updating stats %p %ld %ld %ld",
+ rb, rb->spillTxns, rb->spillCount, rb->spillBytes);

Changed the above elog to DEBUG1 as otherwise it was getting printed
very frequently. I think we can make it DEBUG2 if we want.

6. There was an extra space in rules.out due to which test was
failing. I have fixed it.

What do you think?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.patch (application/octet-stream)
From 4a00f6da1d7254151bf6eb27f55ba238471ac152 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 21 Oct 2019 16:59:17 +0530
Subject: [PATCH 1/2] Add logical_decoding_work_mem to limit ReorderBuffer
 memory usage

Instead of deciding to serialize a transaction merely based on the
number of changes in that xact (toplevel or subxact), this makes
the decisions based on amount of memory consumed by the changes.

The memory limit is defined by a new logical_decoding_work_mem GUC,
so for example we can do this

    SET logical_decoding_work_mem = '128kB'

to trigger very aggressive spilling. The minimum value is 64kB.

When adding a change to a transaction, we account for the size in
two places. Firstly, in the ReorderBuffer, which is then used to
decide if we reached the total memory limit. And secondly in the
transaction the change belongs to, so that we can pick the largest
transaction to evict (and serialize to disk).

We still use max_changes_in_memory when loading changes serialized
to disk. The trouble is we can't use the memory limit directly as
there might be multiple subxact serialized, we need to read all of
them but we don't know how many are there (and which subxact to
read first).

We do not serialize the ReorderBufferTXN entries, so if there is a
transaction with many subxacts, most memory may be in this type of
objects. Those records are not included in the memory accounting.

We also do not account for INTERNAL_TUPLECID changes, which are
kept in a separate list and not evicted from memory. Transactions
with many CTID changes may consume significant amounts of memory,
but we can't really do much about that.

The current eviction algorithm is very simple - the transaction is
picked merely by size, while it might be useful to also consider age
(LSN) of the changes for example. With the new Generational memory
allocator, evicting the oldest changes would make it more likely
the memory gets actually pfreed.

The logical_decoding_work_mem may be set either in postgresql.conf,
in which case it serves as the default for all publishers on that
instance, or when creating the subscription, using a work_mem
parameter in the WITH clause (specifies the number of kilobytes).
---
 contrib/test_decoding/logical.conf              |   1 +
 doc/src/sgml/config.sgml                        |  21 ++
 src/backend/replication/logical/reorderbuffer.c | 293 +++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                    |  13 ++
 src/backend/utils/misc/postgresql.conf.sample   |   1 +
 src/include/replication/reorderbuffer.h         |  16 ++
 6 files changed, 333 insertions(+), 12 deletions(-)

diff --git a/contrib/test_decoding/logical.conf b/contrib/test_decoding/logical.conf
index 367f706..07c4d3d 100644
--- a/contrib/test_decoding/logical.conf
+++ b/contrib/test_decoding/logical.conf
@@ -1,2 +1,3 @@
 wal_level = logical
 max_replication_slots = 4
+logical_decoding_work_mem = 64kB
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 46bc31d..04ef505 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1732,6 +1732,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk. This
+        limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 62e5424..3576d4e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -49,6 +49,34 @@
  *	  GenerationContext for the variable-length transaction data (allocated
  *	  and freed in groups with similar lifespan).
  *
+ *	  To limit the amount of memory used by decoded changes, we track memory
+ *	  used at the reorder buffer level (i.e. total amount of memory), and for
+ *	  each transaction. When the total amount of used memory exceeds the
+ *	  limit, the transaction consuming the most memory is then serialized to
+ *	  disk.
+ *
+ *	  Only decoded changes are evicted from memory (spilled to disk), not the
+ *	  transaction records. The number of toplevel transactions is limited,
+ *	  but a transaction with many subtransactions may still consume significant
+ *	  amounts of memory. The transaction records are fairly small, though, and
+ *	  are not included in the memory limit.
+ *
+ *	  The current eviction algorithm is very simple - the transaction is
+ *	  picked merely by size, while it might be useful to also consider age
+ *	  (LSN) of the changes for example. With the new Generational memory
+ *	  allocator, evicting the oldest changes would make it more likely the
+ *	  memory gets actually freed.
+ *
+ *	  We still rely on max_changes_in_memory when loading serialized changes
+ *	  back into memory. At that point we can't use the memory limit directly
+ *	  as we load the subxacts independently. One option to deal with this
+ *	  would be to count the subxacts, and allow each to allocate 1/N of the
+ *	  memory limit. That however does not seem very appealing, because with
+ *	  many subtransactions it may easily cause thrashing (short cycles of
+ *	  deserializing and applying very few changes). We probably should give
+ *	  a bit more memory to the oldest subtransactions, because it's likely
+ *	  the source for the next sequence of changes.
+ *
  * -------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -154,7 +182,8 @@ typedef struct ReorderBufferDiskChange
  * resource management here, but it's not entirely clear what that would look
  * like.
  */
-static const Size max_changes_in_memory = 4096;
+int			logical_decoding_work_mem;
+static const Size max_changes_in_memory = 4096; /* XXX for restore only */
 
 /* ---------------------------------------
  * primary reorderbuffer support routines
@@ -189,7 +218,7 @@ static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTX
  * Disk serialization support functions
  * ---------------------------------------
  */
-static void ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferCheckMemoryLimit(ReorderBuffer *rb);
 static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 										 int fd, ReorderBufferChange *change);
@@ -217,6 +246,14 @@ static void ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
 										  Relation relation, ReorderBufferChange *change);
 
+/*
+ * ---------------------------------------
+ * memory accounting
+ * ---------------------------------------
+ */
+static Size ReorderBufferChangeSize(ReorderBufferChange *change);
+static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
+								ReorderBufferChange *change, bool addition);
 
 /*
  * Allocate a new ReorderBuffer and clean out any old serialized state from
@@ -269,6 +306,7 @@ ReorderBufferAllocate(void)
 
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
+	buffer->size = 0;
 
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
@@ -374,6 +412,9 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 void
 ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 {
+	/* update memory accounting info */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
 	/* free contained data */
 	switch (change->action)
 	{
@@ -585,12 +626,18 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	change->lsn = lsn;
+	change->txn = txn;
+
 	Assert(InvalidXLogRecPtr != lsn);
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries++;
 	txn->nentries_mem++;
 
-	ReorderBufferCheckSerializeTXN(rb, txn);
+	/* update memory accounting information */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
+
+	/* check the memory limits and evict something if needed */
+	ReorderBufferCheckMemoryLimit(rb);
 }
 
 /*
@@ -1217,6 +1264,9 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
 		ReorderBufferReturnChange(rb, change);
 	}
 
@@ -1229,7 +1279,11 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferChange *change;
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
 		ReorderBufferReturnChange(rb, change);
 	}
 
@@ -2082,9 +2136,48 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferQueueChange(rb, xid, lsn, change);
 }
 
+/*
+ * Update the memory accounting info. We track memory used by the whole
+ * reorder buffer and the transaction containing the change.
+ */
+static void
+ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
+								ReorderBufferChange *change,
+								bool addition)
+{
+	Size		sz;
+
+	Assert(change->txn);
+
+	/*
+	 * Ignore tuple CID changes, because those are not evicted when
+	 * reaching memory limit. So we just don't count them, because it
+	 * might easily trigger a pointless attempt to spill.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+		return;
+
+	sz = ReorderBufferChangeSize(change);
+
+	if (addition)
+	{
+		change->txn->size += sz;
+		rb->size += sz;
+	}
+	else
+	{
+		Assert((rb->size >= sz) && (change->txn->size >= sz));
+		change->txn->size -= sz;
+		rb->size -= sz;
+	}
+}
 
 /*
  * Add new (relfilenode, tid) -> (cmin, cmax) mappings.
+ *
+ * We do not include this change type in memory accounting, because we
+ * keep CIDs in a separate list and do not evict them when reaching
+ * the memory limit.
  */
 void
 ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
@@ -2103,6 +2196,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->data.tuplecid.cmax = cmax;
 	change->data.tuplecid.combocid = combocid;
 	change->lsn = lsn;
+	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
@@ -2230,20 +2324,84 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 }
 
 /*
- * Check whether the transaction tx should spill its data to disk.
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ *
+ * XXX With many subtransactions this might be quite slow, because we'll have
+ * to walk through all of them. There are some options how we could improve
+ * that: (a) maintain some secondary structure with transactions sorted by
+ * amount of changes, (b) not looking for the entirely largest transaction,
+ * but e.g. for transaction using at least some fraction of the memory limit,
+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
+ * of the memory limit (e.g. 50%).
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)
+{
+	HASH_SEQ_STATUS hash_seq;
+	ReorderBufferTXNByIdEnt	*ent;
+	ReorderBufferTXN *largest = NULL;
+
+	hash_seq_init(&hash_seq, rb->by_txn);
+	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	{
+		ReorderBufferTXN *txn = ent->txn;
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
+ * Check whether the logical_decoding_work_mem limit was reached, and if yes
+ * pick the transaction to evict and spill the changes to disk.
+ *
+ * XXX At this point we select just a single (largest) transaction, but
+ * we might also adapt a more elaborate eviction strategy - for example
+ * evicting enough transactions to free certain fraction (e.g. 50%) of
+ * the memory limit.
  */
 static void
-ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
+	ReorderBufferTXN *txn;
+
+	/* bail out if we haven't exceeded the memory limit */
+	if (rb->size < logical_decoding_work_mem * 1024L)
+		return;
+
 	/*
-	 * TODO: improve accounting so we cheaply can take subtransactions into
-	 * account here.
+	 * Pick the largest transaction (or subtransaction) and evict it from
+	 * memory by serializing it to disk.
 	 */
-	if (txn->nentries_mem >= max_changes_in_memory)
-	{
-		ReorderBufferSerializeTXN(rb, txn);
-		Assert(txn->nentries_mem == 0);
-	}
+	txn = ReorderBufferLargestTXN(rb);
+
+	ReorderBufferSerializeTXN(rb, txn);
+
+	/*
+	 * After eviction, the transaction should have no entries in memory, and
+	 * should use 0 bytes for changes.
+	 */
+	Assert(txn->size == 0);
+	Assert(txn->nentries_mem == 0);
+
+	/*
+	 * And furthermore, evicting the transaction should get us below the
+	 * memory limit again - it is not possible that we're still exceeding the
+	 * memory limit after evicting the transaction.
+	 *
+	 * This follows from the simple fact that the selected transaction is at
+	 * least as large as the most recent change (which caused us to go over
+	 * the memory limit). So by evicting it we're definitely back below the
+	 * memory limit.
+	 */
+	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
 /*
@@ -2513,6 +2671,84 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 }
 
 /*
+ * Size of a change in memory.
+ */
+static Size
+ReorderBufferChangeSize(ReorderBufferChange *change)
+{
+	Size		sz = sizeof(ReorderBufferChange);
+
+	switch (change->action)
+	{
+			/* fall through these, they're all similar enough */
+		case REORDER_BUFFER_CHANGE_INSERT:
+		case REORDER_BUFFER_CHANGE_UPDATE:
+		case REORDER_BUFFER_CHANGE_DELETE:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+			{
+				ReorderBufferTupleBuf *oldtup,
+						   *newtup;
+				Size		oldlen = 0;
+				Size		newlen = 0;
+
+				oldtup = change->data.tp.oldtuple;
+				newtup = change->data.tp.newtuple;
+
+				if (oldtup)
+				{
+					sz += sizeof(HeapTupleData);
+					oldlen = oldtup->tuple.t_len;
+					sz += oldlen;
+				}
+
+				if (newtup)
+				{
+					sz += sizeof(HeapTupleData);
+					newlen = newtup->tuple.t_len;
+					sz += newlen;
+				}
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_MESSAGE:
+			{
+				Size		prefix_size = strlen(change->data.msg.prefix) + 1;
+
+				sz += prefix_size + change->data.msg.message_size +
+					sizeof(Size) + sizeof(Size);
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+			{
+				Snapshot	snap;
+
+				snap = change->data.snapshot;
+
+				sz += sizeof(SnapshotData) +
+					sizeof(TransactionId) * snap->xcnt +
+					sizeof(TransactionId) * snap->subxcnt;
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			{
+				sz += sizeof(Oid) * change->data.truncate.nrelids;
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+			/* ReorderBufferChange contains everything important */
+			break;
+	}
+
+	return sz;
+}
+
+
+/*
  * Restore a number of changes spilled to disk back into memory.
  */
 static Size
@@ -2784,6 +3020,16 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries_mem++;
+
+	/*
+	 * Update memory accounting for the restored change.  We need to do this
+	 * although we don't check the memory limit when restoring the changes in
+	 * this branch (we only do that when initially queueing the changes after
+	 * decoding), because we will release the changes later, and that will
+	 * update the accounting too (subtracting the size from the counters).
+	 * And we don't want to underflow there.
+	 */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
 }
 
 /*
@@ -3003,6 +3249,19 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
  *
  * We cannot replace unchanged toast tuples though, so those will still point
  * to on-disk toast data.
+ *
+ * While updating the existing change with detoasted tuple data, we need to
+ * update the memory accounting info, because the change size will differ.
+ * Otherwise the accounting may get out of sync, triggering serialization
+ * at unexpected times.
+ *
+ * We simply subtract size of the change before rejiggering the tuple, and
+ * then adding the new size. This makes it look like the change was removed
+ * and then added back, except it only tweaks the accounting info.
+ *
+ * In particular it can't trigger serialization, which would be pointless
+ * anyway as it happens during commit processing right before handing
+ * the change to the output plugin.
  */
 static void
 ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3023,6 +3282,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	if (txn->toast_hash == NULL)
 		return;
 
+	/*
+	 * We're going to modify the size of the change, so to make sure the
+	 * accounting is correct we'll make it look like we're removing the
+	 * change now (with the old size), and then re-add it at the end.
+	 */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
 	oldcontext = MemoryContextSwitchTo(rb->context);
 
 	/* we should only have toast tuples in an INSERT or UPDATE */
@@ -3172,6 +3438,9 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	pfree(isnull);
 
 	MemoryContextSwitchTo(oldcontext);
+
+	/* now add the change back, with the correct size */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
 }
 
 /*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e84c8cc..d899995 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -66,6 +66,7 @@
 #include "postmaster/syslogger.h"
 #include "postmaster/walwriter.h"
 #include "replication/logicallauncher.h"
+#include "replication/reorderbuffer.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
 #include "replication/walreceiver.h"
@@ -2253,6 +2254,18 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
+			gettext_noop("Sets the maximum memory to be used for logical decoding."),
+			gettext_noop("This much memory can be used by each internal "
+						 "reorder buffer before spilling to disk."),
+			GUC_UNIT_KB
+		},
+		&logical_decoding_work_mem,
+		65536, 64, MAX_KILOBYTES,
+		NULL, NULL, NULL
+	},
+
 	/*
 	 * We use the hopefully-safely-small value of 100kB as the compiled-in
 	 * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index be02a76..46a06ff 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -130,6 +130,7 @@
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
+#logical_decoding_work_mem = 64MB	# min 64kB
 #max_stack_depth = 2MB			# min 100kB
 #shared_memory_type = mmap		# the default is the first option
 					# supported by the operating system:
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4c06a78..4dcef80 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -17,6 +17,8 @@
 #include "utils/snapshot.h"
 #include "utils/timestamp.h"
 
+extern PGDLLIMPORT	int	logical_decoding_work_mem;
+
 /* an individual tuple, stored in one chunk of memory */
 typedef struct ReorderBufferTupleBuf
 {
@@ -63,6 +65,9 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
+/* forward declaration */
+struct ReorderBufferTXN;
+
 /*
  * a single 'change', can be an insert (with one tuple), an update (old, new),
  * or a delete (old).
@@ -77,6 +82,9 @@ typedef struct ReorderBufferChange
 	/* The type of change. */
 	enum ReorderBufferChangeType action;
 
+	/* Transaction this change belongs to. */
+	struct ReorderBufferTXN *txn;
+
 	RepOriginId origin_id;
 
 	/*
@@ -286,6 +294,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * Size of this transaction (changes currently in memory, in bytes).
+	 */
+	Size		size;
+
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -386,6 +399,9 @@ struct ReorderBuffer
 	/* buffer for disk<->memory conversions */
 	char	   *outbuf;
 	Size		outbufsize;
+
+	/* memory accounting */
+	Size		size;
 };
 
 
-- 
1.8.3.1

0002-Track-statistics-for-spilling.patch (application/octet-stream)
From d2b8ae2ab086cd91cccea79e6c5decaaa66fcfdf Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 11 Oct 2019 09:07:41 +0530
Subject: [PATCH 2/2] Track statistics for spilling

---
 doc/src/sgml/monitoring.sgml                    | 20 ++++++++++++
 src/backend/catalog/system_views.sql            |  5 ++-
 src/backend/replication/logical/reorderbuffer.c | 12 +++++++
 src/backend/replication/walsender.c             | 42 +++++++++++++++++++++++--
 src/include/catalog/pg_proc.dat                 |  6 ++--
 src/include/replication/reorderbuffer.h         | 11 +++++++
 src/include/replication/walsender_private.h     |  5 +++
 src/test/regress/expected/rules.out             |  7 +++--
 8 files changed, 100 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d18b271..eea65ee 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1971,6 +1971,26 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
      <entry><type>timestamp with time zone</type></entry>
      <entry>Send time of last reply message received from standby server</entry>
     </row>
+    <row>
+     <entry><structfield>spill_bytes</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Amount of decoded transaction data spilled to disk.</entry>
+    </row>
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_decoding_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.</entry>
+    </row>
+    <row>
+     <entry><structfield>spill_count</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of times transactions were spilled to disk. Transactions
+      may get spilled repeatedly, and this counter gets incremented on every
+      such invocation.</entry>
+    </row>
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 9fe4a47..2ee2a06 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -776,7 +776,10 @@ CREATE VIEW pg_stat_replication AS
             W.replay_lag,
             W.sync_priority,
             W.sync_state,
-            W.reply_time
+            W.reply_time,
+            W.spill_txns,
+            W.spill_count,
+            W.spill_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3576d4e..a0c66e6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -308,6 +308,10 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	buffer->spillCount = 0;
+	buffer->spillTxns = 0;
+	buffer->spillBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -2415,6 +2419,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	int			fd = -1;
 	XLogSegNo	curOpenSegNo = 0;
 	Size		spilled = 0;
+	Size		size = txn->size;
 
 	elog(DEBUG2, "spill %u changes in XID %u to disk",
 		 (uint32) txn->nentries_mem, txn->xid);
@@ -2473,6 +2478,13 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		spilled++;
 	}
 
+	/* update the statistics */
+	rb->spillCount += 1;
+	rb->spillBytes += size;
+
+	/* Don't count a transaction that was already serialized. */
+	rb->spillTxns += txn->serialized ? 0 : 1;
+
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7f56715..d7ef634 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -248,6 +248,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
+static void UpdateSpillStats(LogicalDecodingContext *ctx);
 static void XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count);
 
 
@@ -1261,7 +1262,8 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
 /*
  * LogicalDecodingContext 'update_progress' callback.
  *
- * Write the current position to the lag tracker (see XLogSendPhysical).
+ * Write the current position to the lag tracker (see XLogSendPhysical),
+ * and update the spill statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1280,6 +1282,11 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 
 	LagTrackerWrite(lsn, now);
 	sendTime = now;
+
+	/*
+	 * Update statistics about transactions that spilled to disk.
+	 */
+	UpdateSpillStats(ctx);
 }
 
 /*
@@ -2318,6 +2325,9 @@ InitWalSenderSlot(void)
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			walsnd->replyTime = 0;
+			walsnd->spillTxns = 0;
+			walsnd->spillCount = 0;
+			walsnd->spillBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3219,7 +3229,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	12
+#define PG_STAT_GET_WAL_SENDERS_COLS	15
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3274,6 +3284,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int			pid;
 		WalSndState state;
 		TimestampTz replyTime;
+		int64		spillTxns;
+		int64		spillCount;
+		int64		spillBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3294,6 +3307,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		replyTime = walsnd->replyTime;
+		spillTxns = walsnd->spillTxns;
+		spillCount = walsnd->spillCount;
+		spillBytes = walsnd->spillBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3375,6 +3391,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[11] = true;
 			else
 				values[11] = TimestampTzGetDatum(replyTime);
+
+			/* spill to disk */
+			values[12] = Int64GetDatum(spillTxns);
+			values[13] = Int64GetDatum(spillCount);
+			values[14] = Int64GetDatum(spillBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3611,3 +3632,20 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+static void
+UpdateSpillStats(LogicalDecodingContext *ctx)
+{
+	ReorderBuffer *rb = ctx->reorder;
+
+	SpinLockAcquire(&MyWalSnd->mutex);
+
+	MyWalSnd->spillTxns = rb->spillTxns;
+	MyWalSnd->spillCount = rb->spillCount;
+	MyWalSnd->spillBytes = rb->spillBytes;
+
+	elog(DEBUG1, "UpdateSpillStats: updating stats %p %ld %ld %ld",
+		 rb, rb->spillTxns, rb->spillCount, rb->spillBytes);
+
+	SpinLockRelease(&MyWalSnd->mutex);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 58ea5b9..fa0a2a1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5166,9 +5166,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4dcef80..ba7f9f0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -402,6 +402,17 @@ struct ReorderBuffer
 
 	/* memory accounting */
 	Size		size;
+
+	/*
+	 * Statistics about transactions spilled to disk.
+	 *
+	 * A single transaction may be spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 */
+	int64	spillCount;		/* spill-to-disk invocation counter */
+	int64	spillTxns;		/* number of transactions spilled to disk  */
+	int64	spillBytes;		/* amount of data spilled to disk */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0dd6d1c..a6b3205 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -80,6 +80,11 @@ typedef struct WalSnd
 	 * Timestamp of the last message received from standby.
 	 */
 	TimestampTz replyTime;
+
+	/* Statistics for transactions spilled to disk. */
+	int64		spillTxns;
+	int64		spillCount;
+	int64		spillBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 210e9cd..4f1998e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1951,9 +1951,12 @@ pg_stat_replication| SELECT s.pid,
     w.replay_lag,
     w.sync_priority,
     w.sync_state,
-    w.reply_time
+    w.reply_time,
+    w.spill_txns,
+    w.spill_count,
+    w.spill_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1
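
As a quick usage sketch (not part of the patch itself), the new columns can
be read from pg_stat_replication like any other field once a walsender is
active; the counters are cumulative over the life of the walsender:

    -- inspect spill statistics per walsender
    SELECT pid, spill_txns, spill_count,
           pg_size_pretty(spill_bytes) AS spilled
    FROM pg_stat_replication;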

#126Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#125)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Nov 7, 2019 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 6, 2019 at 11:33 AM vignesh C <vignesh21@gmail.com> wrote:

I have made one change to the configuration file in
contrib/test_decoding directory, with that the coverage seems to be
fine. I have seen that the coverage is almost like the code before
applying the patch. I have attached the test change and the coverage
report for reference. Coverage report includes the core logical work
memory files for base code and by applying
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and
0002-Track-statistics-for-spilling patches.

Thanks, I have incorporated your test changes and modified the two
patches. Please see attached.

Changes:
---------------
1. In guc.c, we should include reorderbuffer.h, not logical.h, as
logical_decoding_work_mem is now defined there.

Yeah Right.

2.
+ *   To limit the amount of memory used by decoded changes, we track memory
+ *   used at the reorder buffer level (i.e. total amount of memory), and for
+ *   each toplevel transaction. When the total amount of used memory exceeds
+ *   the limit, the toplevel transaction consuming the most memory is then
+ *   serialized to disk.

In the above comments, I removed 'toplevel', as we track memory usage
for both toplevel transactions and subtransactions.

Correct.

3. There were still a few mentions of streaming which I have removed.

ok

4. In the docs, the type for stats spill_* was integer whereas it
should be bigint.

ok

5.
+UpdateSpillStats(LogicalDecodingContext *ctx)
+{
+ ReorderBuffer *rb = ctx->reorder;
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+
+ MyWalSnd->spillTxns = rb->spillTxns;
+ MyWalSnd->spillCount = rb->spillCount;
+ MyWalSnd->spillBytes = rb->spillBytes;
+
+ elog(WARNING, "UpdateSpillStats: updating stats %p %ld %ld %ld",
+ rb, rb->spillTxns, rb->spillCount, rb->spillBytes);

Changed the above elog to DEBUG1 as otherwise it was getting printed
very frequently. I think we can make it DEBUG2 if we want.

Yeah, it should not be WARNING.

6. There was an extra space in rules.out due to which test was
failing. I have fixed it.

My bad. I introduced it while separating out the changes for the spilling.

What do you think?

I have reviewed your changes and looks fine to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#127Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#126)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Nov 7, 2019 at 3:50 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Nov 7, 2019 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

What do you think?

I have reviewed your changes and looks fine to me.

Okay, thanks. I am also happy with the two patches I have posted in
my last email [1].

Tomas, would you like to take a look at those patches and commit them
if you are happy or would you like me to do the same?

Some notes before commit:
--------------------------------------
1.
Commit message need to be changed for the first patch
-------------------------------------------------------------------------
A.

The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this

SET logical_decoding_work_mem = '128kB'

to trigger very aggressive streaming. The minimum value is 64kB.

I think this patch doesn't contain streaming, so we either need to
reword it or remove it.

B.

The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the
subscription, using a work_mem parameter in the WITH clause (specifies number of kilobytes).

We need to reword this as we have decided to remove the setting from
the subscription side as of now.

2. I think we can change the message level in UpdateSpillStats() to DEBUG2.

3. I think we need catversion bump for the second patch.

4. I think we can combine both patches and commit as one patch, but it
is okay to commit them separately as well.

[1]: /messages/by-id/CAA4eK1Kdmi6VVguKEHV6Ho2isCPVFdQtt0WLsK10fiuE59_0Yw@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#128Alexey Kondratov
a.kondratov@postgrespro.ru
In reply to: Kuntal Ghosh (#119)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 04.11.2019 13:05, Kuntal Ghosh wrote:

On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

So your result shows that with "streaming on", performance is
degrading? By any chance did you try to see where is the bottleneck?

Right. But, as we increase the logical_decoding_work_mem, the
performance improves. I've not analyzed the bottleneck yet. I'm
looking into the same.

My guess is that 64 kB is just too small a value. In the table schema used
for the tests, every row takes at least 24 bytes to store its column values.
Thus, with this logical_decoding_work_mem value the limit should be hit
after roughly 2500+ rows, i.e. about 400 times during a transaction of
1000000 rows.
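
A back-of-the-envelope check of that estimate: 64 kB is 65536 bytes, so at
roughly 24 bytes per row

    65536 / 24     ~ 2730 rows buffered before the limit trips, and
    1000000 / 2730 ~  366 spill (or streaming) cycles per transaction,

which is consistent with the numbers above.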

That is just too frequent, given that ReorderBufferStreamTXN involves a
whole bunch of logic, e.g. it always starts an internal transaction:

/*
 * Decoding needs access to syscaches et al., which in turn use
 * heavyweight locks and such. Thus we need to have enough state around to
 * keep track of those.  The easiest way is to simply use a transaction
 * internally.  That also allows us to easily enforce that nothing writes
 * to the database by checking for xid assignments. ...
 */

Also, it issues separate stream_start/stop messages around each streamed
transaction chunk. So if streaming starts and stops too frequently, it adds
extra overhead and may even interfere with the current in-progress
transaction.

If I understand correctly, this is rather expected with too-small values
of logical_decoding_work_mem. It could probably be optimized, but I am not
sure that it is worth doing right now.

Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

#129Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Alexey Kondratov (#128)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Nov 12, 2019 at 4:12 PM Alexey Kondratov
<a.kondratov@postgrespro.ru> wrote:

On 04.11.2019 13:05, Kuntal Ghosh wrote:

On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

So your result shows that with "streaming on", performance is
degrading? By any chance did you try to see where is the bottleneck?

Right. But, as we increase the logical_decoding_work_mem, the
performance improves. I've not analyzed the bottleneck yet. I'm
looking into the same.

My guess is that 64 kB is just too small a value. In the table schema used
for the tests, every row takes at least 24 bytes to store its column values.
Thus, with this logical_decoding_work_mem value the limit should be hit
after roughly 2500+ rows, i.e. about 400 times during a transaction of
1000000 rows.

That is just too frequent, given that ReorderBufferStreamTXN involves a
whole bunch of logic, e.g. it always starts an internal transaction:

/*
* Decoding needs access to syscaches et al., which in turn use
* heavyweight locks and such. Thus we need to have enough state around to
* keep track of those. The easiest way is to simply use a transaction
* internally. That also allows us to easily enforce that nothing writes
* to the database by checking for xid assignments. ...
*/

Also, it issues separate stream_start/stop messages around each streamed
transaction chunk. So if streaming starts and stops too frequently, it adds
extra overhead and may even interfere with the current in-progress
transaction.

Yeah, I've also found the same. With each stream_start/stop message, it
writes 1 byte of checksum and 4 bytes for the number of subtransactions,
which increases the write amplification significantly.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#130Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#94)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

As mentioned by me a few days back that the first patch in this series
is ready to go [1]/messages/by-id/CAA4eK1JM0=RwODZQrn8DTQ3dbcb9xwKDdHCmVOryAk_xoKf9Nw@mail.gmail.com (I am hoping Tomas will pick it up), so I have
started the review of other patches

Review/Questions on 0002-Immediately-WAL-log-assignments.patch
-------------------------------------------------------------------------------------------------
1. This patch adds the top_xid to WAL the first time WAL is written
for a subtransaction XID, so that the changes of an in-progress
transaction can be decoded correctly. This patch also removes logging
and applying WAL for XLOG_XACT_ASSIGNMENT, which might have some
effect: on replay, that record prunes KnownAssignedXids to prevent
overflow of that array. See comments in procarray.c
(KnownAssignedTransactionIds sub-module). Can you please explain how,
after removing the WAL for XLOG_XACT_ASSIGNMENT, we will handle that,
or am I missing something and there is no impact?

2.
+#define XLOG_INCLUDE_INVALS 0x08 /* include invalidations */

This doesn't seem to be used in this patch.

[1]: /messages/by-id/CAA4eK1JM0=RwODZQrn8DTQ3dbcb9xwKDdHCmVOryAk_xoKf9Nw@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#131Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#130)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

As I mentioned a few days back, the first patch in this series is
ready to go [1] (I am hoping Tomas will pick it up), so I have
started reviewing the other patches.

Review/Questions on 0002-Immediately-WAL-log-assignments.patch
-------------------------------------------------------------------------------------------------
1. This patch adds the top_xid to WAL the first time WAL is written
for a subtransaction XID, so that the changes of an in-progress
transaction can be decoded correctly. This patch also removes logging
and applying WAL for XLOG_XACT_ASSIGNMENT, which might have some
effect: on replay, that record prunes KnownAssignedXids to prevent
overflow of that array. See comments in procarray.c
(KnownAssignedTransactionIds sub-module). Can you please explain how,
after removing the WAL for XLOG_XACT_ASSIGNMENT, we will handle that,
or am I missing something and there is no impact?

It seems like a problem to me as well. One option could be that,
since we now add the top transaction id to the first WAL record of a
subtransaction, we could directly update pg_subtrans, avoid adding
the subtransaction id to KnownAssignedXids, and mark it as
lastOverflowedXid. But I don't think we should go in that direction,
as it would hurt the performance of visibility checks on the hot
standby. Let's see what Tomas has in mind.

2.
+#define XLOG_INCLUDE_INVALS 0x08 /* include invalidations */

This doesn't seem to be used in this patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#132Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#131)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Nov 14, 2019 at 9:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

As I mentioned a few days back, the first patch in this series is
ready to go [1] (I am hoping Tomas will pick it up), so I have
started reviewing the other patches.

Review/Questions on 0002-Immediately-WAL-log-assignments.patch
-------------------------------------------------------------------------------------------------
1. This patch adds the top_xid to WAL the first time WAL is written
for a subtransaction XID, so that the changes of an in-progress
transaction can be decoded correctly. This patch also removes logging
and applying WAL for XLOG_XACT_ASSIGNMENT, which might have some
effect: on replay, that record prunes KnownAssignedXids to prevent
overflow of that array. See comments in procarray.c
(KnownAssignedTransactionIds sub-module). Can you please explain how,
after removing the WAL for XLOG_XACT_ASSIGNMENT, we will handle that,
or am I missing something and there is no impact?

It seems like a problem to me as well. One option could be that,
since we now add the top transaction id to the first WAL record of a
subtransaction, we could directly update pg_subtrans, avoid adding
the subtransaction id to KnownAssignedXids, and mark it as
lastOverflowedXid.

Hmm, I am not sure if we can do that easily because I think in
RecordKnownAssignedTransactionIds, we add those based on the gap via
KnownAssignedXidsAdd and only remove them later while applying WAL for
XLOG_XACT_ASSIGNMENT. I think if we really want to go in this
direction then for each WAL record we need to check if it has
XLR_BLOCK_ID_TOPLEVEL_XID set and then call function
ProcArrayApplyXidAssignment() with the required information. I think
this line of attack has WAL overhead both on master whenever
subtransactions are involved and also on hot-standby for doing the
work for each subtransaction separately. The WAL apply needs to
acquire and release PROCArrayLock in exclusive mode for each
subtransaction whereas now it does it once for
PGPROC_MAX_CACHED_SUBXIDS number of subtransactions which can conflict
with queries running on standby.

The other idea could be that we keep the current XLOG_XACT_ASSIGNMENT
mechanism (WAL logging and applying it on hot standby) as it is, and
additionally log top_xid the first time WAL is written for a
subtransaction, but only when wal_level >= WAL_LEVEL_LOGICAL. Then use
that for logical decoding. The advantage of this approach is that we
incur the overhead of the additional transaction id only when it is
required, and in particular not with the default server configuration.
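
In code form, the second idea might look roughly like the sketch below at
the point where a subtransaction's first WAL record is assembled. This is
only an illustration: the top_xid_included flag is hypothetical, while
XLogLogicalInfoActive(), IsSubTransaction(), GetTopTransactionIdIfAny()
and XLR_BLOCK_ID_TOPLEVEL_XID are names from the tree or the patch:

    /* Sketch: attach the toplevel xid only when logical decoding is on. */
    if (XLogLogicalInfoActive() && IsSubTransaction() && !top_xid_included)
    {
        /*
         * Include GetTopTransactionIdIfAny() in the record, e.g. as a
         * per-record block like XLR_BLOCK_ID_TOPLEVEL_XID, and remember
         * that we did, so it is not repeated for every record.
         */
        top_xid_included = true;
    }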

Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#133Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#132)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Nov 14, 2019 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 14, 2019 at 9:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

As I mentioned a few days back, the first patch in this series is
ready to go [1] (I am hoping Tomas will pick it up), so I have
started reviewing the other patches.

Review/Questions on 0002-Immediately-WAL-log-assignments.patch
-------------------------------------------------------------------------------------------------
1. This patch adds the top_xid to WAL the first time WAL is written
for a subtransaction XID, so that the changes of an in-progress
transaction can be decoded correctly. This patch also removes logging
and applying WAL for XLOG_XACT_ASSIGNMENT, which might have some
effect: on replay, that record prunes KnownAssignedXids to prevent
overflow of that array. See comments in procarray.c
(KnownAssignedTransactionIds sub-module). Can you please explain how,
after removing the WAL for XLOG_XACT_ASSIGNMENT, we will handle that,
or am I missing something and there is no impact?

It seems like a problem to me as well. One option could be that,
since we now add the top transaction id to the first WAL record of a
subtransaction, we could directly update pg_subtrans, avoid adding
the subtransaction id to KnownAssignedXids, and mark it as
lastOverflowedXid.

Hmm, I am not sure if we can do that easily because I think in
RecordKnownAssignedTransactionIds, we add those based on the gap via
KnownAssignedXidsAdd and only remove them later while applying WAL for
XLOG_XACT_ASSIGNMENT. I think if we really want to go in this
direction then for each WAL record we need to check if it has
XLR_BLOCK_ID_TOPLEVEL_XID set and then call function
ProcArrayApplyXidAssignment() with the required information. I think
this line of attack has WAL overhead both on master whenever
subtransactions are involved and also on hot-standby for doing the
work for each subtransaction separately. The WAL apply needs to
acquire and release PROCArrayLock in exclusive mode for each
subtransaction whereas now it does it once for
PGPROC_MAX_CACHED_SUBXIDS number of subtransactions which can conflict
with queries running on standby.

Right

The other idea could be that we keep the current XLOG_XACT_ASSIGNMENT
mechanism (WAL logging and applying it on hot standby) as it is, and
additionally log top_xid the first time WAL is written for a
subtransaction, but only when wal_level >= WAL_LEVEL_LOGICAL. Then use
that for logical decoding. The advantage of this approach is that we
incur the overhead of the additional transaction id only when it is
required, and in particular not with the default server configuration.

Thoughts?

The idea seems reasonable to me.

Apart from this, I have another question in
0003-Issue-individual-invalidations-with-wal_level-logical.patch

@@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId)
 {
  AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
     dbId, relId);
+
+ /* Issue an invalidation WAL record (when wal_level=logical) */
+ if (XLogLogicalInfoActive())
+ {
+ SharedInvalidationMessage msg;
+
+ msg.sn.id = SHAREDINVALSNAPSHOT_ID;
+ msg.sn.dbId = dbId;
+ msg.sn.relId = relId;
+
+ LogLogicalInvalidations(1, &msg, false);
+ }
 }

I am not sure why we need to explicitly WAL-log the snapshot
invalidation: it is logged for invalidating the catalog snapshot, and
for logical decoding we use HistoricSnapshot, not the catalog
snapshot. Am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#134Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#133)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Nov 14, 2019 at 3:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Apart from this, I have another question in
0003-Issue-individual-invalidations-with-wal_level-logical.patch

@@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId)
{
AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
dbId, relId);
+
+ /* Issue an invalidation WAL record (when wal_level=logical) */
+ if (XLogLogicalInfoActive())
+ {
+ SharedInvalidationMessage msg;
+
+ msg.sn.id = SHAREDINVALSNAPSHOT_ID;
+ msg.sn.dbId = dbId;
+ msg.sn.relId = relId;
+
+ LogLogicalInvalidations(1, &msg, false);
+ }
}

I am not sure why we need to explicitly WAL-log the snapshot
invalidation: it is logged for invalidating the catalog snapshot, and
for logical decoding we use HistoricSnapshot, not the catalog
snapshot.

I think it has been logged because, even without this patch, we log
all the invalidation messages at commit time and process them during
decoding. However, I agree that this particular invalidation message
is not required for logical decoding, for the reason you mentioned.
Since we are explicitly logging invalidations, it is better to avoid
this one if we can.

Few other comments on this patch:
1.
+ case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+ /*
+ * Execute the invalidation message locally.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */
+ LocalExecuteInvalidationMessage(&change->data.inval.msg);
+ break;

Here, why are we executing messages individually? Can't we just
follow what we do in DecodeCommit, which is to record the
invalidations in ReorderBufferTXN as we encounter them and then
execute them on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID? Is
there a reason why we don't do ReorderBufferXidSetCatalogChanges when
we receive any invalidation message?

2.
@@ -3025,8 +3073,8 @@ ReorderBufferRestoreChange(ReorderBuffer *rb,
ReorderBufferTXN *txn,
  * although we don't check the memory limit when restoring the changes in
  * this branch (we only do that when initially queueing the changes after
  * decoding), because we will release the changes later, and that will
- * update the accounting too (subtracting the size from the counters).
- * And we don't want to underflow there.
+ * update the accounting too (subtracting the size from the counters). And
+ * we don't want to underflow there.
  */

This seems like an unrelated change.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#135Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#134)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 14, 2019 at 3:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Apart from this, I have another question in
0003-Issue-individual-invalidations-with-wal_level-logical.patch

@@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId)
{
AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
dbId, relId);
+
+ /* Issue an invalidation WAL record (when wal_level=logical) */
+ if (XLogLogicalInfoActive())
+ {
+ SharedInvalidationMessage msg;
+
+ msg.sn.id = SHAREDINVALSNAPSHOT_ID;
+ msg.sn.dbId = dbId;
+ msg.sn.relId = relId;
+
+ LogLogicalInvalidations(1, &msg, false);
+ }
}

I am not sure why we need to explicitly WAL-log the snapshot
invalidation: it is logged for invalidating the catalog snapshot, and
for logical decoding we use HistoricSnapshot, not the catalog
snapshot.

I think it has been logged because, even without this patch, we log
all the invalidation messages at commit time and process them during
decoding. However, I agree that this particular invalidation message
is not required for logical decoding, for the reason you mentioned.
Since we are explicitly logging invalidations, it is better to avoid
this one if we can.

Ok

Few other comments on this patch:
1.
+ case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+ /*
+ * Execute the invalidation message locally.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */
+ LocalExecuteInvalidationMessage(&change->data.inval.msg);
+ break;

Here, why are we executing messages individually? Can't we just
follow what we do in DecodeCommit, which is to record the
invalidations in ReorderBufferTXN as we encounter them and then
execute them on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID? Is
there a reason why we don't do ReorderBufferXidSetCatalogChanges when
we receive any invalidation message?

IMHO, the reason is that in DecodeCommit we get all the invalidations
at once, so at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID we don't know
which invalidation messages to execute, and to be safe we have to
execute all of them. But since we are logging each invalidation
individually, we know exactly at this stage which cache to invalidate.
So it is better to invalidate only the required caches, not all of
them.
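
For contrast, a rough sketch of the two execution styles being discussed,
using only calls that already appear in the patch:

    /* commit-time style: execute everything accumulated in the txn */
    ReorderBufferExecuteInvalidations(rb, txn);

    /* per-change style: execute just the single logged message */
    LocalExecuteInvalidationMessage(&change->data.inval.msg);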

2.
@@ -3025,8 +3073,8 @@ ReorderBufferRestoreChange(ReorderBuffer *rb,
ReorderBufferTXN *txn,
* although we don't check the memory limit when restoring the changes in
* this branch (we only do that when initially queueing the changes after
* decoding), because we will release the changes later, and that will
- * update the accounting too (subtracting the size from the counters).
- * And we don't want to underflow there.
+ * update the accounting too (subtracting the size from the counters). And
+ * we don't want to underflow there.
*/

This seems like an unrelated change.

Indeed.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#136Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#135)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few other comments on this patch:
1.
+ case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+ /*
+ * Execute the invalidation message locally.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */
+ LocalExecuteInvalidationMessage(&change->data.inval.msg);
+ break;

Here, why are we executing messages individually? Can't we just
follow what we do in DecodeCommit, which is to record the
invalidations in ReorderBufferTXN as we encounter them and then
execute them on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID? Is
there a reason why we don't do ReorderBufferXidSetCatalogChanges when
we receive any invalidation message?

IMHO, the reason is that in DecodeCommit we get all the invalidations
at once, so at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID we don't know
which invalidation messages to execute, and to be safe we have to
execute all of them. But since we are logging each invalidation
individually, we know exactly at this stage which cache to invalidate.
So it is better to invalidate only the required caches, not all of
them.

In that case, invalidations can be processed multiple times: first
when the individual WAL records for invalidations are processed, and
then later at commit time, when we accumulate all invalidation
messages and execute them for REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.
After this patch, can we avoid executing invalidations from the other
places, which would also include executing them as part of
XLOG_INVALIDATIONS processing?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#137Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#127)
2 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Some notes before commit:
--------------------------------------
1.
Commit message need to be changed for the first patch
-------------------------------------------------------------------------
A.

The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this

SET logical_decoding_work_mem = '128kB'

to trigger very aggressive streaming. The minimum value is 64kB.

I think this patch doesn't contain streaming, so we either need to
reword it or remove it.

B.

The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the
subscription, using a work_mem parameter in the WITH clause (specifies number of kilobytes).

We need to reword this as we have decided to remove the setting from
the subscription side as of now.

2. I think we can change the message level in UpdateSpillStats() to DEBUG2.

I have made these modifications and additionally ran pgindent.

4. I think we can combine both patches and commit as one patch, but it
is okay to commit them separately as well.

I am not sure if this is a good idea, so I have kept them separate.

Tomas, do let me know if you want to commit these or if you have any
comments; otherwise, I will commit them on Tuesday (19-Nov).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.nov16.patch
From fe0f6a2ab4c7d00639e5b3c977837ab695fac1b1 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 16 Nov 2019 17:49:33 +0530
Subject: [PATCH 1/2] Add logical_decoding_work_mem to limit ReorderBuffer
 memory usage.

Instead of deciding to serialize a transaction merely based on the
number of changes in that xact (toplevel or subxact), this makes
the decisions based on amount of memory consumed by the changes.

The memory limit is defined by a new logical_decoding_work_mem GUC,
so for example we can do this

    SET logical_decoding_work_mem = '128kB'

to reduce the memory usage of walsenders, or set a higher value to
reduce disk writes. The minimum value is 64kB.

When adding a change to a transaction, we account for the size in
two places. Firstly, in the ReorderBuffer, which is then used to
decide if we reached the total memory limit. And secondly in the
transaction the change belongs to, so that we can pick the largest
transaction to evict (and serialize to disk).

We still use max_changes_in_memory when loading changes serialized
to disk. The trouble is we can't use the memory limit directly as
there might be multiple subxact serialized, we need to read all of
them but we don't know how many are there (and which subxact to
read first).

We do not serialize the ReorderBufferTXN entries, so if there is a
transaction with many subxacts, most memory may be in this type of
objects. Those records are not included in the memory accounting.

We also do not account for INTERNAL_TUPLECID changes, which are
kept in a separate list and not evicted from memory. Transactions
with many CTID changes may consume significant amounts of memory,
but we can't really do much about that.

The current eviction algorithm is very simple - the transaction is
picked merely by size, while it might be useful to also consider age
(LSN) of the changes for example. With the new Generational memory
allocator, evicting the oldest changes would make it more likely
the memory gets actually pfreed.

The logical_decoding_work_mem can be set in postgresql.conf, in which
case it serves as the default for all publishers on that instance.

Author: Tomas Vondra, with changes by Dilip Kumar and Amit Kapila
Reviewed-by: Dilip Kumar and Amit Kapila
Tested-By: Vignesh C
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/logical.conf              |   1 +
 doc/src/sgml/config.sgml                        |  21 ++
 src/backend/replication/logical/reorderbuffer.c | 293 +++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                    |  13 ++
 src/backend/utils/misc/postgresql.conf.sample   |   1 +
 src/include/replication/reorderbuffer.h         |  16 ++
 6 files changed, 333 insertions(+), 12 deletions(-)

diff --git a/contrib/test_decoding/logical.conf b/contrib/test_decoding/logical.conf
index 367f706..07c4d3d 100644
--- a/contrib/test_decoding/logical.conf
+++ b/contrib/test_decoding/logical.conf
@@ -1,2 +1,3 @@
 wal_level = logical
 max_replication_slots = 4
+logical_decoding_work_mem = 64kB
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f837703..d4d1fe4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1732,6 +1732,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk. This
+        limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 62e5424..d82a5f1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -49,6 +49,34 @@
  *	  GenerationContext for the variable-length transaction data (allocated
  *	  and freed in groups with similar lifespan).
  *
+ *	  To limit the amount of memory used by decoded changes, we track memory
+ *	  used at the reorder buffer level (i.e. total amount of memory), and for
+ *	  each transaction. When the total amount of used memory exceeds the
+ *	  limit, the transaction consuming the most memory is then serialized to
+ *	  disk.
+ *
+ *	  Only decoded changes are evicted from memory (spilled to disk), not the
+ *	  transaction records. The number of toplevel transactions is limited,
+ *	  but a transaction with many subtransactions may still consume significant
+ *	  amounts of memory. The transaction records are fairly small, though, and
+ *	  are not included in the memory limit.
+ *
+ *	  The current eviction algorithm is very simple - the transaction is
+ *	  picked merely by size, while it might be useful to also consider age
+ *	  (LSN) of the changes for example. With the new Generational memory
+ *	  allocator, evicting the oldest changes would make it more likely the
+ *	  memory gets actually freed.
+ *
+ *	  We still rely on max_changes_in_memory when loading serialized changes
+ *	  back into memory. At that point we can't use the memory limit directly
+ *	  as we load the subxacts independently. One option do deal with this
+ *	  would be to count the subxacts, and allow each to allocate 1/N of the
+ *	  memory limit. That however does not seem very appealing, because with
+ *	  many subtransactions it may easily cause trashing (short cycles of
+ *	  deserializing and applying very few changes). We probably should give
+ *	  a bit more memory to the oldest subtransactions, because it's likely
+ *	  the source for the next sequence of changes.
+ *
  * -------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -154,7 +182,8 @@ typedef struct ReorderBufferDiskChange
  * resource management here, but it's not entirely clear what that would look
  * like.
  */
-static const Size max_changes_in_memory = 4096;
+int			logical_decoding_work_mem;
+static const Size max_changes_in_memory = 4096; /* XXX for restore only */
 
 /* ---------------------------------------
  * primary reorderbuffer support routines
@@ -189,7 +218,7 @@ static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTX
  * Disk serialization support functions
  * ---------------------------------------
  */
-static void ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferCheckMemoryLimit(ReorderBuffer *rb);
 static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 										 int fd, ReorderBufferChange *change);
@@ -217,6 +246,14 @@ static void ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
 										  Relation relation, ReorderBufferChange *change);
 
+/*
+ * ---------------------------------------
+ * memory accounting
+ * ---------------------------------------
+ */
+static Size ReorderBufferChangeSize(ReorderBufferChange *change);
+static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
+											ReorderBufferChange *change, bool addition);
 
 /*
  * Allocate a new ReorderBuffer and clean out any old serialized state from
@@ -269,6 +306,7 @@ ReorderBufferAllocate(void)
 
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
+	buffer->size = 0;
 
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
@@ -374,6 +412,9 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 void
 ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 {
+	/* update memory accounting info */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
 	/* free contained data */
 	switch (change->action)
 	{
@@ -585,12 +626,18 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	change->lsn = lsn;
+	change->txn = txn;
+
 	Assert(InvalidXLogRecPtr != lsn);
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries++;
 	txn->nentries_mem++;
 
-	ReorderBufferCheckSerializeTXN(rb, txn);
+	/* update memory accounting information */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
+
+	/* check the memory limits and evict something if needed */
+	ReorderBufferCheckMemoryLimit(rb);
 }
 
 /*
@@ -1217,6 +1264,9 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
 		ReorderBufferReturnChange(rb, change);
 	}
 
@@ -1229,7 +1279,11 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferChange *change;
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
 		ReorderBufferReturnChange(rb, change);
 	}
 
@@ -2082,9 +2136,48 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferQueueChange(rb, xid, lsn, change);
 }
 
+/*
+ * Update the memory accounting info. We track memory used by the whole
+ * reorder buffer and the transaction containing the change.
+ */
+static void
+ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
+								ReorderBufferChange *change,
+								bool addition)
+{
+	Size		sz;
+
+	Assert(change->txn);
+
+	/*
+	 * Ignore tuple CID changes, because those are not evicted when reaching
+	 * the memory limit. So we just don't count them, as counting them might
+	 * easily trigger a pointless attempt to spill.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+		return;
+
+	sz = ReorderBufferChangeSize(change);
+
+	if (addition)
+	{
+		change->txn->size += sz;
+		rb->size += sz;
+	}
+	else
+	{
+		Assert((rb->size >= sz) && (change->txn->size >= sz));
+		change->txn->size -= sz;
+		rb->size -= sz;
+	}
+}
 
 /*
  * Add new (relfilenode, tid) -> (cmin, cmax) mappings.
+ *
+ * We do not include this change type in memory accounting, because we
+ * keep CIDs in a separate list and do not evict them when reaching
+ * the memory limit.
  */
 void
 ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
@@ -2103,6 +2196,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->data.tuplecid.cmax = cmax;
 	change->data.tuplecid.combocid = combocid;
 	change->lsn = lsn;
+	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
@@ -2230,20 +2324,84 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 }
 
 /*
- * Check whether the transaction tx should spill its data to disk.
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ *
+ * XXX With many subtransactions this might be quite slow, because we'll have
+ * to walk through all of them. There are some options for how we could
+ * improve that: (a) maintain some secondary structure with transactions
+ * sorted by amount of changes, (b) not look for the single largest
+ * transaction, but e.g. for a transaction using at least some fraction of
+ * the memory limit, and (c) evict multiple transactions at once, e.g. to
+ * free a given portion of the memory limit (e.g. 50%).
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)
+{
+	HASH_SEQ_STATUS hash_seq;
+	ReorderBufferTXNByIdEnt *ent;
+	ReorderBufferTXN *largest = NULL;
+
+	hash_seq_init(&hash_seq, rb->by_txn);
+	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	{
+		ReorderBufferTXN *txn = ent->txn;
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
+ * Check whether the logical_decoding_work_mem limit was reached, and if so,
+ * pick a transaction to evict and spill its changes to disk.
+ *
+ * XXX At this point we select just a single (largest) transaction, but
+ * we might also adopt a more elaborate eviction strategy - for example
+ * evicting enough transactions to free a certain fraction (e.g. 50%) of
+ * the memory limit.
  */
 static void
-ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
+	ReorderBufferTXN *txn;
+
+	/* bail out if we haven't exceeded the memory limit */
+	if (rb->size < logical_decoding_work_mem * 1024L)
+		return;
+
 	/*
-	 * TODO: improve accounting so we cheaply can take subtransactions into
-	 * account here.
+	 * Pick the largest transaction (or subtransaction) and evict it from
+	 * memory by serializing it to disk.
 	 */
-	if (txn->nentries_mem >= max_changes_in_memory)
-	{
-		ReorderBufferSerializeTXN(rb, txn);
-		Assert(txn->nentries_mem == 0);
-	}
+	txn = ReorderBufferLargestTXN(rb);
+
+	ReorderBufferSerializeTXN(rb, txn);
+
+	/*
+	 * After eviction, the transaction should have no entries in memory, and
+	 * should use 0 bytes for changes.
+	 */
+	Assert(txn->size == 0);
+	Assert(txn->nentries_mem == 0);
+
+	/*
+	 * And furthermore, evicting the transaction should get us below the
+	 * memory limit again - it is not possible that we're still exceeding the
+	 * memory limit after evicting the transaction.
+	 *
+	 * This follows from the simple fact that the selected transaction is at
+	 * least as large as the most recent change (which caused us to go over
+	 * the memory limit). So by evicting it we're definitely back below the
+	 * memory limit.
+	 */
+	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
 /*
@@ -2513,6 +2671,84 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 }
 
 /*
+ * Size of a change in memory.
+ */
+static Size
+ReorderBufferChangeSize(ReorderBufferChange *change)
+{
+	Size		sz = sizeof(ReorderBufferChange);
+
+	switch (change->action)
+	{
+			/* fall through these, they're all similar enough */
+		case REORDER_BUFFER_CHANGE_INSERT:
+		case REORDER_BUFFER_CHANGE_UPDATE:
+		case REORDER_BUFFER_CHANGE_DELETE:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+			{
+				ReorderBufferTupleBuf *oldtup,
+						   *newtup;
+				Size		oldlen = 0;
+				Size		newlen = 0;
+
+				oldtup = change->data.tp.oldtuple;
+				newtup = change->data.tp.newtuple;
+
+				if (oldtup)
+				{
+					sz += sizeof(HeapTupleData);
+					oldlen = oldtup->tuple.t_len;
+					sz += oldlen;
+				}
+
+				if (newtup)
+				{
+					sz += sizeof(HeapTupleData);
+					newlen = newtup->tuple.t_len;
+					sz += newlen;
+				}
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_MESSAGE:
+			{
+				Size		prefix_size = strlen(change->data.msg.prefix) + 1;
+
+				sz += prefix_size + change->data.msg.message_size +
+					sizeof(Size) + sizeof(Size);
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+			{
+				Snapshot	snap;
+
+				snap = change->data.snapshot;
+
+				sz += sizeof(SnapshotData) +
+					sizeof(TransactionId) * snap->xcnt +
+					sizeof(TransactionId) * snap->subxcnt;
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			{
+				sz += sizeof(Oid) * change->data.truncate.nrelids;
+
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+			/* ReorderBufferChange contains everything important */
+			break;
+	}
+
+	return sz;
+}
+
+
+/*
  * Restore a number of changes spilled to disk back into memory.
  */
 static Size
@@ -2784,6 +3020,16 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries_mem++;
+
+	/*
+	 * Update memory accounting for the restored change.  We need to do this
+	 * although we don't check the memory limit when restoring the changes in
+	 * this branch (we only do that when initially queueing the changes after
+	 * decoding), because we will release the changes later, and that will
+	 * update the accounting too (subtracting the size from the counters). And
+	 * we don't want to underflow there.
+	 */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
 }
 
 /*
@@ -3003,6 +3249,19 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
  *
  * We cannot replace unchanged toast tuples though, so those will still point
  * to on-disk toast data.
+ *
+ * While updating the existing change with detoasted tuple data, we need to
+ * update the memory accounting info, because the change size will differ.
+ * Otherwise the accounting may get out of sync, triggering serialization
+ * at unexpected times.
+ *
+ * We simply subtract the size of the change before rejiggering the tuple,
+ * and then add the new size back. This makes it look like the change was
+ * removed and then added back, except it only tweaks the accounting info.
+ *
+ * In particular it can't trigger serialization, which would be pointless
+ * anyway as it happens during commit processing right before handing
+ * the change to the output plugin.
  */
 static void
 ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3023,6 +3282,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	if (txn->toast_hash == NULL)
 		return;
 
+	/*
+	 * We're going to modify the size of the change, so to make sure the
+	 * accounting is correct we'll make it look like we're removing the change
+	 * now (with the old size), and then re-add it at the end.
+	 */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
 	oldcontext = MemoryContextSwitchTo(rb->context);
 
 	/* we should only have toast tuples in an INSERT or UPDATE */
@@ -3172,6 +3438,9 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	pfree(isnull);
 
 	MemoryContextSwitchTo(oldcontext);
+
+	/* now add the change back, with the correct size */
+	ReorderBufferChangeMemoryUpdate(rb, change, true);
 }
 
 /*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4b3769b..ba4edde 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -66,6 +66,7 @@
 #include "postmaster/syslogger.h"
 #include "postmaster/walwriter.h"
 #include "replication/logicallauncher.h"
+#include "replication/reorderbuffer.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
 #include "replication/walreceiver.h"
@@ -2257,6 +2258,18 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
+			gettext_noop("Sets the maximum memory to be used for logical decoding."),
+			gettext_noop("This much memory can be used by each internal "
+						 "reorder buffer before spilling to disk."),
+			GUC_UNIT_KB
+		},
+		&logical_decoding_work_mem,
+		65536, 64, MAX_KILOBYTES,
+		NULL, NULL, NULL
+	},
+
 	/*
 	 * We use the hopefully-safely-small value of 100kB as the compiled-in
 	 * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index be02a76..46a06ff 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -130,6 +130,7 @@
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
+#logical_decoding_work_mem = 64MB	# min 64kB
 #max_stack_depth = 2MB			# min 100kB
 #shared_memory_type = mmap		# the default is the first option
 					# supported by the operating system:
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4c06a78..7c94d92 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -17,6 +17,8 @@
 #include "utils/snapshot.h"
 #include "utils/timestamp.h"
 
+extern PGDLLIMPORT int logical_decoding_work_mem;
+
 /* an individual tuple, stored in one chunk of memory */
 typedef struct ReorderBufferTupleBuf
 {
@@ -63,6 +65,9 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
+/* forward declaration */
+struct ReorderBufferTXN;
+
 /*
  * a single 'change', can be an insert (with one tuple), an update (old, new),
  * or a delete (old).
@@ -77,6 +82,9 @@ typedef struct ReorderBufferChange
 	/* The type of change. */
 	enum ReorderBufferChangeType action;
 
+	/* Transaction this change belongs to. */
+	struct ReorderBufferTXN *txn;
+
 	RepOriginId origin_id;
 
 	/*
@@ -286,6 +294,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	node;
 
+	/*
+	 * Size of this transaction (changes currently in memory, in bytes).
+	 */
+	Size		size;
+
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -386,6 +399,9 @@ struct ReorderBuffer
 	/* buffer for disk<->memory conversions */
 	char	   *outbuf;
 	Size		outbufsize;
+
+	/* memory accounting */
+	Size		size;
 };
 
 
-- 
1.8.3.1
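
For reference, the GUC added above behaves like the other *_work_mem settings; a minimal usage sketch (assuming a session allowed to change the setting; units default to kB, per GUC_UNIT_KB):

SHOW logical_decoding_work_mem;                        -- compiled-in default is 65536 kB (64MB)
SET logical_decoding_work_mem = '64kB';                -- session-level, forces earlier spilling
ALTER SYSTEM SET logical_decoding_work_mem = '128MB';  -- instance-wide default
SELECT pg_reload_conf();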

0002-Track-statistics-for-spilling-of-changes-from-Reorde.nov16.patch (application/octet-stream)
From e5191f045484d5cd27578868598dec94fcbc06db Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 16 Nov 2019 18:24:00 +0530
Subject: [PATCH 2/2] Track statistics for spilling of changes from
 ReorderBuffer.

This adds the statistics about transactions spilled to disk from
ReorderBuffer.  Users can query the pg_stat_replication view to check
these stats.

Author: Tomas Vondra, with bug-fixes and minor changes by Dilip Kumar
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/monitoring.sgml                    | 20 ++++++++++++
 src/backend/catalog/system_views.sql            |  5 ++-
 src/backend/replication/logical/reorderbuffer.c | 12 +++++++
 src/backend/replication/walsender.c             | 42 +++++++++++++++++++++++--
 src/include/catalog/pg_proc.dat                 |  6 ++--
 src/include/replication/reorderbuffer.h         | 11 +++++++
 src/include/replication/walsender_private.h     |  5 +++
 src/test/regress/expected/rules.out             |  7 +++--
 8 files changed, 100 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 901fee9..a3c5f86 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1972,6 +1972,26 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
      <entry><type>timestamp with time zone</type></entry>
      <entry>Send time of last reply message received from standby server</entry>
     </row>
+    <row>
+     <entry><structfield>spill_bytes</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Amount of decoded transaction data spilled to disk.</entry>
+    </row>
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_decoding_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.</entry>
+    </row>
+    <row>
+     <entry><structfield>spill_count</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of times transactions were spilled to disk. Transactions
+      may get spilled repeatedly, and this counter gets incremented on every
+      such invocation.</entry>
+    </row>
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 4456fef..f7800f0 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -776,7 +776,10 @@ CREATE VIEW pg_stat_replication AS
             W.replay_lag,
             W.sync_priority,
             W.sync_state,
-            W.reply_time
+            W.reply_time,
+            W.spill_txns,
+            W.spill_count,
+            W.spill_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d82a5f1..53affeb 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -308,6 +308,10 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	buffer->spillCount = 0;
+	buffer->spillTxns = 0;
+	buffer->spillBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -2415,6 +2419,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	int			fd = -1;
 	XLogSegNo	curOpenSegNo = 0;
 	Size		spilled = 0;
+	Size		size = txn->size;
 
 	elog(DEBUG2, "spill %u changes in XID %u to disk",
 		 (uint32) txn->nentries_mem, txn->xid);
@@ -2473,6 +2478,13 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		spilled++;
 	}
 
+	/* update the statistics */
+	rb->spillCount += 1;
+	rb->spillBytes += size;
+
+	/* Don't consider already serialized transaction. */
+	rb->spillTxns += txn->serialized ? 0 : 1;
+
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7f56715..fa75872 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -248,6 +248,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
+static void UpdateSpillStats(LogicalDecodingContext *ctx);
 static void XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count);
 
 
@@ -1261,7 +1262,8 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
 /*
  * LogicalDecodingContext 'update_progress' callback.
  *
- * Write the current position to the lag tracker (see XLogSendPhysical).
+ * Write the current position to the lag tracker (see XLogSendPhysical),
+ * and update the spill statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1280,6 +1282,11 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 
 	LagTrackerWrite(lsn, now);
 	sendTime = now;
+
+	/*
+	 * Update statistics about transactions that spilled to disk.
+	 */
+	UpdateSpillStats(ctx);
 }
 
 /*
@@ -2318,6 +2325,9 @@ InitWalSenderSlot(void)
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			walsnd->replyTime = 0;
+			walsnd->spillTxns = 0;
+			walsnd->spillCount = 0;
+			walsnd->spillBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3219,7 +3229,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	12
+#define PG_STAT_GET_WAL_SENDERS_COLS	15
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3274,6 +3284,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int			pid;
 		WalSndState state;
 		TimestampTz replyTime;
+		int64		spillTxns;
+		int64		spillCount;
+		int64		spillBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3294,6 +3307,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		replyTime = walsnd->replyTime;
+		spillTxns = walsnd->spillTxns;
+		spillCount = walsnd->spillCount;
+		spillBytes = walsnd->spillBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3375,6 +3391,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[11] = true;
 			else
 				values[11] = TimestampTzGetDatum(replyTime);
+
+			/* spill to disk */
+			values[12] = Int64GetDatum(spillTxns);
+			values[13] = Int64GetDatum(spillCount);
+			values[14] = Int64GetDatum(spillBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3611,3 +3632,20 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+static void
+UpdateSpillStats(LogicalDecodingContext *ctx)
+{
+	ReorderBuffer *rb = ctx->reorder;
+
+	SpinLockAcquire(&MyWalSnd->mutex);
+
+	MyWalSnd->spillTxns = rb->spillTxns;
+	MyWalSnd->spillCount = rb->spillCount;
+	MyWalSnd->spillBytes = rb->spillBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %ld %ld %ld",
+		 rb, rb->spillTxns, rb->spillCount, rb->spillBytes);
+
+	SpinLockRelease(&MyWalSnd->mutex);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 58ea5b9..fa0a2a1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5166,9 +5166,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 7c94d92..0867ee9 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -402,6 +402,17 @@ struct ReorderBuffer
 
 	/* memory accounting */
 	Size		size;
+
+	/*
+	 * Statistics about transactions spilled to disk.
+	 *
+	 * A single transaction may be spilled repeatedly, which is why we keep
+	 * two different counters. For spilling, the transaction counter includes
+	 * both toplevel transactions and subtransactions.
+	 */
+	int64		spillCount;		/* spill-to-disk invocation counter */
+	int64		spillTxns;		/* number of transactions spilled to disk  */
+	int64		spillBytes;		/* amount of data spilled to disk */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0dd6d1c..a6b3205 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -80,6 +80,11 @@ typedef struct WalSnd
 	 * Timestamp of the last message received from standby.
 	 */
 	TimestampTz replyTime;
+
+	/* Statistics for transactions spilled to disk. */
+	int64		spillTxns;
+	int64		spillCount;
+	int64		spillBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 14e7214..22e6c86 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1952,9 +1952,12 @@ pg_stat_replication| SELECT s.pid,
     w.replay_lag,
     w.sync_priority,
     w.sync_state,
-    w.reply_time
+    w.reply_time,
+    w.spill_txns,
+    w.spill_count,
+    w.spill_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1
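
With the statistics patch above applied, the new counters are visible in pg_stat_replication; a minimal sketch (assuming at least one active walsender):

SELECT application_name, spill_txns, spill_count, spill_bytes
FROM pg_stat_replication;

A spill_count much larger than spill_txns indicates that the same transactions were spilled repeatedly; spill_bytes is the total amount of decoded change data written to disk.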

#138Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#136)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few other comments on this patch:
1.
+ case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+ /*
+ * Execute the invalidation message locally.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */
+ LocalExecuteInvalidationMessage(&change->data.inval.msg);
+ break;

Here, why are we executing messages individually? Can't we just
follow what we do in DecodeCommit which is to record the invalidations
in ReorderBufferTXN as we encounter them and then allow them to
execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
reason why we don't do ReorderBufferXidSetCatalogChanges when we
receive any invalidation message?

I think it's fine to call ReorderBufferXidSetCatalogChanges only on
commit, because this is required to add any committed transaction to
the snapshot if it has done any catalog changes. So I think there is
no point in setting that flag every time we get an invalidation
message.

IMHO, the reason is that in DecodeCommit we get all the invalidations
at one time, so at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID we don't
know which invalidation message to execute and, to be safe, we have
to execute all of them. But since we are logging each invalidation
individually, we know exactly at this stage which cache to invalidate.
So it is better to invalidate only the required cache, not all of them.

In that case, invalidations can be processed multiple times: first
when these individual WAL records for invalidations are processed, and
then later at commit time when we accumulate all invalidation messages
and execute them for REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. After
this patch, can we avoid executing invalidations from the other
places, which also includes executing them as part of
XLOG_INVALIDATIONS processing?

I think we can avoid the invalidations done as part of
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. I need to investigate
further the invalidations done as part of XLOG_INVALIDATIONS.
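
To make the scenario concrete, the invalidations under discussion matter for transactions that mix DML and catalog changes; a minimal sketch (table t is hypothetical):

BEGIN;
INSERT INTO t VALUES (1);
ALTER TABLE t ADD COLUMN b int;   -- queues relcache/catcache invalidations
INSERT INTO t VALUES (2, 20);     -- decoding must see the new column here
COMMIT;

With the invalidations patch, each invalidation is WAL-logged individually (XLOG_XACT_INVALIDATIONS) and can be replayed at the right point in the change stream, rather than only from the commit record.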

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#139Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#138)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few other comments on this patch:
1.
+ case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+ /*
+ * Execute the invalidation message locally.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */
+ LocalExecuteInvalidationMessage(&change->data.inval.msg);
+ break;

Here, why are we executing messages individually? Can't we just
follow what we do in DecodeCommit which is to record the invalidations
in ReorderBufferTXN as we encounter them and then allow them to
execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
reason why we don't do ReorderBufferXidSetCatalogChanges when we
receive any invalidation message?

I think it's fine to call ReorderBufferXidSetCatalogChanges only on
commit, because this is required to add any committed transaction to
the snapshot if it has done any catalog changes.

Hmm, this is also used to build the cid hash map (see
ReorderBufferBuildTupleCidHash), which we need while streaming
changes for in-progress transactions. So I think it would be
required earlier (before commit) as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#140Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#137)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Nov 16, 2019 at 6:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Some notes before commit:
--------------------------------------
1.
Commit message need to be changed for the first patch
-------------------------------------------------------------------------
A.

The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this

SET logical_decoding_work_mem = '128kB'

to trigger very aggressive streaming. The minimum value is 64kB.

I think this patch doesn't contain streaming, so we either need to
reword it or remove it.

B.

The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the
subscription, using a work_mem parameter in the WITH clause (specifies the number of kilobytes).

We need to reword this as we have decided to remove the setting from
the subscription side as of now.

2. I think we can change the message level in UpdateSpillStats() to DEBUG2.

I have made these modifications and additionally ran pgindent.

4. I think we can combine both patches and commit as one patch, but it
is okay to commit them separately as well.

I am not sure if this is a good idea, so still kept them as separate.

I have committed the first patch. I will commit the second one
related to stats of spilled xacts on Thursday. The second patch needs
a catalog version bump as well because we are modifying the catalog
contents in that patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#141Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#139)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few other comments on this patch:
1.
+ case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+ /*
+ * Execute the invalidation message locally.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */
+ LocalExecuteInvalidationMessage(&change->data.inval.msg);
+ break;

Here, why are we executing messages individually? Can't we just
follow what we do in DecodeCommit which is to record the invalidations
in ReorderBufferTXN as we encounter them and then allow them to
execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
reason why we don't do ReorderBufferXidSetCatalogChanges when we
receive any invalidation message?

I think it's fine to call ReorderBufferXidSetCatalogChanges only on
commit, because this is required to add any committed transaction to
the snapshot if it has done any catalog changes.

Hmm, this is also used to build the cid hash map (see
ReorderBufferBuildTupleCidHash), which we need while streaming
changes for in-progress transactions. So I think it would be
required earlier (before commit) as well.

Oh right, I guess I missed that part.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#142Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#141)
13 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few other comments on this patch:
1.
+ case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+ /*
+ * Execute the invalidation message locally.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */
+ LocalExecuteInvalidationMessage(&change->data.inval.msg);
+ break;

Here, why are we executing messages individually? Can't we just
follow what we do in DecodeCommit which is to record the invalidations
in ReorderBufferTXN as we encounter them and then allow them to
execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
reason why we don't do ReorderBufferXidSetCatalogChanges when we
receive any invalidation message?

I think it's fine to call ReorderBufferXidSetCatalogChanges only on
commit, because this is required to add any committed transaction to
the snapshot if it has done any catalog changes.

Hmm, this is also used to build the cid hash map (see
ReorderBufferBuildTupleCidHash), which we need while streaming
changes for in-progress transactions. So I think it would be
required earlier (before commit) as well.

Oh right, I guess I missed that part.

Attached is a new rebased version of the patch set. I have fixed all
the issues discussed and agreed upon up-thread.

Pending Issues:
1. The default value of logical_decoding_work_mem is set to 64kB
in test_decoding/logical.conf, so we need to change the expected
output files for the test_decoding module.
2. Need to complete the patch for concurrent abort handling of
(sub)transactions. There are some pending issues with the existing
patch[1].

[1]: /messages/by-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A@mail.gmail.com
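
For pending issue 1, the change amounts to one line in the test configuration; a sketch, assuming the file otherwise keeps its current contents:

# contrib/test_decoding/logical.conf
wal_level = logical
max_replication_slots = 4
logical_decoding_work_mem = 64kB   # force frequent spilling in the regression tests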

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0002-Immediately-WAL-log-assignments.patch (application/octet-stream)
From 33439d94691e636a4439d4e234d76fc9916b8359 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 09:53:08 +0530
Subject: [PATCH 02/13] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So instead we write the assignment info into WAL immediately, as
part of the next WAL record (to minimize overhead).
---
 src/backend/access/transam/xact.c        | 45 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 +++++++++++++--------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8fe38c3..4a853f3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -5096,6 +5097,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6002,3 +6004,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index aa9dca0..a8a8084 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7f24f0c..4ef2661 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1071,6 +1071,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1109,6 +1110,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index bc532d0..897b755 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 record->toplevel_xid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 42b76cb..0c7daf4 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d519252..b492d3e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 1bbee38..c37a83d 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -148,6 +148,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -243,6 +245,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 9375e54..bcfba0a 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1
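
The case this patch targets is easy to reproduce; a minimal sketch (table t is hypothetical):

BEGIN;
SAVEPOINT s1;
INSERT INTO t VALUES (1);   -- the subxact acquires its own XID here
SAVEPOINT s2;
INSERT INTO t VALUES (2);
COMMIT;

With wal_level=logical, the first WAL record of each newly-assigned subtransaction now carries the toplevel XID (XLR_BLOCK_ID_TOPLEVEL_XID), so the reorder buffer can associate the subxact with its parent immediately, instead of the assignment possibly being delayed by the PGPROC_MAX_CACHED_SUBXIDS caching.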

0001-Track-statistics-for-spilling-of-changes-from-Reorde.patch (application/octet-stream)
From 45bdc8b9116844c2a458110363b14e820911bccb Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 16 Nov 2019 18:24:00 +0530
Subject: [PATCH 01/13] Track statistics for spilling of changes from
 ReorderBuffer.

This adds the statistics about transactions spilled to disk from
ReorderBuffer.  Users can query the pg_stat_replication view to check
these stats.

Author: Tomas Vondra, with bug-fixes and minor changes by Dilip Kumar
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/monitoring.sgml                    | 20 ++++++++++++
 src/backend/catalog/system_views.sql            |  5 ++-
 src/backend/replication/logical/reorderbuffer.c | 12 +++++++
 src/backend/replication/walsender.c             | 42 +++++++++++++++++++++++--
 src/include/catalog/pg_proc.dat                 |  6 ++--
 src/include/replication/reorderbuffer.h         | 11 +++++++
 src/include/replication/walsender_private.h     |  5 +++
 src/test/regress/expected/rules.out             |  7 +++--
 8 files changed, 100 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 901fee9..a3c5f86 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1972,6 +1972,26 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
      <entry><type>timestamp with time zone</type></entry>
      <entry>Send time of last reply message received from standby server</entry>
     </row>
+    <row>
+     <entry><structfield>spill_bytes</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Amount of decoded transaction data spilled to disk.</entry>
+    </row>
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_decoding_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.</entry>
+    </row>
+    <row>
+     <entry><structfield>spill_count</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of times transactions were spilled to disk. Transactions
+      may get spilled repeatedly, and this counter gets incremented on every
+      such invocation.</entry>
+    </row>
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 4456fef..f7800f0 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -776,7 +776,10 @@ CREATE VIEW pg_stat_replication AS
             W.replay_lag,
             W.sync_priority,
             W.sync_state,
-            W.reply_time
+            W.reply_time,
+            W.spill_txns,
+            W.spill_count,
+            W.spill_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d82a5f1..53affeb 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -308,6 +308,10 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	buffer->spillCount = 0;
+	buffer->spillTxns = 0;
+	buffer->spillBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -2415,6 +2419,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	int			fd = -1;
 	XLogSegNo	curOpenSegNo = 0;
 	Size		spilled = 0;
+	Size		size = txn->size;
 
 	elog(DEBUG2, "spill %u changes in XID %u to disk",
 		 (uint32) txn->nentries_mem, txn->xid);
@@ -2473,6 +2478,13 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		spilled++;
 	}
 
+	/* update the statistics */
+	rb->spillCount += 1;
+	rb->spillBytes += size;
+
+	/* Don't consider already serialized transaction. */
+	rb->spillTxns += txn->serialized ? 0 : 1;
+
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7f56715..fa75872 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -248,6 +248,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
+static void UpdateSpillStats(LogicalDecodingContext *ctx);
 static void XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count);
 
 
@@ -1261,7 +1262,8 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
 /*
  * LogicalDecodingContext 'update_progress' callback.
  *
- * Write the current position to the lag tracker (see XLogSendPhysical).
+ * Write the current position to the lag tracker (see XLogSendPhysical),
+ * and update the spill statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1280,6 +1282,11 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 
 	LagTrackerWrite(lsn, now);
 	sendTime = now;
+
+	/*
+	 * Update statistics about transactions that spilled to disk.
+	 */
+	UpdateSpillStats(ctx);
 }
 
 /*
@@ -2318,6 +2325,9 @@ InitWalSenderSlot(void)
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			walsnd->replyTime = 0;
+			walsnd->spillTxns = 0;
+			walsnd->spillCount = 0;
+			walsnd->spillBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3219,7 +3229,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	12
+#define PG_STAT_GET_WAL_SENDERS_COLS	15
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3274,6 +3284,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int			pid;
 		WalSndState state;
 		TimestampTz replyTime;
+		int64		spillTxns;
+		int64		spillCount;
+		int64		spillBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3294,6 +3307,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		replyTime = walsnd->replyTime;
+		spillTxns = walsnd->spillTxns;
+		spillCount = walsnd->spillCount;
+		spillBytes = walsnd->spillBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3375,6 +3391,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[11] = true;
 			else
 				values[11] = TimestampTzGetDatum(replyTime);
+
+			/* spill to disk */
+			values[12] = Int64GetDatum(spillTxns);
+			values[13] = Int64GetDatum(spillCount);
+			values[14] = Int64GetDatum(spillBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3611,3 +3632,20 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+static void
+UpdateSpillStats(LogicalDecodingContext *ctx)
+{
+	ReorderBuffer *rb = ctx->reorder;
+
+	SpinLockAcquire(&MyWalSnd->mutex);
+
+	MyWalSnd->spillTxns = rb->spillTxns;
+	MyWalSnd->spillCount = rb->spillCount;
+	MyWalSnd->spillBytes = rb->spillBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %ld %ld %ld",
+		 rb, rb->spillTxns, rb->spillCount, rb->spillBytes);
+
+	SpinLockRelease(&MyWalSnd->mutex);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 58ea5b9..fa0a2a1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5166,9 +5166,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 7c94d92..0867ee9 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -402,6 +402,17 @@ struct ReorderBuffer
 
 	/* memory accounting */
 	Size		size;
+
+	/*
+	 * Statistics about transactions spilled to disk.
+	 *
+	 * A single transaction may be spilled repeatedly, which is why we keep
+	 * two different counters. For spilling, the transaction counter includes
+	 * both toplevel transactions and subtransactions.
+	 */
+	int64		spillCount;		/* spill-to-disk invocation counter */
+	int64		spillTxns;		/* number of transactions spilled to disk  */
+	int64		spillBytes;		/* amount of data spilled to disk */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0dd6d1c..a6b3205 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -80,6 +80,11 @@ typedef struct WalSnd
 	 * Timestamp of the last message received from standby.
 	 */
 	TimestampTz replyTime;
+
+	/* Statistics for transactions spilled to disk. */
+	int64		spillTxns;
+	int64		spillCount;
+	int64		spillBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index abe3a43..c9cc569 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1952,9 +1952,12 @@ pg_stat_replication| SELECT s.pid,
     w.replay_lag,
     w.sync_priority,
     w.sync_state,
-    w.reply_time
+    w.reply_time,
+    w.spill_txns,
+    w.spill_count,
+    w.spill_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1

0003-Issue-individual-invalidations-with-wal_level-logica.patch (application/octet-stream)
From e4d3e144b8babaf08be669c17b577c5348e0e3cc Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH 03/13] Issue individual invalidations with wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager.
See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in memory
and writes them only once, at commit time, which may reduce the
performance impact by amortizing the overhead and deduplicating the
invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c          | 52 +++++++++++++++++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 23 +++++++++
 src/backend/replication/logical/reorderbuffer.c | 56 +++++++++++++++++---
 src/backend/utils/cache/inval.c                 | 69 +++++++++++++++++++++++++
 src/include/access/xact.h                       | 18 ++++++-
 src/include/replication/reorderbuffer.h         | 14 +++++
 7 files changed, 231 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 4c411c5..6cfd6af 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,11 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -397,6 +402,14 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs,
+								xlrec->dbId, xlrec->tsId,
+								xlrec->relcacheInitFileInval);
+	}
 }
 
 const char *
@@ -424,7 +437,46 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval)
+{
+	int			i;
+
+	if (relcacheInitFileInval)
+		appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
+						 dbId, tsId);
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+			appendStringInfo(buf, " snapshot %u", msg->sn.relId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4a853f3..a42d11f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6001,6 +6001,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 897b755..9bcefb6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,29 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/* XXX for now we're issuing invalidations one by one */
+				Assert(invals->nmsgs == 1);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->dbId, invals->tsId,
+											 invals->relcacheInitFileInval,
+											 invals->msgs[0]);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 53affeb..b1feff3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -464,6 +464,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -1804,17 +1805,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					txn->is_schema_sent = false;
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2209,6 +2216,38 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn,
+							 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+							 SharedInvalidationMessage msg)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	Assert(xid != InvalidTransactionId);
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.dbId = dbId;
+	change->data.inval.tsId = tsId;
+	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
+	change->data.inval.msg = msg;
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2656,6 +2695,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -2752,6 +2792,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -3027,6 +3068,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index f09e3a9..0682c55 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -104,6 +104,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +211,9 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -489,6 +493,18 @@ RegisterCatcacheInvalidation(int cacheId,
 {
 	AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   cacheId, hashValue, dbId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cc.id = (int8) cacheId;
+		msg.cc.dbId = dbId;
+		msg.cc.hashValue = hashValue;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -501,6 +517,18 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 {
 	AddCatalogInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								  dbId, catId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cat.id = SHAREDINVALCATALOG_ID;
+		msg.cat.dbId = dbId;
+		msg.cat.catId = catId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -511,6 +539,8 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 static void
 RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 {
+	bool		RelcacheInitFileInval = false;
+
 	AddRelcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
 
@@ -529,7 +559,22 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 	 * as well.  Also zap when we are invalidating whole relcache.
 	 */
 	if (relId == InvalidOid || RelationIdIsInInitFile(relId))
+	{
 		transInvalInfo->RelcacheInitFileInval = true;
+		RelcacheInitFileInval = true;
+	}
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.rc.id = SHAREDINVALRELCACHE_ID;
+		msg.rc.dbId = dbId;
+		msg.rc.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, RelcacheInitFileInval);
+	}
 }
 
 /*
@@ -1501,3 +1546,27 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval)
+{
+	xl_xact_invalidations xlrec;
+
+	/* prepare record */
+	memset(&xlrec, 0, sizeof(xlrec));
+	xlrec.dbId = MyDatabaseId;
+	xlrec.tsId = MyDatabaseTableSpace;
+	xlrec.relcacheInitFileInval = relcacheInitFileInval;
+	xlrec.nmsgs = nmsgs;
+
+	/* perform insertion */
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+	XLogRegisterData((char *) msgs,
+					 nmsgs * sizeof(SharedInvalidationMessage));
+	XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 0c7daf4..5bd3893 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,22 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ *
+ * XXX Currently nmsgs=1 but that might change in the future.
+ */
+typedef struct xl_xact_invalidations
+{
+	Oid			dbId;			/* MyDatabaseId */
+	Oid			tsId;			/* MyDatabaseTableSpace */
+	bool		relcacheInitFileInval;	/* invalidate relcache init file */
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0867ee9..6a7187b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,16 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			Oid			dbId;	/* MyDatabaseId */
+			Oid			tsId;	/* MyDatabaseTableSpace */
+			bool		relcacheInitFileInval;	/* invalidate relcache init
+												 * file */
+			SharedInvalidationMessage msg;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -448,6 +459,9 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void		ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+										 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+										 SharedInvalidationMessage msg);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1
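
For context, the decode.c counterpart that consumes the new
XLOG_XACT_INVALIDATIONS record is not quoted in the hunks above. A minimal
sketch of what it needs to do, assuming the usual DecodeXactOp() locals
(r, buf, ctx) and treating the exact shape as illustrative, would be:

	case XLOG_XACT_INVALIDATIONS:
		{
			TransactionId xid = XLogRecGetXid(r);
			xl_xact_invalidations *invals;

			invals = (xl_xact_invalidations *) XLogRecGetData(r);

			/* nmsgs is currently always 1, but loop to stay future-proof */
			for (int i = 0; i < invals->nmsgs; i++)
				ReorderBufferAddInvalidation(ctx->reorder, xid, buf->origptr,
											 invals->dbId, invals->tsId,
											 invals->relcacheInitFileInval,
											 invals->msgs[i]);
			break;
		}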

Attachment: 0005-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch (application/octet-stream)
From d750bf1746a2cd61ff9d72ff690baf400512605c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 18:08:37 +0200
Subject: [PATCH 05/13] Cleaning up of flags in ReorderBufferTXN structure

---
 src/backend/replication/logical/reorderbuffer.c | 36 ++++++++++++-------------
 src/include/replication/reorderbuffer.h         | 33 ++++++++++++++---------
 2 files changed, 38 insertions(+), 31 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b1feff3..3422939 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -732,7 +732,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_first_lsn < cur_txn->first_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_first_lsn = cur_txn->first_lsn;
 	}
@@ -752,7 +752,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
 	}
@@ -775,7 +775,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
 
 	txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
 
-	Assert(!txn->is_known_as_subxact);
+	Assert(!rbtxn_is_known_subxact(txn));
 	Assert(txn->first_lsn != InvalidXLogRecPtr);
 	return txn;
 }
@@ -835,7 +835,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 
 	if (!new_sub)
 	{
-		if (subtxn->is_known_as_subxact)
+		if (rbtxn_is_known_subxact(subtxn))
 		{
 			/* already associated, nothing to do */
 			return;
@@ -851,7 +851,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 		}
 	}
 
-	subtxn->is_known_as_subxact = true;
+	subtxn->txn_flags |= RBTXN_IS_SUBXACT;
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
@@ -1061,7 +1061,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	{
 		ReorderBufferChange *cur_change;
 
-		if (txn->serialized)
+		if (rbtxn_is_serialized(txn))
 		{
 			/* serialize remaining changes */
 			ReorderBufferSerializeTXN(rb, txn);
@@ -1090,7 +1090,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		{
 			ReorderBufferChange *cur_change;
 
-			if (cur_txn->serialized)
+			if (rbtxn_is_serialized(cur_txn))
 			{
 				/* serialize remaining changes */
 				ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1256,7 +1256,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * they originally were happening inside another subtxn, so we won't
 		 * ever recurse more than one level deep here.
 		 */
-		Assert(subtxn->is_known_as_subxact);
+		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
 		ReorderBufferCleanupTXN(rb, subtxn);
@@ -1304,7 +1304,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/*
 	 * Remove TXN from its containing list.
 	 *
-	 * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
 	 * from the LSN-ordered list of toplevel TXNs.
@@ -1319,7 +1319,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	Assert(found);
 
 	/* remove entries spilled to disk */
-	if (txn->serialized)
+	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
 	/* deallocate */
@@ -1336,7 +1336,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
 		return;
 
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1970,7 +1970,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 			 * final_lsn to that of their last change; this causes
 			 * ReorderBufferRestoreCleanup to do the right thing.
 			 */
-			if (txn->serialized && txn->final_lsn == 0)
+			if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
 			{
 				ReorderBufferChange *last =
 				dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2118,7 +2118,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
 	 * operate on its top-level transaction instead.
 	 */
 	txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 	Assert(txn->base_snapshot == NULL);
@@ -2297,7 +2297,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->has_catalog_changes = true;
+	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2314,7 +2314,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
 	if (txn == NULL)
 		return false;
 
-	return txn->has_catalog_changes;
+	return rbtxn_has_catalog_changes(txn);
 }
 
 /*
@@ -2334,7 +2334,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
 		return false;
 
 	/* a known subtxn? operate on top-level txn instead */
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 
@@ -2522,12 +2522,12 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	rb->spillBytes += size;
 
 	/* Don't consider already serialized transaction. */
-	rb->spillTxns += txn->serialized ? 0 : 1;
+	rb->spillTxns += rbtxn_is_serialized(txn) ? 0 : 1;
 
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
-	txn->serialized = true;
+	txn->txn_flags |= RBTXN_IS_SERIALIZED;
 
 	if (fd != -1)
 		CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5b4be2b..19c7bac 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -169,18 +169,34 @@ typedef struct ReorderBufferChange
 	dlist_node	node;
 } ReorderBufferChange;
 
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT          0x0002
+#define RBTXN_IS_SERIALIZED       0x0004
+
+/* Does the txn have catalog changes? */
+#define rbtxn_has_catalog_changes(txn) ((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* Is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn)    ((txn)->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk?  It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn)       ((txn)->txn_flags & RBTXN_IS_SERIALIZED)
+
 typedef struct ReorderBufferTXN
 {
+	int			txn_flags;
+
 	/*
 	 * The transactions transaction id, can be a toplevel or sub xid.
 	 */
 	TransactionId xid;
 
-	/* did the TX have catalog changes */
-	bool		has_catalog_changes;
-
-	/* Do we know this is a subxact?  Xid of top-level txn if so */
-	bool		is_known_as_subxact;
+	/* Xid of the top-level transaction, if this is known to be a subxact */
 	TransactionId toplevel_xid;
 
 	/*
@@ -249,15 +265,6 @@ typedef struct ReorderBufferTXN
 	uint64		nentries_mem;
 
 	/*
-	 * Has this transaction been spilled to disk?  It's not always possible to
-	 * deduce that fact by comparing nentries with nentries_mem, because e.g.
-	 * subtransactions of a large transaction might get serialized together
-	 * with the parent - if they're restored to memory they'd have
-	 * nentries_mem == nentries.
-	 */
-	bool		serialized;
-
-	/*
 	 * List of ReorderBufferChange structs, including new Snapshots and new
 	 * CommandIds
 	 */
-- 
1.8.3.1
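
The net effect of 0005 is mechanical: three booleans collapse into bit
flags in txn_flags, tested through accessor macros. A minimal illustration
of the idiom (not part of the patch; assumes rb and txn are in scope):

	/* set a flag (replaces txn->has_catalog_changes = true) */
	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;

	/* test a flag via its accessor macro (replaces txn->serialized) */
	if (rbtxn_is_serialized(txn))
		ReorderBufferRestoreCleanup(rb, txn);

	/* clearing a flag, should a later patch in the series need to */
	txn->txn_flags &= ~RBTXN_IS_SERIALIZED;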

Attachment: 0004-Extend-the-output-plugin-API-with-stream-methods.patch (application/octet-stream)
From f2d26607b79cacc54c89b70a8f88d6194e814f26 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH 04/13] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 6c33c4b..9c77791 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -542,3 +571,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db9686..fc4ad65 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    transaction size and network bandwidth, the transfer time
+    may significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callback to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and one optional callback
+    (<function>stream_message_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similarly to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_work_mem</varname> setting. At
+    that point the largest toplevel transaction (measured by amount of memory
+    currently used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 7e06615..b88b585 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the start/stop/change/commit/abort
+	 * callbacks. The message and truncate callbacks are optional, similarly
+	 * to regular output plugins. However, we consider streaming enabled as
+	 * soon as at least one of the methods is provided, so that missing
+	 * required methods can be easily identified.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -863,6 +911,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	/* FIXME ctx->write_location = apply_lsn; */
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	/* FIXME ctx->write_location = apply_lsn; */
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 31c796b..d95d1b9 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -81,6 +81,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index d4ce54f..a305462 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6a7187b..5b4be2b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -345,6 +345,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -384,6 +430,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
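
To make the registration contract of 0004 concrete: a plugin opts into
streaming by filling in the stream_* members of OutputPluginCallbacks.
Providing any one of them sets ctx->streaming, and the five required ones
are then enforced lazily by the wrappers in logical.c. A minimal sketch
with hypothetical my_* callbacks, mirroring the test_decoding changes
above:

	void
	_PG_output_plugin_init(OutputPluginCallbacks *cb)
	{
		/* regular (commit-time) callbacks */
		cb->begin_cb = my_begin;
		cb->change_cb = my_change;
		cb->commit_cb = my_commit;

		/* streaming callbacks; message/truncate may be left NULL */
		cb->stream_start_cb = my_stream_start;
		cb->stream_stop_cb = my_stream_stop;
		cb->stream_change_cb = my_stream_change;
		cb->stream_abort_cb = my_stream_abort;
		cb->stream_commit_cb = my_stream_commit;
	}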

Attachment: 0006-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patch (application/octet-stream)
From fc83f3bbe01cd7e807033e5a9b2ef8dd48b8f5ca Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:32:34 +0530
Subject: [PATCH 06/13] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of this sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 51 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 34 +++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  9 +++--
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++-
 src/include/utils/snapmgr.h                     |  4 +-
 6 files changed, 120 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index fc4ad65..da6a6f3 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb3..2a60a73 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1303,6 +1303,17 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_getnext call")));
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1422,6 +1433,16 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_fetch call")));
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1535,6 +1556,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_hot_search_buffer call")));
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1683,6 +1714,16 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_get_latest_tid call")));
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5481,6 +5522,16 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_finish_speculative call")));
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d..201acfb 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,17 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +525,17 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +662,17 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it has aborted. If it
+	 * has, we error out.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3422939..bda4a1c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -683,7 +683,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1533,7 +1533,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1784,7 +1784,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1804,7 +1804,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
@@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_CATCH();
 	{
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
 
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 47b0517..9fa1e43 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly in-progress or prepared transaction.
+ * Currently used in logical decoding.  Such a transaction can get aborted
+ * while the decoding is still ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check whether it is uncommitted and track
+ * it in CheckXidAlive, so its status can be re-checked during catalog access.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether it aborted here; that happens during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 67b07df..9a8f9ce 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1
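
As an aside for reviewers: each of the catalog access paths touched above
gains the same recheck. Condensed, the pattern (exactly as it appears in
systable_getnext and friends, shown here only for reference) is:

    /*
     * CheckXidAlive is the in-progress xid whose changes we are decoding.
     * If it aborted while we were scanning the catalog, the rows we just
     * read may be inconsistent, so give up and report the abort.
     */
    if (TransactionIdIsValid(CheckXidAlive) &&
        !TransactionIdIsInProgress(CheckXidAlive) &&
        !TransactionIdDidCommit(CheckXidAlive))
        ereport(ERROR,
                (errcode(ERRCODE_TRANSACTION_ROLLBACK),
                 errmsg("transaction aborted during system catalog scan")));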

Attachment: 0007-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From 0251050a1ba6ece6ee3a8826b5e1ac5a136c8e39 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:42:31 +0530
Subject: [PATCH 07/13] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
memory limit (logical_decoding_work_mem), we consume the changes we
have in memory and invoke the new stream API methods. This happens
in ReorderBufferStreamTXN(), using about the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
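
To illustrate the resulting callback sequence, here is a rough sketch (see
ReorderBufferStreamTXN below for the real control flow; iterstate is the
new in-memory iterator state, and relation/lsn stand in for the values the
real code computes per change):

    /* stream one chunk of a large in-progress transaction */
    rb->stream_start(rb, txn);        /* demarcate start of the block */
    while ((change = ReorderBufferStreamIterTXNNext(rb, iterstate)) != NULL)
        rb->stream_change(rb, txn, relation, change);
    rb->stream_stop(rb, txn);         /* demarcate end of the block */

    /* at commit, the remaining changes are streamed the same way, then */
    rb->stream_commit(rb, txn, txn->final_lsn);

    /* if the (sub)transaction aborts instead, the downstream is told */
    rb->stream_abort(rb, txn, lsn);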
---
 src/backend/access/heap/heapam_visibility.c     |   38 +-
 src/backend/replication/logical/reorderbuffer.c | 1075 ++++++++++++++++++++++-
 src/include/replication/reorderbuffer.h         |   32 +
 3 files changed, 1112 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 537e681..76a105a 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bda4a1c..0ab3191 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -149,6 +149,28 @@ typedef struct ReorderBufferIterTXNState
 	ReorderBufferIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
 } ReorderBufferIterTXNState;
 
+/*
+ * k-way in-order change iteration support structures
+ *
+ * This is a simplified version for streaming, which does not require
+ * serialization to files and only reads changes that are currently in
+ * memory.
+ */
+typedef struct ReorderBufferStreamIterTXNEntry
+{
+	XLogRecPtr	lsn;
+	ReorderBufferChange *change;
+	ReorderBufferTXN *txn;
+}			ReorderBufferStreamIterTXNEntry;
+
+typedef struct ReorderBufferStreamIterTXNState
+{
+	binaryheap *heap;
+	Size		nr_txns;
+	dlist_head	old_change;
+	ReorderBufferStreamIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
+}			ReorderBufferStreamIterTXNState;
+
 /* toast datastructures */
 typedef struct ReorderBufferToastEnt
 {
@@ -213,6 +235,20 @@ static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
 static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
 
+
+/* iterator for streaming (only get data from memory) */
+static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit(
+																		ReorderBuffer *rb,
+																		ReorderBufferTXN *txn);
+
+static ReorderBufferChange *ReorderBufferStreamIterTXNNext(
+							   ReorderBuffer *rb,
+							   ReorderBufferStreamIterTXNState * state);
+
+static void ReorderBufferStreamIterTXNFinish(
+								 ReorderBuffer *rb,
+								 ReorderBufferStreamIterTXNState * state);
+
 /*
  * ---------------------------------------
  * Disk serialization support functions
@@ -227,6 +263,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -235,6 +272,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -362,6 +408,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -759,6 +808,33 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+static void
+AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -855,6 +931,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -978,7 +1057,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1006,6 +1085,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	cur_txn_i;
 	int32		off;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(rb, txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1020,6 +1102,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(rb, cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1235,6 +1320,210 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 }
 
 /*
+ * Binary heap comparison function (streaming iterator).
+ */
+static int
+ReorderBufferStreamIterCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferStreamIterTXNState *state = (ReorderBufferStreamIterTXNState *) arg;
+	XLogRecPtr	pos_a = state->entries[DatumGetInt32(a)].lsn;
+	XLogRecPtr	pos_b = state->entries[DatumGetInt32(b)].lsn;
+
+	if (pos_a < pos_b)
+		return 1;
+	else if (pos_a == pos_b)
+		return 0;
+	return -1;
+}
+
+/*
+ * Allocate & initialize an iterator which iterates in lsn order over a
+ * transaction and all its subtransactions. This version is meant for
+ * streaming of incomplete transactions.
+ */
+static ReorderBufferStreamIterTXNState *
+ReorderBufferStreamIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Size		nr_txns = 0;
+	ReorderBufferStreamIterTXNState *state;
+	dlist_iter	cur_txn_i;
+	int32		off;
+
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(rb, txn);
+
+	/*
+	 * Calculate the size of our heap: one element for every transaction that
+	 * contains changes.  (Besides the transactions already in the reorder
+	 * buffer, we count the one we were directly passed.)
+	 */
+	if (txn->nentries > 0)
+		nr_txns++;
+
+	dlist_foreach(cur_txn_i, &txn->subtxns)
+	{
+		ReorderBufferTXN *cur_txn;
+
+		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(rb, cur_txn);
+
+		if (cur_txn->nentries > 0)
+			nr_txns++;
+	}
+
+	/*
+	 * TODO: Consider adding fastpath for the rather common nr_txns=1 case, no
+	 * need to allocate/build a heap then.
+	 */
+
+	/* allocate iteration state */
+	state = (ReorderBufferStreamIterTXNState *)
+		MemoryContextAllocZero(rb->context,
+							   sizeof(ReorderBufferStreamIterTXNState) +
+							   sizeof(ReorderBufferStreamIterTXNEntry) * nr_txns);
+
+	state->nr_txns = nr_txns;
+	dlist_init(&state->old_change);
+
+	/* allocate heap */
+	state->heap = binaryheap_allocate(state->nr_txns,
+									  ReorderBufferStreamIterCompare,
+									  state);
+
+	/*
+	 * Now insert items into the binary heap, in an unordered fashion.  (We
+	 * will run a heap assembly step at the end; this is more efficient.)
+	 */
+
+	off = 0;
+
+	/* add toplevel transaction if it contains changes */
+	if (txn->nentries > 0)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_head_element(ReorderBufferChange, node,
+										&txn->changes);
+
+		state->entries[off].lsn = cur_change->lsn;
+		state->entries[off].change = cur_change;
+		state->entries[off].txn = txn;
+
+		binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+	}
+
+	/* add subtransactions if they contain changes */
+	dlist_foreach(cur_txn_i, &txn->subtxns)
+	{
+		ReorderBufferTXN *cur_txn;
+
+		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+		if (cur_txn->nentries > 0)
+		{
+			ReorderBufferChange *cur_change;
+
+			cur_change = dlist_head_element(ReorderBufferChange, node,
+											&cur_txn->changes);
+
+			state->entries[off].lsn = cur_change->lsn;
+			state->entries[off].change = cur_change;
+			state->entries[off].txn = cur_txn;
+
+			binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+		}
+	}
+
+	Assert(off == nr_txns);
+
+	/* assemble a valid binary heap */
+	binaryheap_build(state->heap);
+
+	return state;
+}
+
+/*
+ * Return the next change when iterating over a transaction and its
+ * subtransactions.
+ *
+ * Returns NULL when no further changes exist.
+ */
+static ReorderBufferChange *
+ReorderBufferStreamIterTXNNext(ReorderBuffer *rb, ReorderBufferStreamIterTXNState * state)
+{
+	ReorderBufferChange *change;
+	ReorderBufferStreamIterTXNEntry *entry;
+	int32		off;
+
+	/* nothing there anymore */
+	if (state->heap->bh_size == 0)
+		return NULL;
+
+	off = DatumGetInt32(binaryheap_first(state->heap));
+	entry = &state->entries[off];
+
+	/* free memory we might have "leaked" in the previous *Next call */
+	if (!dlist_is_empty(&state->old_change))
+	{
+		change = dlist_container(ReorderBufferChange, node,
+								 dlist_pop_head_node(&state->old_change));
+		ReorderBufferReturnChange(rb, change);
+		Assert(dlist_is_empty(&state->old_change));
+	}
+
+	change = entry->change;
+
+	/*
+	 * update heap with information about which transaction has the next
+	 * relevant change in LSN order
+	 */
+
+	/* there are in-memory changes */
+	if (dlist_has_next(&entry->txn->changes, &entry->change->node))
+	{
+		dlist_node *next = dlist_next_node(&entry->txn->changes, &change->node);
+		ReorderBufferChange *next_change =
+		dlist_container(ReorderBufferChange, node, next);
+
+		/* txn stays the same */
+		state->entries[off].lsn = next_change->lsn;
+		state->entries[off].change = next_change;
+
+		binaryheap_replace_first(state->heap, Int32GetDatum(off));
+		return change;
+	}
+
+	/* ok, no changes there anymore, remove */
+	binaryheap_remove_first(state->heap);
+
+	return change;
+}
+
+/*
+ * Deallocate the iterator
+ */
+static void
+ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
+								 ReorderBufferStreamIterTXNState * state)
+{
+	/* free memory we might have "leaked" in the last *Next call */
+	if (!dlist_is_empty(&state->old_change))
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node,
+								 dlist_pop_head_node(&state->old_change));
+		ReorderBufferReturnChange(rb, change);
+		Assert(dlist_is_empty(&state->old_change));
+	}
+
+	binaryheap_free(state->heap);
+	pfree(state);
+}
+
+/*
  * Cleanup the contents of a transaction, usually after the transaction
  * committed or aborted.
  */
@@ -1327,33 +1616,104 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
  */
 static void
-ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	dlist_iter	iter;
-	HASHCTL		hash_ctl;
+	dlist_mutable_iter iter;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
-	hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
-	hash_ctl.hcxt = rb->context;
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
 
-	/*
-	 * create the hash with the exact number of to-be-stored tuplecids from
-	 * the start
-	 */
-	txn->tuplecid_hash =
-		hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
-					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
 
-	dlist_foreach(iter, &txn->tuplecids)
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
+ * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's because
+ * when streaming in-progress transactions we may run into tuples whose
+ * CIDs we have not decoded yet. Think e.g. about INSERT followed by
+ * TRUNCATE, where the TRUNCATE may not be decoded yet when applying
+ * the INSERT. So we build the hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding a transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
+ */
+static void
+ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_iter	iter;
+	HASHCTL		hash_ctl;
+
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+
+	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
+	hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
+	hash_ctl.hcxt = rb->context;
+
+	/*
+	 * create the hash with the exact number of to-be-stored tuplecids from
+	 * the start
+	 */
+	txn->tuplecid_hash =
+		hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	dlist_foreach(iter, &txn->tuplecids)
 	{
 		ReorderBufferTupleCidKey key;
 		ReorderBufferTupleCidEnt *ent;
@@ -1403,6 +1763,16 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 }
 
+static void
+ReorderBufferDestroyTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+}
+
 /*
  * Copy a provided snapshot so we can modify it privately. This is needed so
  * that catalog modifying transactions can look into intermediate catalog
@@ -1476,6 +1846,19 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 		SnapBuildSnapDecRefcount(snap);
 }
 
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
+
+	ReorderBufferStreamTXN(rb, txn);
+
+	rb->stream_commit(rb, txn, txn->final_lsn);
+
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
 /*
  * Perform the replay of a transaction and its non-aborted subtransactions.
  *
@@ -1515,6 +1898,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
 	 * If this transaction has no snapshot, it didn't make any changes to the
 	 * database, so there's nothing to decode.  Note that
 	 * ReorderBufferCommitChild will have transferred any snapshots from
@@ -1549,6 +1948,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
+		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
@@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			if (prev_lsn != InvalidXLogRecPtr)
+				Assert(prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1930,6 +2340,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2014,6 +2431,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2149,8 +2573,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2158,6 +2591,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2169,19 +2603,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2210,6 +2653,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2285,6 +2729,9 @@ ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	for (i = 0; i < txn->ninvalidations; i++)
 		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+
+	/* Invalidate current schema as well */
+	txn->is_schema_sent = false;
 }
 
 /*
@@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * We read catalog changes from WAL, which are not yet sent, so
+	 * invalidate the current schema so that the output plugin can
+	 * resend it.
+	 */
+	txn->is_schema_sent = false;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+	{
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		txn->toptxn->is_schema_sent = false;
+	}
 }
 
 /*
@@ -2403,6 +2867,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming, so theirs is always 0). But
+ * we can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2422,15 +2918,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2723,6 +3250,498 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes to stream
+ * (e.g. when it got streamed right before the commit, which would then
+ * attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+	bool		using_subtxn;
+	Size		streamed = 0;
+	ReorderBufferStreamIterTXNState *volatile iterstate = NULL;
+
+	/*
+	 * If this is a subxact, we need to stream the top-level transaction
+	 * instead.
+	 */
+	if (txn->toptxn)
+	{
+		ReorderBufferStreamTXN(rb, txn->toptxn);
+		return;
+	}
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+
+			if (subtxn->base_snapshot != NULL &&
+				(txn->base_snapshot == NULL ||
+				 txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
+			{
+				txn->base_snapshot = subtxn->base_snapshot;
+				txn->base_snapshot_lsn = subtxn->base_snapshot_lsn;
+				subtxn->base_snapshot = NULL;
+				subtxn->base_snapshot_lsn = InvalidXLogRecPtr;
+			}
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have a snapshot from the previous streaming run.
+		 * We assume new subxacts can't move the LSN backwards, and so can't
+		 * beat the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that, as we may
+		 * be using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+		snapshot_now = txn->snapshot_now;
+
+		/*
+		 * TOCHECK: We have to rebuild the historic snapshot to be sure it
+		 * includes all subtransactions that started after streaming began.
+		 */
+		if (!txn->is_schema_sent)
+			snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+												 txn, command_id);
+	}
+
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
+	ReorderBufferBuildTupleCidHash(rb, txn);
+
+	/* setup the initial snapshot */
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
+
+	/*
+	 * Decoding needs access to syscaches et al., which in turn use
+	 * heavyweight locks and such. Thus we need to have enough state around to
+	 * keep track of those.  The easiest way is to simply use a transaction
+	 * internally.  That also allows us to easily enforce that nothing writes
+	 * to the database by checking for xid assignments.
+	 *
+	 * When we're called via the SQL SRF there's already a transaction
+	 * started, so start an explicit subtransaction there.
+	 */
+	using_subtxn = IsTransactionOrTransactionBlock();
+
+	PG_TRY();
+	{
+		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+		ReorderBufferChange *change;
+		ReorderBufferChange *specinsert = NULL;
+
+		if (using_subtxn)
+			BeginInternalSubTransaction("stream");
+		else
+			StartTransactionCommand();
+
+		/* start streaming this chunk of transaction */
+		rb->stream_start(rb, txn);
+
+		iterstate = ReorderBufferStreamIterTXNInit(rb, txn);
+		while ((change = ReorderBufferStreamIterTXNNext(rb, iterstate)) != NULL)
+		{
+			Relation	relation = NULL;
+			Oid			reloid;
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			if (prev_lsn != InvalidXLogRecPtr)
+				Assert(prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* we're going to stream this change */
+			streamed++;
+
+			switch (change->action)
+			{
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+
+					/*
+					 * Confirmation for speculative insertion arrived. Simply
+					 * use as a normal record. It'll be cleaned up at the end
+					 * of INSERT processing.
+					 */
+					Assert(specinsert->data.tp.oldtuple == NULL);
+					change = specinsert;
+					change->action = REORDER_BUFFER_CHANGE_INSERT;
+
+					/* intentionally fall through */
+				case REORDER_BUFFER_CHANGE_INSERT:
+				case REORDER_BUFFER_CHANGE_UPDATE:
+				case REORDER_BUFFER_CHANGE_DELETE:
+					Assert(snapshot_now);
+
+					reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
+												change->data.tp.relnode.relNode);
+
+					/*
+					 * Catalog tuple without data, emitted while catalog was
+					 * in the process of being rewritten.
+					 */
+					if (reloid == InvalidOid &&
+						change->data.tp.newtuple == NULL &&
+						change->data.tp.oldtuple == NULL)
+						goto change_done;
+					else if (reloid == InvalidOid)
+						elog(ERROR, "could not map filenode \"%s\" to relation OID",
+							 relpathperm(change->data.tp.relnode,
+										 MAIN_FORKNUM));
+
+					relation = RelationIdGetRelation(reloid);
+
+					if (relation == NULL)
+						elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
+							 reloid,
+							 relpathperm(change->data.tp.relnode,
+										 MAIN_FORKNUM));
+
+					if (!RelationIsLogicallyLogged(relation))
+						goto change_done;
+
+					/*
+					 * For now ignore sequence changes entirely. Most of the
+					 * time they don't log changes using records we
+					 * understand, so it doesn't make sense to handle the few
+					 * cases we do.
+					 */
+					if (relation->rd_rel->relkind == RELKIND_SEQUENCE)
+						goto change_done;
+
+					/* user-triggered change */
+					if (!IsToastRelation(relation))
+					{
+						ReorderBufferToastReplace(rb, txn, relation, change);
+						rb->stream_change(rb, txn, relation, change);
+
+						/*
+						 * Only clear reassembled toast chunks if we're sure
+						 * they're not required anymore. The creator of the
+						 * tuple tells us.
+						 */
+						if (change->data.tp.clear_toast_afterwards)
+							ReorderBufferToastReset(rb, txn);
+					}
+					/* we're not interested in toast deletions */
+					else if (change->action == REORDER_BUFFER_CHANGE_INSERT)
+					{
+						/*
+						 * Need to reassemble the full toasted Datum in
+						 * memory, to ensure the chunks don't get reused till
+						 * we're done; remove it from the list of this
+						 * transaction's changes. Otherwise it will get
+						 * freed/reused while restoring spooled data from
+						 * disk.
+						 */
+						dlist_delete(&change->node);
+						ReorderBufferToastAppendChunk(rb, txn, relation,
+													  change);
+					}
+
+			change_done:
+
+					/*
+					 * Either speculative insertion was confirmed, or it was
+					 * unsuccessful and the record isn't needed anymore.
+					 */
+					if (specinsert != NULL)
+					{
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+
+					if (relation != NULL)
+					{
+						RelationClose(relation);
+						relation = NULL;
+					}
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+
+					/*
+					 * Speculative insertions are dealt with by delaying the
+					 * processing of the insert until the confirmation record
+					 * arrives. For that we simply unlink the record from the
+					 * chain, so it does not get freed/reused while restoring
+					 * spooled data from disk.
+					 *
+					 * This is safe in the face of concurrent catalog changes
+					 * because the relevant relation can't be changed between
+					 * speculative insertion and confirmation due to
+					 * CheckTableNotInUse() and locking.
+					 */
+
+					/* clear out a pending (and thus failed) speculation */
+					if (specinsert != NULL)
+					{
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+
+					/* and memorize the pending insertion */
+					dlist_delete(&change->node);
+					specinsert = change;
+					break;
+
+				case REORDER_BUFFER_CHANGE_TRUNCATE:
+					{
+						int			i;
+						int			nrelids = change->data.truncate.nrelids;
+						int			nrelations = 0;
+						Relation   *relations;
+
+						relations = palloc0(nrelids * sizeof(Relation));
+						for (i = 0; i < nrelids; i++)
+						{
+							Oid			relid = change->data.truncate.relids[i];
+							Relation	relation;
+
+							relation = RelationIdGetRelation(relid);
+
+							if (relation == NULL)
+								elog(ERROR, "could not open relation with OID %u", relid);
+
+							if (!RelationIsLogicallyLogged(relation))
+								continue;
+
+							relations[nrelations++] = relation;
+						}
+
+						rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+						for (i = 0; i < nrelations; i++)
+							RelationClose(relations[i]);
+
+						break;
+					}
+
+				case REORDER_BUFFER_CHANGE_MESSAGE:
+
+					rb->stream_message(rb, txn, change->lsn, true,
+									   change->data.msg.prefix,
+									   change->data.msg.message_size,
+									   change->data.msg.message);
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+					/* get rid of the old */
+					TeardownHistoricSnapshot(false);
+
+					if (snapshot_now->copied)
+					{
+						ReorderBufferFreeSnap(rb, snapshot_now);
+						snapshot_now =
+							ReorderBufferCopySnap(rb, change->data.snapshot,
+												  txn, command_id);
+					}
+
+					/*
+					 * Restored from disk, need to be careful not to double
+					 * free. We could introduce refcounting for that, but for
+					 * now this seems infrequent enough not to care.
+					 */
+					else if (change->data.snapshot->copied)
+					{
+						snapshot_now =
+							ReorderBufferCopySnap(rb, change->data.snapshot,
+												  txn, command_id);
+					}
+					else
+					{
+						snapshot_now = change->data.snapshot;
+					}
+
+					/*
+					 * TOCHECK: The snapshot changed, so invalidate the current
+					 * schema to reflect possible catalog changes.
+					 */
+					txn->is_schema_sent = false;
+
+					/* and continue with the new one */
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+					Assert(change->data.command_id != InvalidCommandId);
+
+					if (command_id < change->data.command_id)
+					{
+						command_id = change->data.command_id;
+
+						if (!snapshot_now->copied)
+						{
+							/* we don't use the global one anymore */
+							snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+																 txn, command_id);
+						}
+
+						snapshot_now->curcid = command_id;
+
+						TeardownHistoricSnapshot(false);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
+					}
+
+					break;
+
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					txn->is_schema_sent = false;
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+					elog(ERROR, "tuplecid value in changequeue");
+					break;
+			}
+		}
+
+		/*
+		 * There's a speculative insertion remaining; just clean it up, as it
+		 * can't have been successful, otherwise we'd have gotten a confirmation
+		 * record.
+		 */
+		if (specinsert)
+		{
+			ReorderBufferReturnChange(rb, specinsert);
+			specinsert = NULL;
+		}
+
+		/* clean up the iterator */
+		ReorderBufferStreamIterTXNFinish(rb, iterstate);
+		iterstate = NULL;
+
+		/* call stream_stop callback */
+		rb->stream_stop(rb, txn);
+
+		/* this is just a sanity check against bad output plugin behaviour */
+		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
+			elog(ERROR, "output plugin used XID %u",
+				 GetCurrentTransactionId());
+
+		/* remember the command ID and snapshot for the streaming run */
+		txn->command_id = command_id;
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+
+		/* cleanup */
+		TeardownHistoricSnapshot(false);
+
+		/*
+		 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+		 * any memory. We could also keep the hash table and update it with
+		 * new ctid values, but this seems simpler and good enough for now.
+		 */
+		ReorderBufferDestroyTupleCidHash(rb, txn);
+
+		/*
+		 * Aborting the current (sub-)transaction as a whole has the right
+		 * semantics. We want all locks acquired in here to be released, not
+		 * reassigned to the parent and we do not want any database access
+		 * reassigned to the parent, and we do not want any database access
+		 * to have persistent effects.
+		AbortCurrentTransaction();
+
+		/* make sure there's no cache pollution */
+		ReorderBufferExecuteInvalidations(rb, txn);
+
+		if (using_subtxn)
+			RollbackAndReleaseCurrentSubTransaction();
+	}
+	PG_CATCH();
+	{
+		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+		if (iterstate)
+			ReorderBufferStreamIterTXNFinish(rb, iterstate);
+
+		TeardownHistoricSnapshot(true);
+
+		/*
+		 * Force cache invalidation to happen outside of a valid transaction
+		 * to prevent catalog access as we just caught an error.
+		 */
+		AbortCurrentTransaction();
+
+		/* make sure there's no cache pollution */
+		ReorderBufferExecuteInvalidations(rb, txn);
+
+		if (using_subtxn)
+			RollbackAndReleaseCurrentSubTransaction();
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/*
+	 * Discard the changes that we just streamed, and mark the transactions
+	 * as streamed (if they contained changes).
+	 */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 19c7bac..7d08e2f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* does the txn have catalog changes */
 #define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -187,6 +188,20 @@ typedef struct ReorderBufferChange
  */
 #define rbtxn_is_serialized(txn)       (txn->txn_flags & RBTXN_IS_SERIALIZED)
 
+/*
+ * Has this transaction been streamed to downstream? Similarly to spilling
+ * to disk, it's not trivial to deduce this from nentries and nentries_mem,
+ * for various reasons. For example, all changes may be in subtransactions
+ * in which case we'd have nentries==0 for the toplevel one, and it'd say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.
+ *
+ * Note: We never stream and serialize a transaction at the same time (we
+ * only spill to disk when streaming is not supported by the plugin),
+ * so only one of those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn)         (txn->txn_flags & RBTXN_IS_STREAMED)
+
 typedef struct ReorderBufferTXN
 {
 	int     txn_flags;
@@ -222,6 +237,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Do we need to send schema for this transaction in output plugin?
+	 * Do we need to send the schema for this transaction to the output plugin?
+	bool		is_schema_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -252,6 +277,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
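
To summarize the effect of this patch on the memory-limit code path, the
decision in ReorderBufferCheckMemoryLimit now boils down to the following
(a simplified sketch; the real code also asserts that the picked
transaction is non-empty and that its accounting drops to zero afterwards):

    /* once the logical_decoding_work_mem limit is reached ... */
    if (ReorderBufferCanStream(rb))
        /* plugin supports streaming: stream the largest toplevel xact */
        ReorderBufferStreamTXN(rb, ReorderBufferLargestTopTXN(rb));
    else
        /* no streaming support: spill the largest (sub)xact to disk */
        ReorderBufferSerializeTXN(rb, ReorderBufferLargestTXN(rb));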

Attachment: 0008-Support-logical_decoding_work_mem-set-from-create-su.patch (application/octet-stream)
From 54556711ea38bcd407bdc65f12f6c70ec2e8a592 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:51:04 +0530
Subject: [PATCH 08/13] Support logical_decoding_work_mem set from create
 subscription command

---
 doc/src/sgml/config.sgml                           | 21 +++++++++++
 doc/src/sgml/ref/create_subscription.sgml          | 12 ++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/commands/subscriptioncmds.c            | 44 +++++++++++++++++++---
 .../libpqwalreceiver/libpqwalreceiver.c            |  3 ++
 src/backend/replication/logical/worker.c           |  1 +
 src/backend/replication/pgoutput/pgoutput.c        | 30 ++++++++++++++-
 src/include/catalog/pg_subscription.h              |  3 ++
 src/include/replication/walreceiver.h              |  1 +
 9 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe4..f449fa1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1753,6 +1753,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..91790b0 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 68d88ff..2a27648 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5408edc..fbb4473 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +690,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +721,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +778,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +815,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 545d2fc..0ab6855 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ff62303..14c0ce8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1726,6 +1726,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 3483c1b..cf6e03b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -18,6 +18,7 @@
 #include "replication/logicalproto.h"
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
+#include "utils/guc.h"
 #include "utils/int8.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
@@ -87,11 +88,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +139,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -171,7 +196,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3cb13d8..10ea113 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e12a934..4e68a69 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -162,6 +162,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
1.8.3.1

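For illustration, a minimal usage sketch of the new subscription option
(the subscription, connection string, and publication names are
hypothetical; values are in kB, matching the 64kB minimum enforced in
pgoutput above):

-- Create a subscription with a 64MB memory limit for decoding changes
-- on the publisher (values are in kB, so 65536 kB = 64 MB).
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (work_mem = 65536);

-- The limit can be changed later without recreating the subscription.
ALTER SUBSCRIPTION mysub SET (work_mem = 131072);
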
Attachment: 0010-Track-statistics-for-streaming.patch (application/octet-stream)
From bf0e4ba0c0441cbbf52989f35b49fdbf7da01fdc Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 12:19:49 +0530
Subject: [PATCH 10/13] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 ++++-
 src/backend/replication/logical/reorderbuffer.c | 13 +++++++++++
 src/backend/replication/walsender.c             | 30 ++++++++++++++++++++-----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 +++++++----
 src/include/replication/walsender_private.h     |  5 +++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 89 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a3c5f86..3de62c0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1992,6 +1992,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber
+      after the memory used by logical decoding exceeds
+      <literal>logical_work_mem</literal>.  Streaming only works with
+      toplevel transactions (subtransactions cannot be streamed
+      independently), so the counter is not incremented for subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the subscriber.
+      Transactions may get streamed repeatedly, and this counter gets incremented
+      on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f7800f0..5897611 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -779,7 +779,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0ab3191..83eb4df 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -358,6 +358,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3732,6 +3736,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	PG_END_TRY();
 
 	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count a transaction that has already been streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
+	/*
 	 * Discard the changes that we just streamed, and mark the transactions
 	 * as streamed (if they contained changes).
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a2ae283..8619a01 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1269,7 +1269,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1290,7 +1290,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that were spilled to disk or
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2334,6 +2335,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3235,7 +3239,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3293,6 +3297,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3316,6 +3323,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3402,6 +3412,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3650,8 +3665,13 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %ld %ld %ld",
-		 rb, rb->spillTxns, rb->spillCount, rb->spillBytes);
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %ld %ld %ld %ld %ld %ld",
+		 rb, rb->spillTxns, rb->spillCount, rb->spillBytes,
+		 rb->streamTxns, rb->streamCount, rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fa0a2a1..9a508bf 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5166,9 +5166,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 7d08e2f..c183fce 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -511,15 +511,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index a6b3205..7efc332 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c9cc569..9fe3dd5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1955,9 +1955,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1

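To see how the new counters might be consumed, a quick sketch (the column
names come from the patch; the query itself is merely illustrative):

-- stream_txns counts only toplevel transactions, while stream_count is
-- incremented on every streaming invocation, so stream_count >= stream_txns.
SELECT application_name,
       spill_txns, spill_count, pg_size_pretty(spill_bytes) AS spilled,
       stream_txns, stream_count, pg_size_pretty(stream_bytes) AS streamed
  FROM pg_stat_replication;
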
Attachment: 0009-Add-support-for-streaming-to-built-in-replication.patch (application/octet-stream)
From 81193e00db98be3807352d84e712d13b5b4049d9 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:53:58 +0530
Subject: [PATCH 09/13] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information
(e.g. the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we would have
nowhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    5 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   60 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    8 +-
 src/backend/replication/logical/launcher.c         |    2 +
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  157 ++-
 src/backend/replication/logical/worker.c           | 1031 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  263 ++++-
 src/backend/replication/slotfuncs.c                |    7 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2027 insertions(+), 42 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

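Before the diffs, a short sketch of the user-facing interface this patch
adds (names are hypothetical; streaming defaults to off, as documented
below):

-- Enable streaming of in-progress transactions for a new subscription.
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (streaming = on);

-- Streaming can also be toggled on an existing subscription.
ALTER SUBSCRIPTION mysub SET (streaming = off);
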
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 6dfb2e4..e1fb907 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 91790b0..d9abf5e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -218,6 +218,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 2a27648..15a6f5a 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->workmem = subform->subworkmem;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index fbb4473..b2b93d6 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
 						   bool *refresh, int *logical_wm,
-						   bool *logical_wm_given)
+						   bool *logical_wm_given, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -92,6 +93,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*refresh = true;
 	if (logical_wm)
 		*logical_wm_given = false;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -186,6 +189,26 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 
 			*logical_wm_given = true;
 			*logical_wm = defGetInt32(defel);
+
+			/*
+			 * Check that the value is not smaller than 64kB (which is
+			 * the minimum value for logical_work_mem).
+			 */
+			if (*logical_wm < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("%d is outside the valid range for parameter \"work_mem\" (64 .. 2147483647)",
+								*logical_wm)));
+		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
 		}
 		else
 			ereport(ERROR,
@@ -332,6 +355,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	int			logical_wm;
 	bool		logical_wm_given;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -348,7 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL, &logical_wm, &logical_wm_given);
+							   NULL, &logical_wm, &logical_wm_given,
+							   &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -430,7 +456,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 		values[Anum_pg_subscription_subworkmem - 1] =
 			Int32GetDatum(logical_wm);
 	else
-		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(-1);
+
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
 
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
@@ -692,11 +726,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				int			logical_wm;
 				bool		logical_wm_given;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
 										   NULL, &synchronous_commit, NULL,
-										   &logical_wm, &logical_wm_given);
+										   &logical_wm, &logical_wm_given,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -728,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subworkmem - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -740,7 +784,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
 										   NULL, NULL, NULL, NULL, NULL,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -778,7 +822,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh, NULL, NULL);
+										   NULL, &refresh, NULL, NULL,
+										   NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -815,7 +860,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index fabcf31..75effed 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4104,6 +4104,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 0ab6855..9970170 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,8 +406,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
-		appendStringInfo(&cmd, ", work_mem '%d'",
-						 options->proto.logical.work_mem);
+		if (options->proto.logical.work_mem != -1)
+			appendStringInfo(&cmd, ", work_mem '%d'",
+							 options->proto.logical.work_mem);
+
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
 
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 1f8821c..9307b67 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>
 
 #include "postgres.h"
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index b88b585..ad43ab3 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1149,7 +1149,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index e7df47d..5a379fb 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,7 +139,8 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
@@ -147,6 +148,10 @@ logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -182,8 +187,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -191,6 +196,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -252,13 +261,18 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
+	pq_sendbyte(out, 'D');		/* action DELETE */
+
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
-	pq_sendbyte(out, 'D');		/* action DELETE */
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
 
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
@@ -300,6 +314,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -309,6 +324,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -351,12 +370,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -401,7 +424,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -409,6 +432,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -689,3 +716,119 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgint(in, 4) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID of the streamed transaction (must be valid) */
+	pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+	TransactionId xid;
+
+	xid = pq_getmsgint(in, 4);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID of the transaction being committed (must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* toplevel XID and subxact XID (both must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 14c0ce8..65e47aa 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied all at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires handling aborts of both the toplevel transaction and its
+ * subtransactions. This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in /tmp by default, and the filenames include both
+ * the XID of the toplevel transaction and the OID of the subscription. This
+ * is necessary so that different workers processing a remote transaction
+ * with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -59,6 +81,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -66,6 +89,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -105,6 +129,50 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
@@ -114,6 +182,9 @@ static void maybe_reread_subscription(void);
 /* Flags set by signal handlers */
 static volatile sig_atomic_t got_SIGHUP = false;
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -165,6 +236,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -512,6 +619,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the existing subxact info.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* FIXME optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/* We should not receive aborts for unknown subtransactions. */
+		Assert(found);
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware that we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -524,6 +943,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -539,6 +961,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -574,6 +999,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -677,6 +1105,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -797,6 +1228,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -896,6 +1330,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -987,6 +1424,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1084,6 +1537,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1099,6 +1568,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1547,6 +2019,564 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole, and we also include a CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we do free the memory allocated for the subxact info. There might
+	 * be one exceptional transaction with many subxacts, and we don't want
+	 * to keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
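
To make the format concrete, the on-disk layout written above is, in order
(a sketch; the actual SubXactInfo declaration lives earlier in this patch,
and nsubxacts is assumed to be a 32-bit counter here):

    uint32       checksum;    /* CRC32C over nsubxacts and subxacts[] */
    uint32       nsubxacts;   /* number of array entries */
    SubXactInfo  subxacts[];  /* one entry per subtransaction */

    typedef struct SubXactInfo
    {
        TransactionId xid;     /* XID of the subtransaction */
        off_t         offset;  /* offset of its first change in the
                                * .changes file of the toplevel xact */
    } SubXactInfo;
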
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as in the previous
+	 * call, so we can simply ignore it (its entry was added earlier).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the subxacts array. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for the first
+	 * segment of each transaction, to deal with possible leftovers
+	 * after a crash, so it's entirely possible not to find the XID in
+	 * the array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry in the array into the now-free slot. We don't
+	 * keep the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts),
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (covering the
+ * action code and message contents, but not the length field itself),
+ * an action code (identifying the message type), and the message
+ * contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
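As a counterpart to the framing above, reading one record back would look
roughly like this minimal sketch (the function name is mine, and the caller
is assumed to position fd at a record boundary; the patch's actual replay
loop lives in apply_handle_stream_commit):

    /* Sketch: read one serialized change; returns false at EOF. */
    static bool
    stream_read_change(int fd, char *action, StringInfo buf)
    {
        int         len;

        /* total record size, including the action byte */
        if (read(fd, &len, sizeof(len)) != sizeof(len))
            return false;

        if (read(fd, action, sizeof(char)) != sizeof(char))
            return false;

        /* the remainder is the message body (already without the XID) */
        len -= sizeof(char);

        resetStringInfo(buf);
        enlargeStringInfo(buf, len);

        if (read(fd, buf->data, len) != len)
            return false;

        buf->data[len] = '\0';
        buf->len = len;
        return true;
    }
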
 /* SIGHUP: set flag to reload configuration at next convenient time */
 static void
 logicalrep_worker_sighup(SIGNAL_ARGS)
@@ -1727,6 +2757,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.work_mem = MySubscription->workmem;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index cf6e03b..8490ea4 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -45,16 +45,42 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order the transactions are sent in. So streamed transactions are
+ * handled separately, using the schema_sent flag in ReorderBufferTXN.
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -64,6 +90,7 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
@@ -84,16 +111,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, int *logical_decoding_work_mem)
+						List **publication_names, int *logical_decoding_work_mem,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		work_mem_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -162,6 +199,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*logical_decoding_work_mem = (int)parsed;
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") != 0 && strcmp(strVal(defel->arg), "off") != 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -174,6 +228,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -197,7 +252,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&logical_decoding_work_mem);
+								&logical_decoding_work_mem,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -217,6 +273,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -284,9 +361,42 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for this change. We don't
+	 * care whether it's the top-level transaction or a subtransaction (we
+	 * have already sent the top-level XID when starting the streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied much later (if at all),
+	 * in a commit order we don't know at this point, and the regular
+	 * transactions won't see their effects until then.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change,
+		 * and one may occur after streaming has already started, so we
+		 * have to track new catalog changes somehow.
+		 */
+		schema_sent = txn->is_schema_sent;
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -312,19 +422,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			txn->is_schema_sent = true;
+		else
+			relentry->schema_sent = true;
 	}
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -333,6 +450,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -361,14 +482,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -378,7 +499,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -387,7 +508,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -413,6 +534,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -433,13 +558,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -513,6 +639,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
+ * Notify downstream that we're starting to stream a chunk of changes for
+ * this toplevel transaction.
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * Notify downstream that we've finished streaming the current chunk of
+ * changes for this transaction.
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
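Taken together, these callbacks bracket each chunk of a large transaction.
Assuming the obvious pairing (the actual invocation sites are in
reorderbuffer.c, not shown in this excerpt), the plugin is driven roughly
like this:

    stream_start_cb(ctx, txn);           /* begin a chunk */
    stream_change_cb(ctx, txn, change);  /* repeated per decoded change */
    stream_stop_cb(ctx, txn);            /* end the chunk */
    ...                                  /* more chunks as memory fills up */
    stream_commit_cb(ctx, txn, lsn);     /* or stream_abort_cb on rollback */
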
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -623,6 +834,34 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 46e6dd4..c98d476 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,13 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fa75872..a2ae283 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -945,6 +945,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 10ea113..8793676 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -50,6 +50,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	int32		subworkmem;		/* Memory to use to decode changes. */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -76,6 +78,7 @@ typedef struct Subscription
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	int			workmem;		/* Memory to decode changes. */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fe076d8..bc45194 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -944,7 +944,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 3fc430a..bf02cbc 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
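
A client that wants streaming has to request at least the stream-capable
protocol version. A sketch of how a consumer might decide what to ask for
(the helper is hypothetical; pgoutput itself just validates the requested
version at startup):

    static uint32
    choose_proto_version(bool want_streaming)
    {
        /* streaming needs protocol 2 or newer; version 1 suffices otherwise */
        return want_streaming ? LOGICALREP_PROTO_STREAM_VERSION_NUM
                              : LOGICALREP_PROTO_MIN_VERSION_NUM;
    }
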
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out, TransactionId xid);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4e68a69..fe6acb4 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -163,6 +163,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			int			work_mem;	/* Memory limit to use for decoding */
+			bool		streaming;	/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check columns added by mid-transaction DDL were replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransactions are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check only committed changes are replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: 0012-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
From 9699ab41ef32af4a0bafff7d4263a9d6d534ccc4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH 12/13] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 83eb4df..410da36 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1544,6 +1544,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
+		/* make sure subtxn has final_lsn */
+		if (subtxn->final_lsn == InvalidXLogRecPtr)
+			subtxn->final_lsn = txn->final_lsn;
+
 		/*
 		 * Subtransactions are always associated to the toplevel TXN, even if
 		 * they originally were happening inside another subtxn, so we won't
-- 
1.8.3.1

Attachment: 0011-Enable-streaming-for-all-subscription-TAP-tests.patch
From cf6ce6192de8ebcaa8942fb234426e30e976055d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH 11/13] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 40e306a..f41a0e1 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -64,7 +64,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 81547f6..8dfeafc 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

0013-Add-TAP-test-for-streaming-vs.-DDL.patch (application/octet-stream)
From 79bba6bd97b938917720012d99a19ec614752976 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH 13/13] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

#143Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#142)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Nov 20, 2019 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few other comments on this patch:
1.
+ case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+ /*
+ * Execute the invalidation message locally.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */
+ LocalExecuteInvalidationMessage(&change->data.inval.msg);
+ break;

Here, why are we executing messages individually? Can't we just
follow what we do in DecodeCommit which is to record the invalidations
in ReorderBufferTXN as we encounter them and then allow them to
execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
reason why we don't do ReorderBufferXidSetCatalogChanges when we
receive any invalidation message?

I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
commit. Because this is required to add any committed transaction to
the snapshot if it has done any catalog changes.

Hmm, this is also used to build cid hash map (see
ReorderBufferBuildTupleCidHash) which we need to use while streaming
changes for the in-progress transactions. So, I think that it would
be required earlier (before commit) as well.

Oh right, I guess I missed that part.
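
A minimal sketch of that suggestion (helper names are made up; it
assumes the existing ninvalidations/invalidations fields of
ReorderBufferTXN and the LocalExecuteInvalidationMessage() API):

/* accumulate the message on the toplevel txn, as DecodeCommit does */
static void
sketch_stash_invalidation(ReorderBufferTXN *txn,
						  SharedInvalidationMessage *msg)
{
	/* a real version would allocate in the reorder buffer's context */
	if (txn->invalidations == NULL)
		txn->invalidations = (SharedInvalidationMessage *)
			palloc(sizeof(SharedInvalidationMessage));
	else
		txn->invalidations = (SharedInvalidationMessage *)
			repalloc(txn->invalidations,
					 (txn->ninvalidations + 1) *
					 sizeof(SharedInvalidationMessage));

	txn->invalidations[txn->ninvalidations++] = *msg;
}

/* replay everything collected so far, at each INTERNAL_COMMAND_ID */
static void
sketch_replay_invalidations(ReorderBufferTXN *txn)
{
	uint32		i;

	for (i = 0; i < txn->ninvalidations; i++)
		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
}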

Attached a new rebased version of the patch set. I have fixed all
the issues discussed up-thread and agreed upon.

Pending Issues:
1. The default value of logical_decoding_work_mem is set to 64kB
in test_decoding/logical.conf. So we need to change the expected
output files for the test_decoding module.
2. Need to complete the patch for concurrent abort handling of the
(sub)transaction. There are some pending issues with the existing
patch[1].
[1] /messages/by-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A@mail.gmail.com

Apart from these there is one more issue reported upthread[2]
[2] /messages/by-id/CAFiTN-vrSNkAfRVrWKe2R1dqFBTubjt=DYS=jhH+jiCoBODdaw@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#144Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#140)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Nov 19, 2019 at 5:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Nov 16, 2019 at 6:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Some notes before commit:
--------------------------------------
1.
Commit message needs to be changed for the first patch
-------------------------------------------------------------------------
A.

The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this

SET logical_decoding_work_mem = '128kB'

to trigger very aggressive streaming. The minimum value is 64kB.

I think this patch doesn't contain streaming, so we either need to
reword it or remove it.

B.

The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the
subscription, using a work_mem parameter in the WITH clause (specifies number of kilobytes).

We need to reword this as we have decided to remove the setting from
the subscription side as of now.

2. I think we can change the message level in UpdateSpillStats() to DEBUG2.

I have made these modifications and additionally ran pgindent.

4. I think we can combine both patches and commit as one patch, but it
is okay to commit them separately as well.

I am not sure if this is a good idea, so still kept them as separate.

I have committed the first patch. I will commit the second one
related to stats of spilled xacts on Thursday. The second patch needs
a catalog version bump as well because we are modifying the catalog
contents in that patch.

Committed the second one as well. Now, we can move to a review of
patches for "streaming of in-progress transactions".

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#145Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#143)
13 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Nov 21, 2019 at 9:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Nov 20, 2019 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few other comments on this patch:
1.
+ case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+ /*
+ * Execute the invalidation message locally.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */
+ LocalExecuteInvalidationMessage(&change->data.inval.msg);
+ break;

Here, why are we executing messages individually? Can't we just
follow what we do in DecodeCommit which is to record the invalidations
in ReorderBufferTXN as we encounter them and then allow them to
execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
reason why we don't do ReorderBufferXidSetCatalogChanges when we
receive any invalidation message?

I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
commit. Because this is required to add any committed transaction to
the snapshot if it has done any catalog changes.

Hmm, this is also used to build cid hash map (see
ReorderBufferBuildTupleCidHash) which we need to use while streaming
changes for the in-progress transactions. So, I think that it would
be required earlier (before commit) as well.

Oh right, I guess I missed that part.

Attached a new rebased version of the patch set. I have fixed all
the issues discussed up-thread and agreed upon.

Pending Issues:
1. The default value of logical_decoding_work_mem is set to 64kB
in test_decoding/logical.conf. So we need to change the expected
output files for the test_decoding module.
2. Need to complete the patch for concurrent abort handling of the
(sub)transaction. There are some pending issues with the existing
patch[1].
[1] /messages/by-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A@mail.gmail.com

Apart from these there is one more issue reported upthread[2]
[2] /messages/by-id/CAFiTN-vrSNkAfRVrWKe2R1dqFBTubjt=DYS=jhH+jiCoBODdaw@mail.gmail.com

I have rebased the patch on the latest head and also fixed the issue of
"concurrent abort handling of the (sub)transaction", attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set. I have added the version number so that we
can track the changes.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0007-Support-logical_decoding_work_mem-set-from-create.patch (application/octet-stream)
From ac49838b249e181e0fbd9297b034b96b6636f964 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:51:04 +0530
Subject: [PATCH v1 07/13] Support logical_decoding_work_mem set from create
 subscription command

---
 doc/src/sgml/config.sgml                           | 21 +++++++++++
 doc/src/sgml/ref/create_subscription.sgml          | 12 ++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/commands/subscriptioncmds.c            | 44 +++++++++++++++++++---
 .../libpqwalreceiver/libpqwalreceiver.c            |  3 ++
 src/backend/replication/logical/worker.c           |  1 +
 src/backend/replication/pgoutput/pgoutput.c        | 30 ++++++++++++++-
 src/include/catalog/pg_subscription.h              |  3 ++
 src/include/replication/walreceiver.h              |  1 +
 9 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe4..f449fa1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1753,6 +1753,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are either written to local disk.
+        before some of the decoded changes are written to local disk.
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..91790b0 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 68d88ff..2a27648 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5408edc..fbb4473 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +690,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +721,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +778,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +815,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 545d2fc..0ab6855 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ff62303..14c0ce8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1726,6 +1726,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 3483c1b..cf6e03b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -18,6 +18,7 @@
 #include "replication/logicalproto.h"
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
+#include "utils/guc.h"
 #include "utils/int8.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
@@ -87,11 +88,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +139,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -171,7 +196,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3cb13d8..10ea113 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e12a934..4e68a69 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -162,6 +162,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
1.8.3.1
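
For reference, a usage sketch of the option this patch adds (names and
connection string are placeholders; the value is in kilobytes, per the
docs hunk above):

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (work_mem = 1024);

ALTER SUBSCRIPTION mysub SET (work_mem = 2048);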

v1-0008-Add-support-for-streaming-to-built-in-replication.patch (application/octet-stream)
From 3968b2aa385b703316109df3eaef0dce7e027f0b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:53:58 +0530
Subject: [PATCH v1 08/13] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and allow adding additional bits of information (e.g.
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open so we
don't have anywhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    5 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   60 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    8 +-
 src/backend/replication/logical/launcher.c         |    2 +
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  157 ++-
 src/backend/replication/logical/worker.c           | 1031 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  263 ++++-
 src/backend/replication/slotfuncs.c                |    7 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2027 insertions(+), 42 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 6dfb2e4..e1fb907 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 91790b0..d9abf5e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -218,6 +218,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 2a27648..15a6f5a 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->workmem = subform->subworkmem;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index fbb4473..b2b93d6 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
 						   bool *refresh, int *logical_wm,
-						   bool *logical_wm_given)
+						   bool *logical_wm_given, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -92,6 +93,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*refresh = true;
 	if (logical_wm)
 		*logical_wm_given = false;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -186,6 +189,26 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 
 			*logical_wm_given = true;
 			*logical_wm = defGetInt32(defel);
+
+			/*
+			 * Check that the value is not smaller than 64kB (which is
+			 * the minimum value for logical_decoding_work_mem).
+			 */
+			if (*logical_wm < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("%d is outside the valid range for parameter \"work_mem\" (64 .. 2147483647)",
+								*logical_wm)));
+		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
 		}
 		else
 			ereport(ERROR,
@@ -332,6 +355,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	int			logical_wm;
 	bool		logical_wm_given;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -348,7 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL, &logical_wm, &logical_wm_given);
+							   NULL, &logical_wm, &logical_wm_given,
+							   &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -430,7 +456,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 		values[Anum_pg_subscription_subworkmem - 1] =
 			Int32GetDatum(logical_wm);
 	else
-		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(-1);
+
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
 
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
@@ -692,11 +726,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				int			logical_wm;
 				bool		logical_wm_given;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
 										   NULL, &synchronous_commit, NULL,
-										   &logical_wm, &logical_wm_given);
+										   &logical_wm, &logical_wm_given,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -728,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subworkmem - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -740,7 +784,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
 										   NULL, NULL, NULL, NULL, NULL,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -778,7 +822,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh, NULL, NULL);
+										   NULL, &refresh, NULL, NULL,
+										   NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -815,7 +860,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index fabcf31..75effed 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4104,6 +4104,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 0ab6855..9970170 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,8 +406,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
-		appendStringInfo(&cmd, ", work_mem '%d'",
-						 options->proto.logical.work_mem);
+		if (options->proto.logical.work_mem != -1)
+			appendStringInfo(&cmd, ", work_mem '%d'",
+							 options->proto.logical.work_mem);
+
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
 
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 1f8821c..9307b67 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>
 
 #include "postgres.h"
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index b88b585..ad43ab3 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1149,7 +1149,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index e7df47d..5a379fb 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,7 +139,8 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
@@ -147,6 +148,10 @@ logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -182,8 +187,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -191,6 +196,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -252,13 +261,18 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
+	pq_sendbyte(out, 'D');		/* action DELETE */
+
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
-	pq_sendbyte(out, 'D');		/* action DELETE */
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
 
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
@@ -300,6 +314,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -309,6 +324,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -351,12 +370,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -401,7 +424,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -409,6 +432,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -689,3 +716,119 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgint(in, 4) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're in a streaming block, so it must be valid) */
+	pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+	TransactionId xid;
+
+	xid = pq_getmsgint(in, 4);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (we're committing a streamed transaction, so it must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID (we're aborting a streamed transaction, so it must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
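
For reference, the new stream messages share a deliberately simple framing: a
single action byte ('S', 'E', 'c' or 'A') followed by a few fixed-width
fields. A rough receiver-side dispatch might look like this (illustration
only, not part of the patch; declarations and error handling omitted):

	switch (action)		/* action byte already consumed by the caller */
	{
		case 'S':	/* STREAM START: xid, first_segment flag */
			xid = logicalrep_read_stream_start(s, &first_segment);
			break;
		case 'E':	/* STREAM END: xid */
			xid = logicalrep_read_stream_stop(s);
			break;
		case 'c':	/* STREAM COMMIT: xid, flags, commit_lsn, end_lsn, commit_time */
			xid = logicalrep_read_stream_commit(s, &commit_data);
			break;
		case 'A':	/* STREAM ABORT: xid, subxid */
			logicalrep_read_stream_abort(s, &xid, &subxid);
			break;
	}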
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 14c0ce8..65e47aa 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, the apply logic has to handle
+ * aborts of both the toplevel transaction and its subtransactions. This
+ * is achieved by tracking the file offset of each subtransaction's first
+ * change, which is then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory of the default
+ * tablespace, and the filenames include both the XID of the toplevel
+ * transaction and the OID of the subscription. This is necessary so that
+ * different workers processing a remote transaction with the same XID
+ * don't interfere.
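+ *
+ * For example (hypothetical values), for subscription OID 16394 and remote
+ * toplevel XID 1234 the changes are spooled into "logical-16394-1234.changes"
+ * and the subxact offsets into "logical-16394-1234.subxacts" in that
+ * directory.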
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -59,6 +81,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -66,6 +89,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -105,6 +129,50 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
@@ -114,6 +182,9 @@ static void maybe_reread_subscription(void);
 /* Flags set by signal handlers */
 static volatile sig_atomic_t got_SIGHUP = false;
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -165,6 +236,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -512,6 +619,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the subxact info written at
+	 * the end of the previous streamed segment.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're
+		 * likely aborting one of the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
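+		 *
+		 * As a worked example (hypothetical values): with subxacts entries
+		 * {xid=1001, offset=0} and {xid=1002, offset=8192}, an abort of
+		 * subxact 1002 truncates the changes file back to 8192 bytes and
+		 * leaves nsubxacts = 1.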
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* FIXME optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/* We should not receive aborts for unknown subtransactions. */
+		Assert(found);
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -524,6 +943,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -539,6 +961,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -574,6 +999,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -677,6 +1105,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -797,6 +1228,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -896,6 +1330,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -987,6 +1424,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1084,6 +1537,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1099,6 +1568,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1547,6 +2019,564 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
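+ *
+ * The resulting file layout is simply:
+ *
+ *		uint32		checksum	(CRC32C of the following two parts)
+ *		uint32		nsubxacts
+ *		SubXactInfo	subxacts[nsubxacts]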
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to
+	 * keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen
+	 * in the last call, so we can simply skip it (the offset of its first
+	 * change is already recorded).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
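+	/*
+	 * Remember the current end of the changes file as the offset of this
+	 * subxact's first change; that's where we truncate the file if this
+	 * subxact aborts later (see apply_handle_stream_abort).
+	 */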
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so treat ENOENT as a non-error.
+ *
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Remove the XID from the array - find it and replace it with the
+	 * last element of the array. The array is bound to be fairly small
+	 * (maximum number of in-progress xacts, so max_connections +
+	 * max_prepared_transactions), so a simple linear search is fine.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the free slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect a few
+	 * of them in progress (max_connections + max_prepared_xacts) at any
+	 * time, so the linear search above is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (not including
+ * the length field itself), an action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
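+ *
+ * So each record on disk looks like:
+ *
+ *		int		len		(action byte + message contents)
+ *		char	action
+ *		char	data[len - 1]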
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* SIGHUP: set flag to reload configuration at next convenient time */
 static void
 logicalrep_worker_sighup(SIGNAL_ARGS)
@@ -1727,6 +2757,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.work_mem = MySubscription->workmem;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index cf6e03b..8490ea4 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -45,16 +45,42 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order in which the transactions are sent. So streamed transactions
+ * are handled separately, using the schema_sent flag in ReorderBufferTXN.
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -64,6 +90,7 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
@@ -84,16 +111,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, int *logical_decoding_work_mem)
+						List **publication_names, int *logical_decoding_work_mem,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		work_mem_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -162,6 +199,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*logical_decoding_work_mem = (int)parsed;
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -174,6 +228,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -197,7 +252,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&logical_decoding_work_mem);
+								&logical_decoding_work_mem,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -217,6 +273,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version,
+		 * and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -284,9 +361,42 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those are applied only at commit time (and the
+	 * regular transactions won't see their effects until then), possibly
+	 * in an order we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to re-send the schema after each catalog change,
+		 * and a change may occur after streaming has already started, so we
+		 * have to track new catalog changes somehow.
+		 */
+		schema_sent = txn->is_schema_sent;
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -312,19 +422,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			txn->is_schema_sent = true;
+		else
+			relentry->schema_sent = true;
 	}
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -333,6 +450,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -361,14 +482,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -378,7 +499,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -387,7 +508,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -413,6 +534,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -433,13 +558,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -513,6 +639,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -623,6 +834,34 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
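
To illustrate the output plugin API extension: a third-party plugin opts in
to streaming by filling in the new callbacks, just like pgoutput does in
_PG_output_plugin_init above. A minimal sketch (hypothetical plugin; the
"my_*" functions are placeholders, the callback fields are the ones added by
this patch series):

	void
	_PG_output_plugin_init(OutputPluginCallbacks *cb)
	{
		/* ... regular callbacks (begin, change, commit, ...) ... */

		/* transaction streaming */
		cb->stream_start_cb = my_stream_start;
		cb->stream_stop_cb = my_stream_stop;
		cb->stream_change_cb = my_stream_change;
		cb->stream_truncate_cb = my_stream_truncate;
		cb->stream_abort_cb = my_stream_abort;
		cb->stream_commit_cb = my_stream_commit;
	}

A plugin that leaves these callbacks unset simply keeps the current behavior:
streaming is never enabled for it, and large transactions are spilled to disk
as before.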
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 46e6dd4..c98d476 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,13 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fa75872..a2ae283 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -945,6 +945,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 10ea113..8793676 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -50,6 +50,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	int32		subworkmem;		/* Memory to use to decode changes. */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -76,6 +78,7 @@ typedef struct Subscription
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	int			workmem;		/* Memory to decode changes. */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fe076d8..bc45194 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -944,7 +944,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 3fc430a..bf02cbc 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out, TransactionId xid);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4e68a69..fe6acb4 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -163,6 +163,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			int			work_mem;	/* Memory limit to use for decoding */
+			bool		streaming;	/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1
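
All of the new TAP tests above force the streaming path the same way: set
a very low logical_decoding_work_mem on the publisher and run a single
transaction whose decoded changes exceed it. A minimal manual sketch of
the same idea (table and sizes taken from the tests; any sufficiently
large transaction works):

    -- publisher configured with logical_decoding_work_mem = 64kB
    BEGIN;
    INSERT INTO test_tab SELECT i, md5(i::text)
      FROM generate_series(3, 5000) s(i);
    COMMIT;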

Attachment: v1-0009-Track-statistics-for-streaming.patch (application/octet-stream)
From 9ebaada8dc1630ba7b039da109f14f3eb111ca50 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 12:19:49 +0530
Subject: [PATCH v1 09/13] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 ++++-
 src/backend/replication/logical/reorderbuffer.c | 13 +++++++++++
 src/backend/replication/walsender.c             | 30 ++++++++++++++++++++-----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 +++++++----
 src/include/replication/walsender_private.h     |  5 +++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 89 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a3c5f86..3de62c0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1992,6 +1992,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber after
+      the memory used by logical decoding exceeds
+      <literal>logical_decoding_work_mem</literal>.  Streaming only works with
+      toplevel transactions (subtransactions cannot be streamed independently),
+      so the counter does not get incremented for subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the
+      subscriber.  Transactions may get streamed repeatedly, and this counter
+      gets incremented on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f7800f0..5897611 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -779,7 +779,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0ab3191..83eb4df 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -358,6 +358,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3732,6 +3736,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	PG_END_TRY();
 
 	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't double-count a transaction that was already streamed. */
+	rb->streamTxns += rbtxn_is_streamed(txn) ? 0 : 1;
+
+	/*
 	 * Discard the changes that we just streamed, and mark the transactions
 	 * as streamed (if they contained changes).
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a2ae283..8619a01 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1269,7 +1269,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1290,7 +1290,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or were
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2334,6 +2335,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3235,7 +3239,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3293,6 +3297,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3316,6 +3323,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3402,6 +3412,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* streaming of over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3650,8 +3665,13 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %ld %ld %ld",
-		 rb, rb->spillTxns, rb->spillCount, rb->spillBytes);
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %ld %ld %ld %ld %ld %ld",
+		 rb, rb->spillTxns, rb->spillCount, rb->spillBytes,
+		 rb->streamTxns, rb->streamCount, rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d4fa6dd..b3bbc82 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5166,9 +5166,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 7d08e2f..c183fce 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -511,15 +511,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index a6b3205..7efc332 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c9cc569..9fe3dd5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1955,9 +1955,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1
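
With this patch applied, the new counters show up next to the existing
spill_* columns, so the streaming behaviour can be checked with a plain
query (a sketch; the column names are the ones added by the patch):

    SELECT application_name,
           stream_txns, stream_count, stream_bytes
      FROM pg_stat_replication;

Note that stream_count may exceed stream_txns: a single large transaction
can be streamed repeatedly as it keeps crossing the memory limit, while
stream_txns counts each toplevel transaction only once.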

Attachment: v1-0010-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From 989a8c4e3f0574949f36e2ecaa685548aa7726b1 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v1 10/13] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 40e306a..f41a0e1 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -64,7 +64,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 81547f6..8dfeafc 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1
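
Streaming is enabled per subscription via the streaming option; this
patch simply turns it on across the existing test suite. The user-facing
syntax, mirroring the modified tests (the connection string here is
illustrative; the tests build it from the publisher node's connstr):

    CREATE SUBSCRIPTION tap_sub
      CONNECTION 'host=publisher dbname=postgres'
      PUBLICATION tap_pub
      WITH (streaming = on);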

Attachment: v1-0001-Immediately-WAL-log-assignments.patch (application/octet-stream)
From 949e0782796eef008afd703c85a8eaca6625709d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 09:53:08 +0530
Subject: [PATCH v1 01/13] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction a subxact belongs to, in order to decode all the
changes. Until now that assignment might have been delayed until
commit, due to the caching (GPROC_MAX_CACHED_SUBXIDS), preventing
features that require incremental decoding.

So instead we write the assignment info into WAL immediately, as
part of the next WAL record (to minimize overhead).
---
 src/backend/access/transam/xact.c        | 45 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 +++++++++++++--------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8fe38c3..4a853f3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -5096,6 +5097,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6002,3 +6004,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and the 'assigned' flag must not be set yet */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index aa9dca0..a8a8084 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7f24f0c..4ef2661 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1071,6 +1071,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1109,6 +1110,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index bc532d0..897b755 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 record->toplevel_xid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 9d2899d..5b9740c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d519252..b492d3e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 1bbee38..c37a83d 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -148,6 +148,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -243,6 +245,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 9375e54..bcfba0a 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1
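
To make the effect of 0001 concrete: with wal_level=logical, the first
WAL record written by a subtransaction now piggy-backs the toplevel XID
(XLR_BLOCK_ID_TOPLEVEL_XID), so the decoder can call
ReorderBufferAssignChild immediately instead of waiting for an
XLOG_XACT_ASSIGNMENT record. A sketch of a transaction exercising this
path (t is a placeholder table):

    BEGIN;
    INSERT INTO t VALUES (1);   -- toplevel xact, logged as before
    SAVEPOINT s1;
    INSERT INTO t VALUES (2);   -- the subxact's first record carries
                                -- the toplevel XID
    COMMIT;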

Attachment: v1-0002-Issue-individual-invalidations-with-wal_level-log.patch (application/octet-stream)
From 4ddd0a234ec138105302d62bb9900a8c858a225e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH v1 02/13] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in
memory and writes them only once, at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c          | 52 +++++++++++++++++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 23 +++++++++
 src/backend/replication/logical/reorderbuffer.c | 56 +++++++++++++++++---
 src/backend/utils/cache/inval.c                 | 69 +++++++++++++++++++++++++
 src/include/access/xact.h                       | 18 ++++++-
 src/include/replication/reorderbuffer.h         | 14 +++++
 7 files changed, 231 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 4c411c5..6cfd6af 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,11 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -397,6 +402,14 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs,
+								xlrec->dbId, xlrec->tsId,
+								xlrec->relcacheInitFileInval);
+	}
 }
 
 const char *
@@ -424,7 +437,46 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval)
+{
+	int			i;
+
+	if (relcacheInitFileInval)
+		appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
+						 dbId, tsId);
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+			appendStringInfo(buf, " snapshot %u", msg->sn.relId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4a853f3..a42d11f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6001,6 +6001,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 897b755..9bcefb6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,29 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/* XXX for now we're issuing invalidations one by one */
+				Assert(invals->nmsgs == 1);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->dbId, invals->tsId,
+											 invals->relcacheInitFileInval,
+											 invals->msgs[0]);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 53affeb..b1feff3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -464,6 +464,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -1804,17 +1805,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					txn->is_schema_sent = false;
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2209,6 +2216,38 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn,
+							 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+							 SharedInvalidationMessage msg)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	Assert(xid != InvalidTransactionId);
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.dbId = dbId;
+	change->data.inval.tsId = tsId;
+	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
+	change->data.inval.msg = msg;
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2656,6 +2695,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -2752,6 +2792,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -3027,6 +3068,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index f09e3a9..0682c55 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -104,6 +104,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +211,9 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -489,6 +493,18 @@ RegisterCatcacheInvalidation(int cacheId,
 {
 	AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   cacheId, hashValue, dbId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cc.id = (int8) cacheId;
+		msg.cc.dbId = dbId;
+		msg.cc.hashValue = hashValue;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -501,6 +517,18 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 {
 	AddCatalogInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								  dbId, catId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cat.id = SHAREDINVALCATALOG_ID;
+		msg.cat.dbId = dbId;
+		msg.cat.catId = catId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -511,6 +539,8 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 static void
 RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 {
+	bool		RelcacheInitFileInval = false;
+
 	AddRelcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
 
@@ -529,7 +559,22 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 	 * as well.  Also zap when we are invalidating whole relcache.
 	 */
 	if (relId == InvalidOid || RelationIdIsInInitFile(relId))
+	{
 		transInvalInfo->RelcacheInitFileInval = true;
+		RelcacheInitFileInval = true;
+	}
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.rc.id = SHAREDINVALRELCACHE_ID;
+		msg.rc.dbId = dbId;
+		msg.rc.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, RelcacheInitFileInval);
+	}
 }
 
 /*
@@ -1501,3 +1546,27 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval)
+{
+	xl_xact_invalidations xlrec;
+
+	/* prepare record */
+	memset(&xlrec, 0, sizeof(xlrec));
+	xlrec.dbId = MyDatabaseId;
+	xlrec.tsId = MyDatabaseTableSpace;
+	xlrec.relcacheInitFileInval = relcacheInitFileInval;
+	xlrec.nmsgs = nmsgs;
+
+	/* perform insertion */
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+	XLogRegisterData((char *) msgs,
+					 nmsgs * sizeof(SharedInvalidationMessage));
+	XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b9740c..82d4942 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,22 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ *
+ * XXX Currently nmsgs=1 but that might change in the future.
+ */
+typedef struct xl_xact_invalidations
+{
+	Oid			dbId;			/* MyDatabaseId */
+	Oid			tsId;			/* MyDatabaseTableSpace */
+	bool		relcacheInitFileInval;	/* invalidate relcache init file */
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0867ee9..6a7187b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,16 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			Oid			dbId;	/* MyDatabaseId */
+			Oid			tsId;	/* MyDatabaseTableSpace */
+			bool		relcacheInitFileInval;	/* invalidate relcache init
+												 * file */
+			SharedInvalidationMessage msg;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -448,6 +459,9 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  Oid dbId, Oid tsId, bool relcacheInitFileInval,
+								  SharedInvalidationMessage msg);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1
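
To make the decode path of 0002 concrete: LogLogicalInvalidations writes
an xl_xact_invalidations record carrying a flexible array of messages,
and while the patch currently logs nmsgs = 1 (and the decoder asserts
that), the record format already allows batches. A sketch of a decode
loop that does not hard-code the single-message case - an assumption on
my part, not what the patch does today - would be:

    xl_xact_invalidations *xlrec;
    int         i;

    xlrec = (xl_xact_invalidations *) XLogRecGetData(r);

    /* queue one REORDER_BUFFER_CHANGE_INVALIDATION per message */
    for (i = 0; i < xlrec->nmsgs; i++)
        ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
                                     xlrec->dbId, xlrec->tsId,
                                     xlrec->relcacheInitFileInval,
                                     xlrec->msgs[i]);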

v1-0004-Cleaning-up-of-flags-in-ReorderBufferTXN-structur.patchapplication/octet-stream; name=v1-0004-Cleaning-up-of-flags-in-ReorderBufferTXN-structur.patchDownload
From ad5a84eef45bc215044748f382e0d5d586a502be Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 18:08:37 +0200
Subject: [PATCH v1 04/13] Cleaning up of flags in ReorderBufferTXN structure

---
 src/backend/replication/logical/reorderbuffer.c | 36 ++++++++++++-------------
 src/include/replication/reorderbuffer.h         | 33 ++++++++++++++---------
 2 files changed, 38 insertions(+), 31 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b1feff3..3422939 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -732,7 +732,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_first_lsn < cur_txn->first_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_first_lsn = cur_txn->first_lsn;
 	}
@@ -752,7 +752,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
 	}
@@ -775,7 +775,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
 
 	txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
 
-	Assert(!txn->is_known_as_subxact);
+	Assert(!rbtxn_is_known_subxact(txn));
 	Assert(txn->first_lsn != InvalidXLogRecPtr);
 	return txn;
 }
@@ -835,7 +835,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 
 	if (!new_sub)
 	{
-		if (subtxn->is_known_as_subxact)
+		if (rbtxn_is_known_subxact(subtxn))
 		{
 			/* already associated, nothing to do */
 			return;
@@ -851,7 +851,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 		}
 	}
 
-	subtxn->is_known_as_subxact = true;
+	subtxn->txn_flags |= RBTXN_IS_SUBXACT;
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
@@ -1061,7 +1061,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	{
 		ReorderBufferChange *cur_change;
 
-		if (txn->serialized)
+		if (rbtxn_is_serialized(txn))
 		{
 			/* serialize remaining changes */
 			ReorderBufferSerializeTXN(rb, txn);
@@ -1090,7 +1090,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		{
 			ReorderBufferChange *cur_change;
 
-			if (cur_txn->serialized)
+			if (rbtxn_is_serialized(cur_txn))
 			{
 				/* serialize remaining changes */
 				ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1256,7 +1256,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * they originally were happening inside another subtxn, so we won't
 		 * ever recurse more than one level deep here.
 		 */
-		Assert(subtxn->is_known_as_subxact);
+		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
 		ReorderBufferCleanupTXN(rb, subtxn);
@@ -1304,7 +1304,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/*
 	 * Remove TXN from its containing list.
 	 *
-	 * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
 	 * from the LSN-ordered list of toplevel TXNs.
@@ -1319,7 +1319,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	Assert(found);
 
 	/* remove entries spilled to disk */
-	if (txn->serialized)
+	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
 	/* deallocate */
@@ -1336,7 +1336,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
 		return;
 
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1970,7 +1970,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 			 * final_lsn to that of their last change; this causes
 			 * ReorderBufferRestoreCleanup to do the right thing.
 			 */
-			if (txn->serialized && txn->final_lsn == 0)
+			if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
 			{
 				ReorderBufferChange *last =
 				dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2118,7 +2118,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
 	 * operate on its top-level transaction instead.
 	 */
 	txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 	Assert(txn->base_snapshot == NULL);
@@ -2297,7 +2297,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->has_catalog_changes = true;
+	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2314,7 +2314,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
 	if (txn == NULL)
 		return false;
 
-	return txn->has_catalog_changes;
+	return rbtxn_has_catalog_changes(txn);
 }
 
 /*
@@ -2334,7 +2334,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
 		return false;
 
 	/* a known subtxn? operate on top-level txn instead */
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 
@@ -2522,12 +2522,12 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	rb->spillBytes += size;
 
 	/* Don't consider already serialized transaction. */
-	rb->spillTxns += txn->serialized ? 0 : 1;
+	rb->spillTxns += rbtxn_is_serialized(txn) ? 0 : 1;
 
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
-	txn->serialized = true;
+	txn->txn_flags |= RBTXN_IS_SERIALIZED;
 
 	if (fd != -1)
 		CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5b4be2b..19c7bac 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -169,18 +169,34 @@ typedef struct ReorderBufferChange
 	dlist_node	node;
 } ReorderBufferChange;
 
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT          0x0002
+#define RBTXN_IS_SERIALIZED       0x0004
+
+/* does the txn have catalog changes? */
+#define rbtxn_has_catalog_changes(txn) ((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn)    ((txn)->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk?  It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn)       ((txn)->txn_flags & RBTXN_IS_SERIALIZED)
+
 typedef struct ReorderBufferTXN
 {
+	int     txn_flags;
+
 	/*
 	 * The transactions transaction id, can be a toplevel or sub xid.
 	 */
 	TransactionId xid;
 
-	/* did the TX have catalog changes */
-	bool		has_catalog_changes;
-
 	/* Do we know this is a subxact?  Xid of top-level txn if so */
-	bool		is_known_as_subxact;
 	TransactionId toplevel_xid;
 
 	/*
@@ -249,15 +265,6 @@ typedef struct ReorderBufferTXN
 	uint64		nentries_mem;
 
 	/*
-	 * Has this transaction been spilled to disk?  It's not always possible to
-	 * deduce that fact by comparing nentries with nentries_mem, because e.g.
-	 * subtransactions of a large transaction might get serialized together
-	 * with the parent - if they're restored to memory they'd have
-	 * nentries_mem == nentries.
-	 */
-	bool		serialized;
-
-	/*
 	 * List of ReorderBufferChange structs, including new Snapshots and new
 	 * CommandIds
 	 */
-- 
1.8.3.1
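
The pattern 0004 establishes is that each per-transaction boolean turns
into one bit in txn_flags plus a test macro, so future states (streaming
will need a few) can be added without growing the struct. A sketch with
a hypothetical flag (RBTXN_IS_STREAMED does not exist in this series):

    #define RBTXN_IS_STREAMED      0x0008  /* hypothetical example flag */
    #define rbtxn_is_streamed(txn) ((txn)->txn_flags & RBTXN_IS_STREAMED)

    /* set once the transaction has been streamed downstream */
    txn->txn_flags |= RBTXN_IS_STREAMED;

    /* and tested the same way as the flags introduced by the patch */
    if (rbtxn_is_streamed(txn))
        elog(DEBUG1, "transaction %u was already streamed", txn->xid);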

v1-0005-Gracefully-handle-concurrent-aborts-of-uncommitte.patchapplication/octet-stream; name=v1-0005-Gracefully-handle-concurrent-aborts-of-uncommitte.patchDownload
From 193f2812afa906a07cc5f90a8b872cebbf96402b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:32:34 +0530
Subject: [PATCH v1 05/13] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (and thus made inaccessible through
indexes), among other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions - for example, when decoding prepared
transactions at PREPARE (and not at COMMIT PREPARED, as before) - this
may cause failures when the output plugin consults catalogs (both
system and user-defined).

We handle such failures by raising the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from the system table scan APIs to the backend decoding
the uncommitted transaction. On receipt of this sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 51 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 34 +++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  9 +++--
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++-
 src/include/utils/snapmgr.h                     |  4 +-
 6 files changed, 120 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index fc4ad65..da6a6f3 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb3..2a60a73 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1303,6 +1303,17 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_getnext call")));
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1422,6 +1433,16 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_fetch call")));
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1535,6 +1556,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_hot_search_buffer call")));
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1683,6 +1714,16 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_get_latest_tid call")));
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5481,6 +5522,16 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_finish_speculative call")));
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d..201acfb 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,17 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +525,17 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +662,17 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3422939..bda4a1c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -683,7 +683,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1533,7 +1533,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1784,7 +1784,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1804,7 +1804,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
@@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_CATCH();
 	{
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
 
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 47b0517..9fa1e43 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This is to re-check XID status while accessing catalog.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the transaction is not committed yet. We
+	 * don't check whether the xid aborted; that happens during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 67b07df..9a8f9ce 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1
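
With 0005, a concurrent abort of the transaction being decoded surfaces
as an error with sqlerrcode ERRCODE_TRANSACTION_ROLLBACK, raised from
the systable_* scan APIs. How the decoding caller reacts is left to
later patches; a sketch of the kind of handling I have in mind (an
assumption, not something 0005 implements) looks like:

    PG_CATCH();
    {
        /* assumes CurrentMemoryContext is usable for the copy here */
        ErrorData  *errdata = CopyErrorData();

        if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
        {
            /* concurrent abort: stop decoding this xact, clean up quietly */
            FlushErrorState();
            FreeErrorData(errdata);
        }
        else
        {
            /* any other error is re-thrown as usual */
            FreeErrorData(errdata);
            PG_RE_THROW();
        }
    }
    PG_END_TRY();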

v1-0003-Extend-the-output-plugin-API-with-stream-methods.patchapplication/octet-stream; name=v1-0003-Extend-the-output-plugin-API-with-stream-methods.patchDownload
From d3e02ec7f0485638247e236937e25d62a1f44ab3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v1 03/13] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 6c33c4b..9c77791 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -542,3 +571,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db9686..fc4ad65 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    transaction size and the network bandwidth, the transfer time
+    may significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similarly to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_work_mem</varname> setting. At
+    that point the largest toplevel transaction (measured by amount of memory
+    currently used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 7e06615..b88b585 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require change/commit/abort callbacks. The
+	 * message callback is optional, similarly to regular output plugins. We
+	 * however enable streaming when at least one of the methods is enabled,
+	 * so that we can easily identify missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -863,6 +911,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	/* FIXME ctx->write_location = apply_lsn; */
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	/* FIXME ctx->write_location = apply_lsn; */
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 31c796b..d95d1b9 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -81,6 +81,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index d4ce54f..a305462 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from an in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6a7187b..5b4be2b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -345,6 +345,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -384,6 +430,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
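
For orientation, a minimal sketch of how an output plugin might register
the new callbacks follows. The OutputPluginCallbacks fields and callback
signatures come from the patch above; the my_stream_* stubs and their
empty bodies are hypothetical, and the regular (non-streaming) callbacks
are elided:

#include "postgres.h"
#include "fmgr.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

/* hypothetical stubs -- a real plugin would emit protocol messages here */
static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
}

static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
}

static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)
{
}

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
}

static void
my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 XLogRecPtr commit_lsn)
{
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* ... regular callbacks (startup_cb, begin_cb, change_cb, ...) ... */

	/*
	 * Registering any stream_* callback enables streaming. The change,
	 * abort, commit, start and stop callbacks are then mandatory (the
	 * wrappers error out when one is missing), while stream_message_cb
	 * and stream_truncate_cb remain optional.
	 */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_change_cb = my_stream_change;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
}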

Attachment: v1-0006-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From 95cc2862df84b5bc02ca7ac0a3ab8bd2f95f52f2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:42:31 +0530
Subject: [PATCH v1 06/13] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using about the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds a ReorderBufferTXN pointer in two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
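
In rough terms, the eviction decision in ReorderBufferCheckMemoryLimit
(see the corresponding hunk below) becomes the following; this is a
simplified paraphrase assuming logical_decoding_work_mem has already been
found exceeded, not the literal patch text:

	/* evict the largest transaction, streaming it if the plugin allows */
	if (ReorderBufferCanStream(rb))
		ReorderBufferStreamTXN(rb, ReorderBufferLargestTopTXN(rb));
	else
		ReorderBufferSerializeTXN(rb, ReorderBufferLargestTXN(rb));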
---
 src/backend/access/heap/heapam_visibility.c     |   38 +-
 src/backend/replication/logical/reorderbuffer.c | 1075 ++++++++++++++++++++++-
 src/include/replication/reorderbuffer.h         |   32 +
 3 files changed, 1112 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 3e36467..cf10dd0 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bda4a1c..0ab3191 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -149,6 +149,28 @@ typedef struct ReorderBufferIterTXNState
 	ReorderBufferIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
 } ReorderBufferIterTXNState;
 
+/*
+ * k-way in-order change iteration support structures
+ *
+ * This is a simplified version for streaming, which does not require
+ * serialization to files and only reads changes that are currently in
+ * memory.
+ */
+typedef struct ReorderBufferStreamIterTXNEntry
+{
+	XLogRecPtr	lsn;
+	ReorderBufferChange *change;
+	ReorderBufferTXN *txn;
+}			ReorderBufferStreamIterTXNEntry;
+
+typedef struct ReorderBufferStreamIterTXNState
+{
+	binaryheap *heap;
+	Size		nr_txns;
+	dlist_head	old_change;
+	ReorderBufferStreamIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
+}			ReorderBufferStreamIterTXNState;
+
 /* toast datastructures */
 typedef struct ReorderBufferToastEnt
 {
@@ -213,6 +235,20 @@ static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
 static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
 
+
+/* iterator for streaming (only get data from memory) */
+static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit(
+																		ReorderBuffer *rb,
+																		ReorderBufferTXN *txn);
+
+static ReorderBufferChange *ReorderBufferStreamIterTXNNext(
+							   ReorderBuffer *rb,
+							   ReorderBufferStreamIterTXNState * state);
+
+static void ReorderBufferStreamIterTXNFinish(
+								 ReorderBuffer *rb,
+								 ReorderBufferStreamIterTXNState * state);
+
 /*
  * ---------------------------------------
  * Disk serialization support functions
@@ -227,6 +263,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -235,6 +272,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -362,6 +408,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -759,6 +808,33 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * Check that changes in a (sub)transaction are ordered by LSN
+ * (no-op unless assertions are enabled).
+ */
+static void
+AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -855,6 +931,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -978,7 +1057,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1006,6 +1085,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	cur_txn_i;
 	int32		off;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(rb, txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1020,6 +1102,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(rb, cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1235,6 +1320,210 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 }
 
 /*
+ * Binary heap comparison function (streaming iterator).
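+ * Uses the same inverted ordering as ReorderBufferIterCompare: returning 1
+ * for the lower LSN makes binaryheap_first() yield the entry with the
+ * smallest LSN.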
+ */
+static int
+ReorderBufferStreamIterCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferStreamIterTXNState *state = (ReorderBufferStreamIterTXNState *) arg;
+	XLogRecPtr	pos_a = state->entries[DatumGetInt32(a)].lsn;
+	XLogRecPtr	pos_b = state->entries[DatumGetInt32(b)].lsn;
+
+	if (pos_a < pos_b)
+		return 1;
+	else if (pos_a == pos_b)
+		return 0;
+	return -1;
+}
+
+/*
+ * Allocate & initialize an iterator which iterates in lsn order over a
+ * transaction and all its subtransactions. This version is meant for
+ * streaming of incomplete transactions.
+ */
+static ReorderBufferStreamIterTXNState *
+ReorderBufferStreamIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Size		nr_txns = 0;
+	ReorderBufferStreamIterTXNState *state;
+	dlist_iter	cur_txn_i;
+	int32		off;
+
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(rb, txn);
+
+	/*
+	 * Calculate the size of our heap: one element for every transaction that
+	 * contains changes.  (Besides the transactions already in the reorder
+	 * buffer, we count the one we were directly passed.)
+	 */
+	if (txn->nentries > 0)
+		nr_txns++;
+
+	dlist_foreach(cur_txn_i, &txn->subtxns)
+	{
+		ReorderBufferTXN *cur_txn;
+
+		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(rb, cur_txn);
+
+		if (cur_txn->nentries > 0)
+			nr_txns++;
+	}
+
+	/*
+	 * TODO: Consider adding fastpath for the rather common nr_txns=1 case, no
+	 * need to allocate/build a heap then.
+	 */
+
+	/* allocate iteration state */
+	state = (ReorderBufferStreamIterTXNState *)
+		MemoryContextAllocZero(rb->context,
+							   sizeof(ReorderBufferStreamIterTXNState) +
+							   sizeof(ReorderBufferStreamIterTXNEntry) * nr_txns);
+
+	state->nr_txns = nr_txns;
+	dlist_init(&state->old_change);
+
+	/* allocate heap */
+	state->heap = binaryheap_allocate(state->nr_txns,
+									  ReorderBufferStreamIterCompare,
+									  state);
+
+	/*
+	 * Now insert items into the binary heap, in an unordered fashion.  (We
+	 * will run a heap assembly step at the end; this is more efficient.)
+	 */
+
+	off = 0;
+
+	/* add toplevel transaction if it contains changes */
+	if (txn->nentries > 0)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_head_element(ReorderBufferChange, node,
+										&txn->changes);
+
+		state->entries[off].lsn = cur_change->lsn;
+		state->entries[off].change = cur_change;
+		state->entries[off].txn = txn;
+
+		binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+	}
+
+	/* add subtransactions if they contain changes */
+	dlist_foreach(cur_txn_i, &txn->subtxns)
+	{
+		ReorderBufferTXN *cur_txn;
+
+		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+		if (cur_txn->nentries > 0)
+		{
+			ReorderBufferChange *cur_change;
+
+			cur_change = dlist_head_element(ReorderBufferChange, node,
+											&cur_txn->changes);
+
+			state->entries[off].lsn = cur_change->lsn;
+			state->entries[off].change = cur_change;
+			state->entries[off].txn = cur_txn;
+
+			binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+		}
+	}
+
+	Assert(off == nr_txns);
+
+	/* assemble a valid binary heap */
+	binaryheap_build(state->heap);
+
+	return state;
+}
+
+/*
+ * Return the next change when iterating over a transaction and its
+ * subtransactions.
+ *
+ * Returns NULL when no further changes exist.
+ */
+static ReorderBufferChange *
+ReorderBufferStreamIterTXNNext(ReorderBuffer *rb, ReorderBufferStreamIterTXNState * state)
+{
+	ReorderBufferChange *change;
+	ReorderBufferStreamIterTXNEntry *entry;
+	int32		off;
+
+	/* nothing there anymore */
+	if (state->heap->bh_size == 0)
+		return NULL;
+
+	off = DatumGetInt32(binaryheap_first(state->heap));
+	entry = &state->entries[off];
+
+	/* free memory we might have "leaked" in the previous *Next call */
+	if (!dlist_is_empty(&state->old_change))
+	{
+		change = dlist_container(ReorderBufferChange, node,
+								 dlist_pop_head_node(&state->old_change));
+		ReorderBufferReturnChange(rb, change);
+		Assert(dlist_is_empty(&state->old_change));
+	}
+
+	change = entry->change;
+
+	/*
+	 * update heap with information about which transaction has the next
+	 * relevant change in LSN order
+	 */
+
+	/* there are in-memory changes */
+	if (dlist_has_next(&entry->txn->changes, &entry->change->node))
+	{
+		dlist_node *next = dlist_next_node(&entry->txn->changes, &change->node);
+		ReorderBufferChange *next_change =
+		dlist_container(ReorderBufferChange, node, next);
+
+		/* txn stays the same */
+		state->entries[off].lsn = next_change->lsn;
+		state->entries[off].change = next_change;
+
+		binaryheap_replace_first(state->heap, Int32GetDatum(off));
+		return change;
+	}
+
+	/* ok, no changes there anymore, remove */
+	binaryheap_remove_first(state->heap);
+
+	return change;
+}
+
+/*
+ * Deallocate the iterator
+ */
+static void
+ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
+								 ReorderBufferStreamIterTXNState * state)
+{
+	/* free memory we might have "leaked" in the last *Next call */
+	if (!dlist_is_empty(&state->old_change))
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node,
+								 dlist_pop_head_node(&state->old_change));
+		ReorderBufferReturnChange(rb, change);
+		Assert(dlist_is_empty(&state->old_change));
+	}
+
+	binaryheap_free(state->heap);
+	pfree(state);
+}
+
+/*
  * Cleanup the contents of a transaction, usually after the transaction
  * committed or aborted.
  */
@@ -1327,33 +1616,104 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
  */
 static void
-ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	dlist_iter	iter;
-	HASHCTL		hash_ctl;
+	dlist_mutable_iter iter;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
-	hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
-	hash_ctl.hcxt = rb->context;
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
 
-	/*
-	 * create the hash with the exact number of to-be-stored tuplecids from
-	 * the start
-	 */
-	txn->tuplecid_hash =
-		hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
-					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
 
-	dlist_foreach(iter, &txn->tuplecids)
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
+ * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples whose CID has not been decoded yet. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
+ */
+static void
+ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_iter	iter;
+	HASHCTL		hash_ctl;
+
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+
+	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
+	hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
+	hash_ctl.hcxt = rb->context;
+
+	/*
+	 * create the hash with the exact number of to-be-stored tuplecids from
+	 * the start
+	 */
+	txn->tuplecid_hash =
+		hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	dlist_foreach(iter, &txn->tuplecids)
 	{
 		ReorderBufferTupleCidKey key;
 		ReorderBufferTupleCidEnt *ent;
@@ -1403,6 +1763,16 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 }
 
+/*
+ * Destroy the (relfilenode, ctid) -> (cmin, cmax) hash, if one was built.
+ */
+static void
+ReorderBufferDestroyTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+}
+
 /*
  * Copy a provided snapshot so we can modify it privately. This is needed so
  * that catalog modifying transactions can look into intermediate catalog
@@ -1476,6 +1846,19 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 		SnapBuildSnapDecRefcount(snap);
 }
 
+/*
+ * Stream the remaining in-memory changes of a previously streamed
+ * transaction at commit time, send the stream_commit message, and
+ * clean the transaction up.
+ */
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
+
+	ReorderBufferStreamTXN(rb, txn);
+
+	rb->stream_commit(rb, txn, txn->final_lsn);
+
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
 /*
  * Perform the replay of a transaction and its non-aborted subtransactions.
  *
@@ -1515,6 +1898,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
 	 * If this transaction has no snapshot, it didn't make any changes to the
 	 * database, so there's nothing to decode.  Note that
 	 * ReorderBufferCommitChild will have transferred any snapshots from
@@ -1549,6 +1948,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
+		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
@@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			if (prev_lsn != InvalidXLogRecPtr)
+				Assert(prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1930,6 +2340,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2014,6 +2431,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2149,8 +2573,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2158,6 +2591,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2169,19 +2603,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2210,6 +2653,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2285,6 +2729,9 @@ ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	for (i = 0; i < txn->ninvalidations; i++)
 		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+
+	/* Invalidate current schema as well */
+	txn->is_schema_sent = false;
 }
 
 /*
@@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * We read catalog changes from WAL, which are not yet sent, so
+	 * invalidate the current schema so that the output plugin can
+	 * resend it.
+	 */
+	txn->is_schema_sent = false;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+	{
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		txn->toptxn->is_schema_sent = false;
+	}
 }
 
 /*
@@ -2403,6 +2867,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (with streaming we don't update the
+ * memory accounting of subtransactions, so their size is always 0). But we
+ * can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2422,15 +2918,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2723,6 +3250,498 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/*
+ * Does the output plugin support streaming of in-progress transactions?
+ * (Decided once in StartupDecodingContext, see ctx->streaming.)
+ */
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes to stream
+ * (it might have been streamed right before the commit, in which case
+ * the commit would attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+	bool		using_subtxn;
+	Size		streamed = 0;
+	ReorderBufferStreamIterTXNState *volatile iterstate = NULL;
+
+	/*
+	 * If this is a subxact, we need to stream the top-level transaction
+	 * instead.
+	 */
+	if (txn->toptxn)
+	{
+		ReorderBufferStreamTXN(rb, txn->toptxn);
+		return;
+	}
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+
+			if (subtxn->base_snapshot != NULL &&
+				(txn->base_snapshot == NULL ||
+				 txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
+			{
+				txn->base_snapshot = subtxn->base_snapshot;
+				txn->base_snapshot_lsn = subtxn->base_snapshot_lsn;
+				subtxn->base_snapshot = NULL;
+				subtxn->base_snapshot_lsn = InvalidXLogRecPtr;
+			}
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have the snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+		snapshot_now = txn->snapshot_now;
+
+		/*
+		 * TOCHECK: We have to rebuild the historic snapshot to be sure it
+		 * includes all information about subtransactions, which could have
+		 * arrived after streaming started.
+		 */
+		if (!txn->is_schema_sent)
+			snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+												 txn, command_id);
+	}
+
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
+	ReorderBufferBuildTupleCidHash(rb, txn);
+
+	/* setup the initial snapshot */
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
+
+	/*
+	 * Decoding needs access to syscaches et al., which in turn use
+	 * heavyweight locks and such. Thus we need to have enough state around to
+	 * keep track of those.  The easiest way is to simply use a transaction
+	 * internally.  That also allows us to easily enforce that nothing writes
+	 * to the database by checking for xid assignments.
+	 *
+	 * When we're called via the SQL SRF there's already a transaction
+	 * started, so start an explicit subtransaction there.
+	 */
+	using_subtxn = IsTransactionOrTransactionBlock();
+
+	PG_TRY();
+	{
+		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+		ReorderBufferChange *change;
+		ReorderBufferChange *specinsert = NULL;
+
+		if (using_subtxn)
+			BeginInternalSubTransaction("stream");
+		else
+			StartTransactionCommand();
+
+		/* start streaming this chunk of transaction */
+		rb->stream_start(rb, txn);
+
+		iterstate = ReorderBufferStreamIterTXNInit(rb, txn);
+		while ((change = ReorderBufferStreamIterTXNNext(rb, iterstate)) != NULL)
+		{
+			Relation	relation = NULL;
+			Oid			reloid;
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			if (prev_lsn != InvalidXLogRecPtr)
+				Assert(prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* we're going to stream this change */
+			streamed++;
+
+			switch (change->action)
+			{
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+
+					/*
+					 * Confirmation for speculative insertion arrived. Simply
+					 * use as a normal record. It'll be cleaned up at the end
+					 * of INSERT processing.
+					 */
+					Assert(specinsert->data.tp.oldtuple == NULL);
+					change = specinsert;
+					change->action = REORDER_BUFFER_CHANGE_INSERT;
+
+					/* intentionally fall through */
+				case REORDER_BUFFER_CHANGE_INSERT:
+				case REORDER_BUFFER_CHANGE_UPDATE:
+				case REORDER_BUFFER_CHANGE_DELETE:
+					Assert(snapshot_now);
+
+					reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
+												change->data.tp.relnode.relNode);
+
+					/*
+					 * Catalog tuple without data, emitted while catalog was
+					 * in the process of being rewritten.
+					 */
+					if (reloid == InvalidOid &&
+						change->data.tp.newtuple == NULL &&
+						change->data.tp.oldtuple == NULL)
+						goto change_done;
+					else if (reloid == InvalidOid)
+						elog(ERROR, "could not map filenode \"%s\" to relation OID",
+							 relpathperm(change->data.tp.relnode,
+										 MAIN_FORKNUM));
+
+					relation = RelationIdGetRelation(reloid);
+
+					if (relation == NULL)
+						elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
+							 reloid,
+							 relpathperm(change->data.tp.relnode,
+										 MAIN_FORKNUM));
+
+					if (!RelationIsLogicallyLogged(relation))
+						goto change_done;
+
+					/*
+					 * For now ignore sequence changes entirely. Most of the
+					 * time they don't log changes using records we
+					 * understand, so it doesn't make sense to handle the few
+					 * cases we do.
+					 */
+					if (relation->rd_rel->relkind == RELKIND_SEQUENCE)
+						goto change_done;
+
+					/* user-triggered change */
+					if (!IsToastRelation(relation))
+					{
+						ReorderBufferToastReplace(rb, txn, relation, change);
+						rb->stream_change(rb, txn, relation, change);
+
+						/*
+						 * Only clear reassembled toast chunks if we're sure
+						 * they're not required anymore. The creator of the
+						 * tuple tells us.
+						 */
+						if (change->data.tp.clear_toast_afterwards)
+							ReorderBufferToastReset(rb, txn);
+					}
+					/* we're not interested in toast deletions */
+					else if (change->action == REORDER_BUFFER_CHANGE_INSERT)
+					{
+						/*
+						 * Need to reassemble the full toasted Datum in
+						 * memory, to ensure the chunks don't get reused till
+						 * we're done; remove it from the list of this
+						 * transaction's changes. Otherwise it will get
+						 * freed/reused while restoring spooled data from
+						 * disk.
+						 */
+						dlist_delete(&change->node);
+						ReorderBufferToastAppendChunk(rb, txn, relation,
+													  change);
+					}
+
+			change_done:
+
+					/*
+					 * Either speculative insertion was confirmed, or it was
+					 * unsuccessful and the record isn't needed anymore.
+					 */
+					if (specinsert != NULL)
+					{
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+
+					if (relation != NULL)
+					{
+						RelationClose(relation);
+						relation = NULL;
+					}
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+
+					/*
+					 * Speculative insertions are dealt with by delaying the
+					 * processing of the insert until the confirmation record
+					 * arrives. For that we simply unlink the record from the
+					 * chain, so it does not get freed/reused while restoring
+					 * spooled data from disk.
+					 *
+					 * This is safe in the face of concurrent catalog changes
+					 * because the relevant relation can't be changed between
+					 * speculative insertion and confirmation due to
+					 * CheckTableNotInUse() and locking.
+					 */
+
+					/* clear out a pending (and thus failed) speculation */
+					if (specinsert != NULL)
+					{
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+
+					/* and memorize the pending insertion */
+					dlist_delete(&change->node);
+					specinsert = change;
+					break;
+
+				case REORDER_BUFFER_CHANGE_TRUNCATE:
+					{
+						int			i;
+						int			nrelids = change->data.truncate.nrelids;
+						int			nrelations = 0;
+						Relation   *relations;
+
+						relations = palloc0(nrelids * sizeof(Relation));
+						for (i = 0; i < nrelids; i++)
+						{
+							Oid			relid = change->data.truncate.relids[i];
+							Relation	relation;
+
+							relation = RelationIdGetRelation(relid);
+
+							if (relation == NULL)
+								elog(ERROR, "could not open relation with OID %u", relid);
+
+							if (!RelationIsLogicallyLogged(relation))
+								continue;
+
+							relations[nrelations++] = relation;
+						}
+
+						rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+						for (i = 0; i < nrelations; i++)
+							RelationClose(relations[i]);
+
+						break;
+					}
+
+				case REORDER_BUFFER_CHANGE_MESSAGE:
+
+					rb->stream_message(rb, txn, change->lsn, true,
+									   change->data.msg.prefix,
+									   change->data.msg.message_size,
+									   change->data.msg.message);
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+					/* get rid of the old */
+					TeardownHistoricSnapshot(false);
+
+					if (snapshot_now->copied)
+					{
+						ReorderBufferFreeSnap(rb, snapshot_now);
+						snapshot_now =
+							ReorderBufferCopySnap(rb, change->data.snapshot,
+												  txn, command_id);
+					}
+
+					/*
+					 * Restored from disk, need to be careful not to double
+					 * free. We could introduce refcounting for that, but for
+					 * now this seems infrequent enough not to care.
+					 */
+					else if (change->data.snapshot->copied)
+					{
+						snapshot_now =
+							ReorderBufferCopySnap(rb, change->data.snapshot,
+												  txn, command_id);
+					}
+					else
+					{
+						snapshot_now = change->data.snapshot;
+					}
+
+					/*
+					 * TOCHECK: Snapshot changed, so invalidate the current
+					 * schema to reflect possible catalog changes.
+					 */
+					txn->is_schema_sent = false;
+
+					/* and continue with the new one */
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+					Assert(change->data.command_id != InvalidCommandId);
+
+					if (command_id < change->data.command_id)
+					{
+						command_id = change->data.command_id;
+
+						if (!snapshot_now->copied)
+						{
+							/* we don't use the global one anymore */
+							snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+																 txn, command_id);
+						}
+
+						snapshot_now->curcid = command_id;
+
+						TeardownHistoricSnapshot(false);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
+					}
+
+					break;
+
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					txn->is_schema_sent = false;
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+					elog(ERROR, "tuplecid value in changequeue");
+					break;
+			}
+		}
+
+		/*
+		 * If there's a speculative insertion remaining, just clean it up; it
+		 * can't have been successful, otherwise we'd have gotten a confirmation
+		 * record.
+		 */
+		if (specinsert)
+		{
+			ReorderBufferReturnChange(rb, specinsert);
+			specinsert = NULL;
+		}
+
+		/* clean up the iterator */
+		ReorderBufferStreamIterTXNFinish(rb, iterstate);
+		iterstate = NULL;
+
+		/* call stream_stop callback */
+		rb->stream_stop(rb, txn);
+
+		/* this is just a sanity check against bad output plugin behaviour */
+		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
+			elog(ERROR, "output plugin used XID %u",
+				 GetCurrentTransactionId());
+
+		/* remember the command ID and snapshot for the streaming run */
+		txn->command_id = command_id;
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+
+		/* cleanup */
+		TeardownHistoricSnapshot(false);
+
+		/*
+		 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+		 * any memory. We could also keep the hash table and update it with
+		 * new ctid values, but this seems simpler and good enough for now.
+		 */
+		ReorderBufferDestroyTupleCidHash(rb, txn);
+
+		/*
+		 * Aborting the current (sub-)transaction as a whole has the right
+		 * semantics. We want all locks acquired in here to be released, not
+		 * reassigned to the parent and we do not want any database access
+		 * have persistent effects.
+		 */
+		AbortCurrentTransaction();
+
+		/* make sure there's no cache pollution */
+		ReorderBufferExecuteInvalidations(rb, txn);
+
+		if (using_subtxn)
+			RollbackAndReleaseCurrentSubTransaction();
+	}
+	PG_CATCH();
+	{
+		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+		if (iterstate)
+			ReorderBufferStreamIterTXNFinish(rb, iterstate);
+
+		TeardownHistoricSnapshot(true);
+
+		/*
+		 * Force cache invalidation to happen outside of a valid transaction
+		 * to prevent catalog access as we just caught an error.
+		 */
+		AbortCurrentTransaction();
+
+		/* make sure there's no cache pollution */
+		ReorderBufferExecuteInvalidations(rb, txn);
+
+		if (using_subtxn)
+			RollbackAndReleaseCurrentSubTransaction();
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/*
+	 * Discard the changes that we just streamed, and mark the transactions
+	 * as streamed (if they contained changes).
+	 */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 19c7bac..7d08e2f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* does the txn have catalog changes */
 #define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -187,6 +188,20 @@ typedef struct ReorderBufferChange
  */
 #define rbtxn_is_serialized(txn)       (txn->txn_flags & RBTXN_IS_SERIALIZED)
 
+/*
+ * Has this transaction been streamed to the downstream? Similarly to spilling
+ * to disk, it's not trivial to deduce this from nentries and nentries_mem,
+ * for various reasons. For example, all changes may be in subtransactions
+ * in which case we'd have nentries==0 for the toplevel one, and it'd say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.
+ *
+ * Note: We never stream and serialize a transaction at the same time (we
+ * only spill to disk when streaming is not supported by the plugin),
+ * so only one of those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn)         (txn->txn_flags & RBTXN_IS_STREAMED)
+
 typedef struct ReorderBufferTXN
 {
 	int     txn_flags;
@@ -222,6 +237,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Has the schema been sent for this transaction to the output plugin?
+	 */
+	bool		is_schema_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -252,6 +277,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
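
To illustrate the txn_flags mechanics from the header diff above, here is a
minimal standalone sketch of the bitmask pattern; ToyTxn and main() are
illustrative stand-ins, not PostgreSQL code, though the flag values and
macro shapes mirror the patch:

#include <assert.h>
#include <stdio.h>

/* flag bits, matching the values in the patch above */
#define RBTXN_IS_SERIALIZED 0x0004
#define RBTXN_IS_STREAMED   0x0008

/* illustrative stand-in for ReorderBufferTXN */
typedef struct ToyTxn
{
	int			txn_flags;
} ToyTxn;

#define rbtxn_is_serialized(txn) ((txn)->txn_flags & RBTXN_IS_SERIALIZED)
#define rbtxn_is_streamed(txn)   ((txn)->txn_flags & RBTXN_IS_STREAMED)

int
main(void)
{
	ToyTxn		txn = {0};

	/* mark the toplevel transaction as streamed */
	txn.txn_flags |= RBTXN_IS_STREAMED;

	/*
	 * Streaming and spilling to disk are mutually exclusive (see the
	 * comment in the patch), so at most one of the two bits may be set.
	 */
	assert(!(rbtxn_is_streamed(&txn) && rbtxn_is_serialized(&txn)));

	printf("streamed: %d, serialized: %d\n",
		   rbtxn_is_streamed(&txn) ? 1 : 0,
		   rbtxn_is_serialized(&txn) ? 1 : 0);
	return 0;
}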

v1-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
From 5bfee24870b38fb1b60b1c0806f98fd289edabc6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH v1 11/13] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 83eb4df..410da36 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1544,6 +1544,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
+		/* make sure subtxn has final_lsn */
+		if (subtxn->final_lsn == InvalidXLogRecPtr)
+			subtxn->final_lsn = txn->final_lsn;
+
 		/*
 		 * Subtransactions are always associated to the toplevel TXN, even if
 		 * they originally were happening inside another subtxn, so we won't
-- 
1.8.3.1

v1-0012-Add-TAP-test-for-streaming-vs.-DDL.patch
From 739e742dfe7b453857d6ef4d354cfbfe1d25c111 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v1 12/13] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v1-0013-Extend-handling-of-concurrent-aborts-for-streamin.patch
From 39df08b4bbe216f2fe1a386c5afd63775565099a Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 22 Nov 2019 12:43:38 +0530
Subject: [PATCH v1 13/13] Extend handling of concurrent aborts for streaming
 transaction

---
 src/backend/replication/logical/reorderbuffer.c | 38 +++++++++++++++++++++++--
 src/include/replication/reorderbuffer.h         |  5 ++++
 2 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 410da36..710a22f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2350,9 +2350,9 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 
 	/*
 	 * When the (sub)transaction was streamed, notify the remote node
-	 * about the abort.
+	 * about the abort only if we have sent any data for this transaction.
 	 */
-	if (rbtxn_is_streamed(txn))
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
 		rb->stream_abort(rb, txn, lsn);
 
 	/* cosmetic... */
@@ -3281,6 +3281,7 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	volatile CommandId command_id;
 	bool		using_subtxn;
 	Size		streamed = 0;
+	MemoryContext ccxt = CurrentMemoryContext;
 	ReorderBufferStreamIterTXNState *volatile iterstate = NULL;
 
 	/*
@@ -3411,6 +3412,13 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			/* we're going to stream this change */
 			streamed++;
 
+			/*
+			 * Set CheckXidAlive to the current (sub)xid to which this
+			 * change belongs, so that we can detect the abort while we
+			 * are decoding.
+			 */
+			CheckXidAlive = change->txn->xid;
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -3472,6 +3480,10 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 						ReorderBufferToastReplace(rb, txn, relation, change);
 						rb->stream_change(rb, txn, relation, change);
 
+						/* Remember that we have sent some data for this txn. */
+						if (!change->txn->any_data_sent)
+							change->txn->any_data_sent = true;
+
 						/*
 						 * Only clear reassembled toast chunks if we're sure
 						 * they're not required anymore. The creator of the
@@ -3654,6 +3666,8 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 					 * about the message itself?
 					 */
 					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+
+					/* Invalidate current schema as well */
 					txn->is_schema_sent = false;
 					break;
 
@@ -3717,6 +3731,9 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferStreamIterTXNFinish(rb, iterstate);
@@ -3735,7 +3752,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		PG_RE_THROW();
+		/* re-throw only if it's not an abort */
+		if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+		else
+		{
+			/* remember the command ID and snapshot for the streaming run */
+			txn->command_id = command_id;
+			txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+													  txn, command_id);
+			rb->stream_stop(rb, txn);
+
+			FlushErrorState();
+		}
 	}
 	PG_END_TRY();
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index c183fce..1db6da6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -242,6 +242,11 @@ typedef struct ReorderBufferTXN
 	bool		is_schema_sent;
 
 	/*
+	 * Have we sent any changes for this transaction to the output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
 	 * Toplevel transaction for this subxact (NULL for top-level).
 	 */
 	struct ReorderBufferTXN *toptxn;
-- 
1.8.3.1
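
To make the error-handling flow of 0013 easier to follow outside the diff
context, here is a condensed sketch of the pattern it adds around the
streaming loop. This is illustrative only: it compiles only against the
backend headers, and decode_changes() is a made-up stand-in for the loop
body of ReorderBufferStreamTXN:

#include "postgres.h"

/* stand-in for the streaming loop body; may ERROR on a concurrent abort */
extern void decode_changes(void);

static void
stream_with_abort_handling(void)
{
	MemoryContext ccxt = CurrentMemoryContext;

	PG_TRY();
	{
		decode_changes();
	}
	PG_CATCH();
	{
		/* inspect the error before deciding whether to re-throw */
		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
		ErrorData  *errdata = CopyErrorData();

		/* ... clean up iterator, snapshot, subtransaction, etc. ... */

		if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
		{
			/* a genuine error: propagate it as usual */
			MemoryContextSwitchTo(ecxt);
			PG_RE_THROW();
		}

		/*
		 * Concurrent abort: the transaction we were decoding is going
		 * away anyway, so swallow the error and stop the stream cleanly.
		 */
		FlushErrorState();
	}
	PG_END_TRY();
}

The key point is that ERRCODE_TRANSACTION_ROLLBACK lets the decoder
distinguish an expected concurrent abort from a genuine failure.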

#146Michael Paquier
michael@paquier.xyz
In reply to: Dilip Kumar (#145)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:

I have rebased the patch on the latest head and also fixed the issue of
"concurrent abort handling of the (sub)transaction." and attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set. I have added the version number so that we
can track the changes.

The patch has rotted a bit and no longer applies. Could you
please send a rebased version? I have moved it to the next CF, waiting
on author.
--
Michael

#147Dilip Kumar
dilipbalaut@gmail.com
In reply to: Michael Paquier (#146)
13 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:

I have rebased the patch on the latest head and also fixed the issue of
"concurrent abort handling of the (sub)transaction." and attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set. I have added the version number so that we
can track the changes.

The patch has rotten a bit and does not apply anymore. Could you
please send a rebased version? I have moved it to next CF, waiting on
author.

I have rebased the patch set on the latest head.

Apart from this, there is one issue reported by my colleague Vignesh.
The issue is that if we use more than two relations in a transaction
then there is an error on the standby (no relation map entry for remote
relation ID 16390). After analyzing this I have found that for
streamed transactions an "is_schema_sent" flag is kept in
ReorderBufferTXN. I think that is done so that we can re-send the
schema for each transaction stream, so that if any subtransaction gets
aborted we don't lose the logical WAL for that schema. But this
solution has introduced a very basic issue: if a transaction operates
on more than one relation, then after sending the schema for the first
relation it will mark the flag true, and the schema for the subsequent
relations will never be sent. I am still working on finding a better
solution for this; if anyone has any opinion/solution about this, feel
free to suggest.
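
Purely as an illustration of one possible direction (none of this is in
the posted patches, and RelSchemaEntry and reset_schema_sent are made-up
names): track the "schema sent" state per relation instead of per
transaction, e.g. in the output plugin's per-relation cache, and reset
all entries whenever a new stream starts. A sketch, assuming a dynahash
table keyed by relation OID:

#include "postgres.h"
#include "utils/hsearch.h"

/* hypothetical per-relation cache entry; pgoutput keeps similar state */
typedef struct RelSchemaEntry
{
	Oid			relid;			/* hash key: relation OID */
	bool		schema_sent;	/* schema sent in the current stream? */
} RelSchemaEntry;

/*
 * Hypothetical helper: on stream start (or restart), mark the schema of
 * every cached relation as not-yet-sent, so each relation touched by the
 * stream gets its schema re-sent exactly once.
 */
static void
reset_schema_sent(HTAB *relcache)
{
	HASH_SEQ_STATUS hash_seq;
	RelSchemaEntry *entry;

	hash_seq_init(&hash_seq, relcache);
	while ((entry = (RelSchemaEntry *) hash_seq_search(&hash_seq)) != NULL)
		entry->schema_sent = false;
}

This only demonstrates why per-transaction tracking is insufficient once
a stream touches several relations; the eventual fix may well look
different.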

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0008-Add-support-for-streaming-to-built-in-replication.patch
From 6aa8ffd27af999e878f1fd53f83d3a70dfbec0ac Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:53:58 +0530
Subject: [PATCH v2 08/13] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    5 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   60 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    8 +-
 src/backend/replication/logical/launcher.c         |    2 +
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  157 ++-
 src/backend/replication/logical/worker.c           | 1031 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  263 ++++-
 src/backend/replication/slotfuncs.c                |    7 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2027 insertions(+), 42 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 6dfb2e4..e1fb907 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 91790b0..d9abf5e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -218,6 +218,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 2a27648..15a6f5a 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->workmem = subform->subworkmem;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index fbb4473..b2b93d6 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
 						   bool *refresh, int *logical_wm,
-						   bool *logical_wm_given)
+						   bool *logical_wm_given, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -92,6 +93,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*refresh = true;
 	if (logical_wm)
 		*logical_wm_given = false;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -186,6 +189,26 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 
 			*logical_wm_given = true;
 			*logical_wm = defGetInt32(defel);
+
+			/*
+			 * Check that the value is not smaller than 64kB (which is
+			 * the minimum value for logical_work_mem).
+			 */
+			if (*logical_wm < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("%d is outside the valid range for parameter \"work_mem\" (64 .. 2147483647)",
+								*logical_wm)));
+		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
 		}
 		else
 			ereport(ERROR,
@@ -332,6 +355,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	int			logical_wm;
 	bool		logical_wm_given;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -348,7 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL, &logical_wm, &logical_wm_given);
+							   NULL, &logical_wm, &logical_wm_given,
+							   &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -430,7 +456,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 		values[Anum_pg_subscription_subworkmem - 1] =
 			Int32GetDatum(logical_wm);
 	else
-		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(-1);
+
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
 
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
@@ -692,11 +726,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				int			logical_wm;
 				bool		logical_wm_given;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
 										   NULL, &synchronous_commit, NULL,
-										   &logical_wm, &logical_wm_given);
+										   &logical_wm, &logical_wm_given,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -728,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subworkmem - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -740,7 +784,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
 										   NULL, NULL, NULL, NULL, NULL,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -778,7 +822,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh, NULL, NULL);
+										   NULL, &refresh, NULL, NULL,
+										   NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -815,7 +860,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index fabcf31..75effed 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4104,6 +4104,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 0ab6855..9970170 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,8 +406,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
-		appendStringInfo(&cmd, ", work_mem '%d'",
-						 options->proto.logical.work_mem);
+		if (options->proto.logical.work_mem != -1)
+			appendStringInfo(&cmd, ", work_mem '%d'",
+							 options->proto.logical.work_mem);
+
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
 
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 4643af9..15e7140 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>
 
 #include "postgres.h"
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index b88b585..ad43ab3 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1149,7 +1149,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index e7df47d..5a379fb 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,7 +139,8 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
@@ -147,6 +148,10 @@ logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -182,8 +187,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -191,6 +196,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -252,13 +261,18 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
+	pq_sendbyte(out, 'D');		/* action DELETE */
+
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
-	pq_sendbyte(out, 'D');		/* action DELETE */
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
 
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
@@ -300,6 +314,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -309,6 +324,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -351,12 +370,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -401,7 +424,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -409,6 +432,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -689,3 +716,119 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgint(in, 4) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're in a stream, so it must be valid) */
+	pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+	TransactionId xid;
+
+	xid = pq_getmsgint(in, 4);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (the transaction was streamed, so it must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID (the transaction was streamed, so it must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ebb976c..62be080 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions has
+ * to deal with aborts of both the toplevel transaction and its subtransactions.
+ * This is achieved by tracking offsets for subtransactions, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in /tmp by default, and the filenames include both
+ * the XID of the toplevel transaction and OID of the subscription. This
+ * is necessary so that different workers processing a remote transaction
+ * with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -59,6 +81,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -66,6 +89,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -105,6 +129,50 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
@@ -114,6 +182,9 @@ static void maybe_reread_subscription(void);
 /* Flags set by signal handlers */
 static volatile sig_atomic_t got_SIGHUP = false;
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -165,6 +236,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -528,6 +635,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the existing subxact info.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* FIXME optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/* We should not receive aborts for unknown subtransactions. */
+		Assert(found);
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the
+	 * correct position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -540,6 +959,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -555,6 +977,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -590,6 +1015,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -693,6 +1121,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -813,6 +1244,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -912,6 +1346,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1003,6 +1440,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1100,6 +1553,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1115,6 +1584,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1563,6 +2035,564 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include a CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 *
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen
+	 * in the last call, so we can simply skip it (the offset of its first
+	 * change is already recorded).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	/*
+	 * Check subxacts == NULL rather than nsubxacts == 0 here, because
+	 * subxact_info_read() may have left behind an allocated but empty
+	 * array (when the file contained no subxacts), which we can keep
+	 * using instead of leaking it.
+	 */
+	if (subxacts == NULL)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
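+	/*
+	 * Remember the offset of this subxact's first change in the changes
+	 * file, i.e. the current end of the file.
+	 */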
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
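+	/*
+	 * For the default tablespace this produces something like
+	 * "base/pgsql_tmp/logical-<subid>-<xid>.subxacts".
+	 */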
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Remove the XID from the array - find its index and replace the entry
+	 * with the last element. The array is bound to be fairly small (at most
+	 * the number of in-progress xacts, i.e. max_connections +
+	 * max_prepared_transactions), so we simply loop through the array to
+	 * find the index of the XID.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted - we only expect a few of them in
+	 * progress (max_connections + max_prepared_transactions), so linear
+	 * search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called while in a "streaming" block, i.e. between
+ * the stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (not counting
+ * the length field itself), an action code (identifying the message
+ * type), and the message contents (without the subxact TransactionId).
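+ *
+ * So each on-disk record looks like this:
+ *
+ *    length  (int)  - size of the action and data, excluding this field
+ *    action  (char) - message type
+ *    data           - the message contents, without the leading XID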
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* SIGHUP: set flag to reload configuration at next convenient time */
 static void
 logicalrep_worker_sighup(SIGNAL_ARGS)
@@ -1743,6 +2773,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.work_mem = MySubscription->workmem;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index cf6e03b..8490ea4 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -45,16 +45,42 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order the transactions are sent in. So streamed transactions are
+ * handled separately, using the schema_sent flag in ReorderBufferTXN.
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -64,6 +90,7 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
@@ -84,16 +111,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
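+	/*
+	 * Note that stream_change_cb and stream_truncate_cb reuse the regular
+	 * change/truncate callbacks - those check in_streaming and, when set,
+	 * include the XID of the (sub)transaction in the messages.
+	 */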
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, int *logical_decoding_work_mem)
+						List **publication_names, int *logical_decoding_work_mem,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		work_mem_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -162,6 +199,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*logical_decoding_work_mem = (int)parsed;
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -174,6 +228,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -197,7 +252,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&logical_decoding_work_mem);
+								&logical_decoding_work_mem,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -217,6 +273,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version,
+		 * and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -284,9 +361,42 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's the top-level transaction or a subxact (the top-level XID
+	 * was already sent at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because they may be applied only later (and the regular
+	 * transactions won't see their effects until then), and possibly in an
+	 * order we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change, and
+		 * such a change may occur when streaming has already started, so we
+		 * have to track new catalog changes somehow.
+		 */
+		schema_sent = txn->is_schema_sent;
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -312,19 +422,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			txn->is_schema_sent = true;
+		else
+			relentry->schema_sent = true;
 	}
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -333,6 +450,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -361,14 +482,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -378,7 +499,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -387,7 +508,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -413,6 +534,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -433,13 +558,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -513,6 +639,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify the downstream to discard the streamed transaction (or one of
+ * its subtransactions, when only a subxact is rolled back).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify the downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
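+	/*
+	 * Tell the downstream whether this is the first stream for this
+	 * transaction - if it has not been streamed before, the subscriber
+	 * needs to set up its per-transaction state (e.g. create the changes
+	 * file).
+	 */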
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -623,6 +834,34 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
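+ *
+ * XXX The xid parameter is currently unused - we walk all entries in the
+ * cache, not just those touched by the given transaction.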
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 46e6dd4..c98d476 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,13 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 8bafa65..0837264 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -968,6 +968,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 10ea113..8793676 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -50,6 +50,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	int32		subworkmem;		/* Memory to use to decode changes. */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -76,6 +78,7 @@ typedef struct Subscription
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	int			workmem;		/* Memory to decode changes. */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fe076d8..bc45194 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -944,7 +944,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 3fc430a..bf02cbc 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out, TransactionId xid);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1db706a..3d19b5d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -163,6 +163,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			int			work_mem;	/* Memory limit to use for decoding */
+			bool		streaming;	/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
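+# 2 initial + 4998 inserted rows, minus the 1666 rows deleted (every
+# third one), leaves 3334 rows.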
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
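+# Every id in 1..2500 that is not a multiple of 3 survives the deletes:
+# 2500 - 833 = 1667 rows.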
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
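+# c is non-NULL for rows inserted after the first ALTER (4..2002, i.e.
+# 1999 rows), d for rows 1001..2002 (1002 rows), and e only for row 2002.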
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check replicated data reflects the DDL in the streamed transaction');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
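+# Only the rows inserted before SAVEPOINT s1 and after the final
+# ROLLBACK TO s1 survive: 2 + 498 + 500 = 1000 rows, with column c
+# never populated.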
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rollback to savepoint was reflected on subscriber');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
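+# ROLLBACK TO s1 keeps column c (the ALTER ran before the savepoint) and
+# discards everything after it; the final INSERT then fills c for 500
+# rows: 2 + 498 + 500 = 1000 rows, 500 of them with c set.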
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rollback to savepoint was reflected on subscriber');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v2-0009-Track-statistics-for-streaming.patchapplication/octet-stream; name=v2-0009-Track-statistics-for-streaming.patchDownload
From ebc31a6153b7a4e7887d5d8b7c6c4959c54f4e60 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 2 Dec 2019 09:58:50 +0530
Subject: [PATCH v2 09/13] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a3c5f86..3de62c0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1992,6 +1992,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber after
+      the memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+      Streaming only works with toplevel transactions (subtransactions can't
+      be streamed independently), so the counter does not get incremented for
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the subscriber.
+      Transactions may get streamed repeatedly, and this counter gets incremented
+      on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f7800f0..5897611 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -779,7 +779,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0ab3191..83eb4df 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -358,6 +358,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3732,6 +3736,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	PG_END_TRY();
 
 	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Count the transaction only once, even if it is streamed repeatedly. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
+	/*
 	 * Discard the changes that we just streamed, and mark the transactions
 	 * as streamed (if they contained changes).
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0837264..9f34b3b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1292,7 +1292,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1313,7 +1313,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or were
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2356,6 +2357,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3178,7 +3182,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3236,6 +3240,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3259,6 +3266,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3345,6 +3355,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* statistics about streamed over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3592,12 +3607,19 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillTxns = rb->spillTxns;
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
+
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ac8f64b..3b897a5 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5166,9 +5166,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 7d08e2f..c183fce 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -511,15 +511,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index a6b3205..7efc332 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c9cc569..9fe3dd5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1955,9 +1955,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1
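
A quick way to observe the new counters from the patch above is to sample
pg_stat_replication on the publisher while a walsender is decoding a large
transaction. A minimal sketch (the output aliases are made up; the columns
are the ones added by the patch):

    SELECT application_name,
           spill_txns, spill_count, pg_size_pretty(spill_bytes) AS spilled,
           stream_txns, stream_count, pg_size_pretty(stream_bytes) AS streamed
      FROM pg_stat_replication;

Both groups of counters are cumulative per walsender, so repeated sampling
shows whether the reorder buffer is spilling to disk, streaming to the
subscriber, or both.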

Attachment: v2-0010-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From a97bae3b74b77b77df1ca134f61a764e6cea1bd3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v2 10/13] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 77a1560..8cd1993 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -65,7 +65,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 81547f6..8dfeafc 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1
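
The change above is mechanical: every CREATE SUBSCRIPTION in the TAP tests
gains the streaming option. For manual testing, the equivalent setup would
be something like this (connection string and object names are illustrative
only):

    -- on the publisher
    CREATE PUBLICATION mypub FOR ALL TABLES;

    -- on the subscriber; streaming = on asks the publisher to stream
    -- large in-progress transactions instead of spilling them locally
    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION mypub
        WITH (streaming = on);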

Attachment: v2-0001-Immediately-WAL-log-assignments.patch (application/octet-stream)
From 05fe8c63cc2e81b32aaf72599190fa4521362429 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 09:53:08 +0530
Subject: [PATCH v2 01/13] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So instead we write the assignment info into WAL immediately, as
part of the next WAL record (to minimize overhead).
---
 src/backend/access/transam/xact.c        | 45 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 +++++++++++++--------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5353b6a..708e523 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -5096,6 +5097,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6002,3 +6004,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have an XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and the assignment must not have been marked as done yet */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index aa9dca0..a8a8084 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 67418b0..4435c63 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1165,6 +1165,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1203,6 +1204,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index bc532d0..897b755 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 9d2899d..5b9740c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d519252..b492d3e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 0193611..a676151 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -147,6 +147,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -280,6 +282,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 9375e54..bcfba0a 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1
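
The effect of this patch is easiest to see with a savepoint-heavy session.
Previously the decoder could learn which top-level transaction a subxact
belongs to only from an XLOG_XACT_ASSIGNMENT record (emitted once
PGPROC_MAX_CACHED_SUBXIDS, i.e. 64, subxids accumulate) or from the commit
record; with this patch the first WAL record of each subxact already
carries the top-level XID. An illustrative session (table name made up):

    BEGIN;
    INSERT INTO t VALUES (1);   -- assigns the top-level XID
    SAVEPOINT s1;
    INSERT INTO t VALUES (2);   -- assigns a subxact XID; its first WAL
                                -- record now also carries the top-level XID
    RELEASE SAVEPOINT s1;
    COMMIT;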

Attachment: v2-0002-Issue-individual-invalidations-with-wal_level-log.patch (application/octet-stream)
From a10bea688773350e4262839d9d64df2ec2b7c429 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH v2 02/13] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in
memory and writes them out only once, at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c          | 52 +++++++++++++++++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 23 +++++++++
 src/backend/replication/logical/reorderbuffer.c | 56 +++++++++++++++++---
 src/backend/utils/cache/inval.c                 | 69 +++++++++++++++++++++++++
 src/include/access/xact.h                       | 18 ++++++-
 src/include/replication/reorderbuffer.h         | 14 +++++
 7 files changed, 231 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 4c411c5..6cfd6af 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,11 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -397,6 +402,14 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs,
+								xlrec->dbId, xlrec->tsId,
+								xlrec->relcacheInitFileInval);
+	}
 }
 
 const char *
@@ -424,7 +437,46 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval)
+{
+	int			i;
+
+	if (relcacheInitFileInval)
+		appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
+						 dbId, tsId);
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+			appendStringInfo(buf, " snapshot %u", msg->sn.relId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 708e523..da15556 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6001,6 +6001,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 897b755..9bcefb6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,29 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/* XXX for now we're issuing invalidations one by one */
+				Assert(invals->nmsgs == 1);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->dbId, invals->tsId,
+											 invals->relcacheInitFileInval,
+											 invals->msgs[0]);
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 53affeb..b1feff3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -464,6 +464,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -1804,17 +1805,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					txn->is_schema_sent = false;
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2209,6 +2216,38 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn,
+							 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+							 SharedInvalidationMessage msg)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.dbId = dbId;
+	change->data.inval.tsId = tsId;
+	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
+	change->data.inval.msg = msg;
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2656,6 +2695,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -2752,6 +2792,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -3027,6 +3068,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index f09e3a9..0682c55 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -104,6 +104,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +211,9 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -489,6 +493,18 @@ RegisterCatcacheInvalidation(int cacheId,
 {
 	AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   cacheId, hashValue, dbId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cc.id = (int8) cacheId;
+		msg.cc.dbId = dbId;
+		msg.cc.hashValue = hashValue;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -501,6 +517,18 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 {
 	AddCatalogInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								  dbId, catId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cat.id = SHAREDINVALCATALOG_ID;
+		msg.cat.dbId = dbId;
+		msg.cat.catId = catId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -511,6 +539,8 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 static void
 RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 {
+	bool		RelcacheInitFileInval = false;
+
 	AddRelcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
 
@@ -529,7 +559,22 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 	 * as well.  Also zap when we are invalidating whole relcache.
 	 */
 	if (relId == InvalidOid || RelationIdIsInInitFile(relId))
+	{
 		transInvalInfo->RelcacheInitFileInval = true;
+		RelcacheInitFileInval = true;
+	}
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.rc.id = SHAREDINVALRELCACHE_ID;
+		msg.rc.dbId = dbId;
+		msg.rc.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, RelcacheInitFileInval);
+	}
 }
 
 /*
@@ -1501,3 +1546,27 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval)
+{
+	xl_xact_invalidations xlrec;
+
+	/* prepare record */
+	memset(&xlrec, 0, sizeof(xlrec));
+	xlrec.dbId = MyDatabaseId;
+	xlrec.tsId = MyDatabaseTableSpace;
+	xlrec.relcacheInitFileInval = relcacheInitFileInval;
+	xlrec.nmsgs = nmsgs;
+
+	/* perform insertion */
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+	XLogRegisterData((char *) msgs,
+					 nmsgs * sizeof(SharedInvalidationMessage));
+	XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b9740c..82d4942 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,22 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ *
+ * XXX Currently nmsgs=1 but that might change in the future.
+ */
+typedef struct xl_xact_invalidations
+{
+	Oid			dbId;			/* MyDatabaseId */
+	Oid			tsId;			/* MyDatabaseTableSpace */
+	bool		relcacheInitFileInval;	/* invalidate relcache init file */
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0867ee9..6a7187b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,16 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			Oid			dbId;	/* MyDatabaseId */
+			Oid			tsId;	/* MyDatabaseTableSpace */
+			bool		relcacheInitFileInval;	/* invalidate relcache init
+												 * file */
+			SharedInvalidationMessage msg;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -448,6 +459,9 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  Oid dbId, Oid tsId, bool relcacheInitFileInval,
+								  SharedInvalidationMessage msg);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1
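
The kind of transaction these per-command invalidation records are meant to
handle mixes DML with catalog changes, so that changes decoded later in the
same transaction need the earlier invalidations to have been executed
already. An illustrative sketch (table name made up):

    BEGIN;
    INSERT INTO t VALUES (1);
    ALTER TABLE t ADD COLUMN extra text;  -- now also WAL-logged right away
                                          -- as XLOG_XACT_INVALIDATIONS
    INSERT INTO t VALUES (2, 'two');      -- must be decoded with the new
                                          -- tuple descriptor
    COMMIT;

With commit-time-only invalidations this works when the whole transaction
is decoded at commit, but streaming the transaction while it is still in
progress needs the per-command records added here.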

Attachment: v2-0004-Cleaning-up-of-flags-in-ReorderBufferTXN-structur.patch (application/octet-stream)
From b6825f74dfd9809dc06feb16073eab5714a4a488 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 18:08:37 +0200
Subject: [PATCH v2 04/13] Cleaning up of flags in ReorderBufferTXN structure

---
 src/backend/replication/logical/reorderbuffer.c | 36 ++++++++++++-------------
 src/include/replication/reorderbuffer.h         | 33 ++++++++++++++---------
 2 files changed, 38 insertions(+), 31 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b1feff3..3422939 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -732,7 +732,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_first_lsn < cur_txn->first_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_first_lsn = cur_txn->first_lsn;
 	}
@@ -752,7 +752,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
 	}
@@ -775,7 +775,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
 
 	txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
 
-	Assert(!txn->is_known_as_subxact);
+	Assert(!rbtxn_is_known_subxact(txn));
 	Assert(txn->first_lsn != InvalidXLogRecPtr);
 	return txn;
 }
@@ -835,7 +835,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 
 	if (!new_sub)
 	{
-		if (subtxn->is_known_as_subxact)
+		if (rbtxn_is_known_subxact(subtxn))
 		{
 			/* already associated, nothing to do */
 			return;
@@ -851,7 +851,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 		}
 	}
 
-	subtxn->is_known_as_subxact = true;
+	subtxn->txn_flags |= RBTXN_IS_SUBXACT;
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
@@ -1061,7 +1061,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	{
 		ReorderBufferChange *cur_change;
 
-		if (txn->serialized)
+		if (rbtxn_is_serialized(txn))
 		{
 			/* serialize remaining changes */
 			ReorderBufferSerializeTXN(rb, txn);
@@ -1090,7 +1090,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		{
 			ReorderBufferChange *cur_change;
 
-			if (cur_txn->serialized)
+			if (rbtxn_is_serialized(cur_txn))
 			{
 				/* serialize remaining changes */
 				ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1256,7 +1256,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * they originally were happening inside another subtxn, so we won't
 		 * ever recurse more than one level deep here.
 		 */
-		Assert(subtxn->is_known_as_subxact);
+		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
 		ReorderBufferCleanupTXN(rb, subtxn);
@@ -1304,7 +1304,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/*
 	 * Remove TXN from its containing list.
 	 *
-	 * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
 	 * from the LSN-ordered list of toplevel TXNs.
@@ -1319,7 +1319,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	Assert(found);
 
 	/* remove entries spilled to disk */
-	if (txn->serialized)
+	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
 	/* deallocate */
@@ -1336,7 +1336,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
 		return;
 
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1970,7 +1970,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 			 * final_lsn to that of their last change; this causes
 			 * ReorderBufferRestoreCleanup to do the right thing.
 			 */
-			if (txn->serialized && txn->final_lsn == 0)
+			if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
 			{
 				ReorderBufferChange *last =
 				dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2118,7 +2118,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
 	 * operate on its top-level transaction instead.
 	 */
 	txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 	Assert(txn->base_snapshot == NULL);
@@ -2297,7 +2297,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->has_catalog_changes = true;
+	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2314,7 +2314,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
 	if (txn == NULL)
 		return false;
 
-	return txn->has_catalog_changes;
+	return rbtxn_has_catalog_changes(txn);
 }
 
 /*
@@ -2334,7 +2334,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
 		return false;
 
 	/* a known subtxn? operate on top-level txn instead */
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 
@@ -2522,12 +2522,12 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	rb->spillBytes += size;
 
 	/* Don't consider already serialized transaction. */
-	rb->spillTxns += txn->serialized ? 0 : 1;
+	rb->spillTxns += rbtxn_is_serialized(txn) ? 0 : 1;
 
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
-	txn->serialized = true;
+	txn->txn_flags |= RBTXN_IS_SERIALIZED;
 
 	if (fd != -1)
 		CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5b4be2b..19c7bac 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -169,18 +169,34 @@ typedef struct ReorderBufferChange
 	dlist_node	node;
 } ReorderBufferChange;
 
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT          0x0002
+#define RBTXN_IS_SERIALIZED       0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) ((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn)    ((txn)->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk?  It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn)       (((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0)
+
 typedef struct ReorderBufferTXN
 {
+	int     txn_flags;
+
 	/*
 	 * The transactions transaction id, can be a toplevel or sub xid.
 	 */
 	TransactionId xid;
 
-	/* did the TX have catalog changes */
-	bool		has_catalog_changes;
-
 	/* Do we know this is a subxact?  Xid of top-level txn if so */
-	bool		is_known_as_subxact;
 	TransactionId toplevel_xid;
 
 	/*
@@ -249,15 +265,6 @@ typedef struct ReorderBufferTXN
 	uint64		nentries_mem;
 
 	/*
-	 * Has this transaction been spilled to disk?  It's not always possible to
-	 * deduce that fact by comparing nentries with nentries_mem, because e.g.
-	 * subtransactions of a large transaction might get serialized together
-	 * with the parent - if they're restored to memory they'd have
-	 * nentries_mem == nentries.
-	 */
-	bool		serialized;
-
-	/*
 	 * List of ReorderBufferChange structs, including new Snapshots and new
 	 * CommandIds
 	 */
-- 
1.8.3.1

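A side note on the flags refactoring above: collapsing the per-transaction
booleans into a single txn_flags bitmask keeps ReorderBufferTXN compact and
makes it cheap to add further states later (the streaming patches add more
flags). Below is a minimal, self-contained sketch of the idiom; the DemoTxn
struct, the DEMO_* constants and the main() driver are hypothetical
illustrations, not part of the patch.

#include <stdio.h>

/* flag bits, analogous to the RBTXN_* constants above */
#define DEMO_HAS_CATALOG_CHANGES 0x0001
#define DEMO_IS_SUBXACT          0x0002
#define DEMO_IS_SERIALIZED       0x0004

typedef struct DemoTxn
{
    int         txn_flags;      /* replaces three separate booleans */
} DemoTxn;

/* accessor macro; parenthesized argument, result normalized to 0/1 */
#define demo_is_serialized(txn) \
    (((txn)->txn_flags & DEMO_IS_SERIALIZED) != 0)

int
main(void)
{
    DemoTxn     txn = {0};

    /* set the flag, e.g. after spilling the transaction to disk */
    txn.txn_flags |= DEMO_IS_SERIALIZED;
    printf("serialized: %d\n", demo_is_serialized(&txn));  /* prints 1 */

    /* clearing a flag is equally cheap */
    txn.txn_flags &= ~DEMO_IS_SERIALIZED;
    printf("serialized: %d\n", demo_is_serialized(&txn));  /* prints 0 */

    return 0;
}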
Attachment: v2-0005-Gracefully-handle-concurrent-aborts-of-uncommitte.patch (application/octet-stream)
From 2b3a261b2c80aa84073c39d7be76a42d6e491195 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:32:34 +0530
Subject: [PATCH v2 05/13] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such a sqlerrcode,
the decoding logic aborts the ongoing decoding and returns
gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 51 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 34 +++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  9 +++--
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++-
 src/include/utils/snapmgr.h                     |  4 +-
 6 files changed, 120 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index fc4ad65..da6a6f3 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb3..2a60a73 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1303,6 +1303,17 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_getnext call")));
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1422,6 +1433,16 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_fetch call")));
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1535,6 +1556,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_hot_search_buffer call")));
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1683,6 +1714,16 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_get_latest_tid call")));
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5481,6 +5522,16 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_finish_speculative call")));
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d..201acfb 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,17 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +525,17 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +662,17 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3422939..bda4a1c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -683,7 +683,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1533,7 +1533,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1784,7 +1784,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1804,7 +1804,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
@@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_CATCH();
 	{
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
 
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 47b0517..9fa1e43 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This is to re-check XID status while accessing catalog.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the xid is not yet committed. We don't check
+	 * if the xid aborted. That will happen during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 67b07df..9a8f9ce 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1

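To make the CheckXidAlive protocol above easier to follow, here is a
standalone sketch of the recheck the systable_* scan APIs perform. The
xid_in_progress/xid_did_commit helpers are hypothetical stand-ins for
TransactionIdIsInProgress/TransactionIdDidCommit, and the sketch reports
the failure with printf instead of ereport.

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* set by SetupHistoricSnapshot(), cleared by TeardownHistoricSnapshot() */
static TransactionId CheckXidAlive = InvalidTransactionId;

/* hypothetical stand-ins; here the decoded xid has already aborted */
static bool xid_in_progress(TransactionId xid) { (void) xid; return false; }
static bool xid_did_commit(TransactionId xid) { (void) xid; return false; }

/*
 * Mirrors the check added to systable_getnext() et al.: if the decoded
 * transaction is no longer in progress and did not commit, it must have
 * aborted concurrently, so we must not trust the (possibly already
 * vacuumed) catalog contents any further.
 */
static bool
catalog_scan_step(void)
{
    if (CheckXidAlive != InvalidTransactionId &&
        !xid_in_progress(CheckXidAlive) &&
        !xid_did_commit(CheckXidAlive))
        return false;   /* the real code raises ERRCODE_TRANSACTION_ROLLBACK */
    return true;
}

int
main(void)
{
    CheckXidAlive = 1234;       /* decoding an uncommitted transaction */

    if (!catalog_scan_step())
        printf("transaction aborted during system catalog scan\n");

    CheckXidAlive = InvalidTransactionId;
    return 0;
}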
Attachment: v2-0003-Extend-the-output-plugin-API-with-stream-methods.patch (application/octet-stream)
From 64fa40c143ef89e61590a45f93f8e3dc675af4f2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v2 03/13] Extend the output plugin API with stream methods

This adds seven callbacks to the output plugin API, adding support
for streaming changes of large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 6c33c4b..9c77791 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -542,3 +571,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db9686..fc4ad65 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    transaction size and the network bandwidth, the transfer time
+    may significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similarly to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_work_mem</varname>
+    setting. At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 7e06615..b88b585 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require change/commit/abort callbacks. The
+	 * message callback is optional, similarly to regular output plugins. We
+	 * however consider streaming enabled when at least one of the methods
+	 * is defined, so that missing required methods can be easily detected.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -863,6 +911,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	/* FIXME ctx->write_location = apply_lsn; */
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	/* FIXME ctx->write_location = apply_lsn; */
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 6879a2e..1e934d2 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index d4ce54f..a305462 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6a7187b..5b4be2b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -345,6 +345,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -384,6 +430,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

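For a sense of how a plugin opts into the API above, here is a skeleton
that registers the new stream callbacks. It is a sketch that only compiles
against a tree with this patch applied; the my_stream_* stubs are
hypothetical, and the regular begin/change/commit callbacks (which a real
plugin must also provide) are omitted for brevity. Since
StartupDecodingContext() treats the presence of any stream_* callback as
declaring streaming support, the five required callbacks have to be
registered together.

#include "postgres.h"

#include "fmgr.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);

static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    /* open a block of streamed changes for txn->xid */
}

static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    /* close the currently open block of streamed changes */
}

static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                 Relation relation, ReorderBufferChange *change)
{
    /* emit one change; the transaction may still abort later */
}

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                XLogRecPtr abort_lsn)
{
    /* discard everything streamed so far for this (sub)transaction */
}

static void
my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                 XLogRecPtr commit_lsn)
{
    /* make the previously streamed changes final */
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
    cb->stream_start_cb = my_stream_start;
    cb->stream_stop_cb = my_stream_stop;
    cb->stream_change_cb = my_stream_change;
    cb->stream_abort_cb = my_stream_abort;
    cb->stream_commit_cb = my_stream_commit;
    /* stream_message_cb and stream_truncate_cb stay optional */
}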
Attachment: v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From bd34ab5db8501a4a8ea06bcd53c4386fe660a7ae Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:42:31 +0530
Subject: [PATCH v2 06/13] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke new stream API methods. This
happens in ReorderBufferStreamTXN() using about the same logic as
in ReorderBufferCommit() logic.

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c     |   38 +-
 src/backend/replication/logical/reorderbuffer.c | 1075 ++++++++++++++++++++++-
 src/include/replication/reorderbuffer.h         |   32 +
 3 files changed, 1112 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 3e36467..cf10dd0 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bda4a1c..0ab3191 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -149,6 +149,28 @@ typedef struct ReorderBufferIterTXNState
 	ReorderBufferIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
 } ReorderBufferIterTXNState;
 
+/*
+ * k-way in-order change iteration support structures
+ *
+ * This is a simplified version for streaming, which does not require
+ * serialization to files and only reads changes that are currently in
+ * memory.
+ */
+typedef struct ReorderBufferStreamIterTXNEntry
+{
+	XLogRecPtr	lsn;
+	ReorderBufferChange *change;
+	ReorderBufferTXN *txn;
+}			ReorderBufferStreamIterTXNEntry;
+
+typedef struct ReorderBufferStreamIterTXNState
+{
+	binaryheap *heap;
+	Size		nr_txns;
+	dlist_head	old_change;
+	ReorderBufferStreamIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
+}			ReorderBufferStreamIterTXNState;
+
 /* toast datastructures */
 typedef struct ReorderBufferToastEnt
 {
@@ -213,6 +235,20 @@ static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
 static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
 
+
+/* iterator for streaming (only get data from memory) */
+static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit(
+																		ReorderBuffer *rb,
+																		ReorderBufferTXN *txn);
+
+static ReorderBufferChange *ReorderBufferStreamIterTXNNext(
+							   ReorderBuffer *rb,
+							   ReorderBufferStreamIterTXNState * state);
+
+static void ReorderBufferStreamIterTXNFinish(
+								 ReorderBuffer *rb,
+								 ReorderBufferStreamIterTXNState * state);
+
 /*
  * ---------------------------------------
  * Disk serialization support functions
@@ -227,6 +263,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -235,6 +272,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -362,6 +408,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -759,6 +808,33 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+static void
+AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -855,6 +931,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -978,7 +1057,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1006,6 +1085,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	cur_txn_i;
 	int32		off;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(rb, txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1020,6 +1102,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(rb, cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1235,6 +1320,210 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 }
 
 /*
+ * Binary heap comparison function (streaming iterator).
+ */
+static int
+ReorderBufferStreamIterCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferStreamIterTXNState *state = (ReorderBufferStreamIterTXNState *) arg;
+	XLogRecPtr	pos_a = state->entries[DatumGetInt32(a)].lsn;
+	XLogRecPtr	pos_b = state->entries[DatumGetInt32(b)].lsn;
+
+	if (pos_a < pos_b)
+		return 1;
+	else if (pos_a == pos_b)
+		return 0;
+	return -1;
+}
+
+/*
+ * Allocate & initialize an iterator which iterates in lsn order over a
+ * transaction and all its subtransactions. This version is meant for
+ * streaming of incomplete transactions.
+ */
+static ReorderBufferStreamIterTXNState *
+ReorderBufferStreamIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Size		nr_txns = 0;
+	ReorderBufferStreamIterTXNState *state;
+	dlist_iter	cur_txn_i;
+	int32		off;
+
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(rb, txn);
+
+	/*
+	 * Calculate the size of our heap: one element for every transaction that
+	 * contains changes.  (Besides the transactions already in the reorder
+	 * buffer, we count the one we were directly passed.)
+	 */
+	if (txn->nentries > 0)
+		nr_txns++;
+
+	dlist_foreach(cur_txn_i, &txn->subtxns)
+	{
+		ReorderBufferTXN *cur_txn;
+
+		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(rb, cur_txn);
+
+		if (cur_txn->nentries > 0)
+			nr_txns++;
+	}
+
+	/*
+	 * TODO: Consider adding a fast path for the rather common nr_txns == 1
+	 * case; there would be no need to allocate/build a heap then.
+	 */
+
+	/* allocate iteration state */
+	state = (ReorderBufferStreamIterTXNState *)
+		MemoryContextAllocZero(rb->context,
+							   sizeof(ReorderBufferStreamIterTXNState) +
+							   sizeof(ReorderBufferStreamIterTXNEntry) * nr_txns);
+
+	state->nr_txns = nr_txns;
+	dlist_init(&state->old_change);
+
+	/* allocate heap */
+	state->heap = binaryheap_allocate(state->nr_txns,
+									  ReorderBufferStreamIterCompare,
+									  state);
+
+	/*
+	 * Now insert items into the binary heap, in an unordered fashion.  (We
+	 * will run a heap assembly step at the end; this is more efficient.)
+	 */
+
+	off = 0;
+
+	/* add toplevel transaction if it contains changes */
+	if (txn->nentries > 0)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_head_element(ReorderBufferChange, node,
+										&txn->changes);
+
+		state->entries[off].lsn = cur_change->lsn;
+		state->entries[off].change = cur_change;
+		state->entries[off].txn = txn;
+
+		binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+	}
+
+	/* add subtransactions if they contain changes */
+	dlist_foreach(cur_txn_i, &txn->subtxns)
+	{
+		ReorderBufferTXN *cur_txn;
+
+		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+		if (cur_txn->nentries > 0)
+		{
+			ReorderBufferChange *cur_change;
+
+			cur_change = dlist_head_element(ReorderBufferChange, node,
+											&cur_txn->changes);
+
+			state->entries[off].lsn = cur_change->lsn;
+			state->entries[off].change = cur_change;
+			state->entries[off].txn = cur_txn;
+
+			binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+		}
+	}
+
+	Assert(off == nr_txns);
+
+	/* assemble a valid binary heap */
+	binaryheap_build(state->heap);
+
+	return state;
+}
+
+/*
+ * Return the next change when iterating over a transaction and its
+ * subtransactions.
+ *
+ * Returns NULL when no further changes exist.
+ */
+static ReorderBufferChange *
+ReorderBufferStreamIterTXNNext(ReorderBuffer *rb, ReorderBufferStreamIterTXNState *state)
+{
+	ReorderBufferChange *change;
+	ReorderBufferStreamIterTXNEntry *entry;
+	int32		off;
+
+	/* nothing there anymore */
+	if (state->heap->bh_size == 0)
+		return NULL;
+
+	off = DatumGetInt32(binaryheap_first(state->heap));
+	entry = &state->entries[off];
+
+	/* free memory we might have "leaked" in the previous *Next call */
+	if (!dlist_is_empty(&state->old_change))
+	{
+		change = dlist_container(ReorderBufferChange, node,
+								 dlist_pop_head_node(&state->old_change));
+		ReorderBufferReturnChange(rb, change);
+		Assert(dlist_is_empty(&state->old_change));
+	}
+
+	change = entry->change;
+
+	/*
+	 * update heap with information about which transaction has the next
+	 * relevant change in LSN order
+	 */
+
+	/* there are in-memory changes */
+	if (dlist_has_next(&entry->txn->changes, &entry->change->node))
+	{
+		dlist_node *next = dlist_next_node(&entry->txn->changes, &change->node);
+		ReorderBufferChange *next_change =
+		dlist_container(ReorderBufferChange, node, next);
+
+		/* txn stays the same */
+		state->entries[off].lsn = next_change->lsn;
+		state->entries[off].change = next_change;
+
+		binaryheap_replace_first(state->heap, Int32GetDatum(off));
+		return change;
+	}
+
+	/* ok, no changes there anymore, remove */
+	binaryheap_remove_first(state->heap);
+
+	return change;
+}
+
+/*
+ * Deallocate the iterator
+ */
+static void
+ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
+								 ReorderBufferStreamIterTXNState *state)
+{
+	/* free memory we might have "leaked" in the last *Next call */
+	if (!dlist_is_empty(&state->old_change))
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node,
+								 dlist_pop_head_node(&state->old_change));
+		ReorderBufferReturnChange(rb, change);
+		Assert(dlist_is_empty(&state->old_change));
+	}
+
+	binaryheap_free(state->heap);
+	pfree(state);
+}
+
+/*
  * Cleanup the contents of a transaction, usually after the transaction
  * committed or aborted.
  */
@@ -1327,33 +1616,104 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * Discard changes from a transaction (and its subtransactions) after
+ * streaming them. Keep the remaining info - transactions, tuplecids and
+ * snapshots.
  */
 static void
-ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	dlist_iter	iter;
-	HASHCTL		hash_ctl;
+	dlist_mutable_iter iter;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
-	hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
-	hash_ctl.hcxt = rb->context;
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they were originally started inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
 
-	/*
-	 * create the hash with the exact number of to-be-stored tuplecids from
-	 * the start
-	 */
-	txn->tuplecid_hash =
-		hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
-					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
 
-	dlist_foreach(iter, &txn->tuplecids)
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn == NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * Subtransactions are only marked as streamed when they actually
+	 * contain changes.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
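+	 *
+	 * For example (illustrative): a toplevel xact T with subxacts S1
+	 * (containing changes) and S2 (empty) gets T and S1 marked as
+	 * streamed, but not S2. A later rollback of S2 therefore produces
+	 * no stream_abort for an XID the downstream has never seen.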
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
+ * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We build the hash table even if there are no CIDs. That's because when
+ * streaming in-progress transactions we may run into tuples with CIDs
+ * before actually decoding the changes that set them. Think e.g. of an
+ * INSERT followed by a TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build the hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding a transaction at commit time (at which point it's
+ * guaranteed to have seen all CIDs).
+ */
+static void
+ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_iter	iter;
+	HASHCTL		hash_ctl;
+
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+
+	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
+	hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
+	hash_ctl.hcxt = rb->context;
+
+	/*
+	 * create the hash with the exact number of to-be-stored tuplecids from
+	 * the start
+	 */
+	txn->tuplecid_hash =
+		hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	dlist_foreach(iter, &txn->tuplecids)
 	{
 		ReorderBufferTupleCidKey key;
 		ReorderBufferTupleCidEnt *ent;
@@ -1403,6 +1763,16 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 }
 
+static void
+ReorderBufferDestroyTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+}
+
 /*
  * Copy a provided snapshot so we can modify it privately. This is needed so
  * that catalog modifying transactions can look into intermediate catalog
@@ -1476,6 +1846,19 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 		SnapBuildSnapDecRefcount(snap);
 }
 
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
+
+	ReorderBufferStreamTXN(rb, txn);
+
+	rb->stream_commit(rb, txn, txn->final_lsn);
+
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
 /*
  * Perform the replay of a transaction and its non-aborted subtransactions.
  *
@@ -1515,6 +1898,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
 	 * If this transaction has no snapshot, it didn't make any changes to the
 	 * database, so there's nothing to decode.  Note that
 	 * ReorderBufferCommitChild will have transferred any snapshots from
@@ -1549,6 +1948,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
+		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
@@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			if (prev_lsn != InvalidXLogRecPtr)
+				Assert(prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1930,6 +2340,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2014,6 +2431,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2149,8 +2573,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we update the toplevel transaction's
+ * counters instead - subtransactions can't be streamed individually
+ * anyway, and we only ever pick toplevel transactions for eviction.
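+ *
+ * For example, a change belonging to a subxact of toplevel xact T is
+ * accounted to T->size (and rb->size); the subxact's own counter stays
+ * at 0, which is why ReorderBufferLargestTopTXN only needs to scan
+ * toplevel transactions.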
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2158,6 +2591,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2169,19 +2603,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2285,6 +2729,9 @@ ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	for (i = 0; i < txn->ninvalidations; i++)
 		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+
+	/* Invalidate current schema as well */
+	txn->is_schema_sent = false;
 }
 
 /*
@@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * We just read catalog changes from WAL that have not been sent
+	 * downstream yet, so invalidate the current schema so that the
+	 * output plugin can resend it.
+	 */
+	txn->is_schema_sent = false;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+	{
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		txn->toptxn->is_schema_sent = false;
+	}
 }
 
 /*
@@ -2403,6 +2867,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (with streaming we don't update the
+ * memory accounting for subtransactions, so their size is always 0), but it
+ * only needs to iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2422,15 +2918,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2723,6 +3250,498 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes left to
+ * stream? It may have been streamed just before the commit, in which
+ * case the commit would attempt to stream it again.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+	bool		using_subtxn;
+	Size		streamed = 0;
+	ReorderBufferStreamIterTXNState *volatile iterstate = NULL;
+
+	/*
+	 * If this is a subxact, we need to stream the top-level transaction
+	 * instead.
+	 */
+	if (txn->toptxn)
+	{
+		ReorderBufferStreamTXN(rb, txn->toptxn);
+		return;
+	}
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* this must be the first time this transaction is being streamed */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+
+			if (subtxn->base_snapshot != NULL &&
+				(txn->base_snapshot == NULL ||
+				 txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
+			{
+				txn->base_snapshot = subtxn->base_snapshot;
+				txn->base_snapshot_lsn = subtxn->base_snapshot_lsn;
+				subtxn->base_snapshot = NULL;
+				subtxn->base_snapshot_lsn = InvalidXLogRecPtr;
+			}
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must already have been streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't
+		 * beat the LSN condition in the previous branch (so there's no need
+		 * to walk through the subxacts again). In fact, we must not do that,
+		 * as we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * TOCHECK: We have to rebuild the historic snapshot to be sure it
+		 * includes information about subtransactions that may have started
+		 * after the previous streaming run.
+		 */
+		if (!txn->is_schema_sent)
+			snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+												 txn, command_id);
+		else
+			snapshot_now = txn->snapshot_now;
+	}
+
+	/*
+	 * build data to be able to look up the CommandIds of catalog tuples
+	 */
+	ReorderBufferBuildTupleCidHash(rb, txn);
+
+	/* setup the initial snapshot */
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
+
+	/*
+	 * Decoding needs access to syscaches et al., which in turn use
+	 * heavyweight locks and such. Thus we need to have enough state around to
+	 * keep track of those.  The easiest way is to simply use a transaction
+	 * internally.  That also allows us to easily enforce that nothing writes
+	 * to the database by checking for xid assignments.
+	 *
+	 * When we're called via the SQL SRF there's already a transaction
+	 * started, so start an explicit subtransaction there.
+	 */
+	using_subtxn = IsTransactionOrTransactionBlock();
+
+	PG_TRY();
+	{
+		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+		ReorderBufferChange *change;
+		ReorderBufferChange *specinsert = NULL;
+
+		if (using_subtxn)
+			BeginInternalSubTransaction("stream");
+		else
+			StartTransactionCommand();
+
+		/* start streaming this chunk of transaction */
+		rb->stream_start(rb, txn);
+
+		iterstate = ReorderBufferStreamIterTXNInit(rb, txn);
+		while ((change = ReorderBufferStreamIterTXNNext(rb, iterstate)) != NULL)
+		{
+			Relation	relation = NULL;
+			Oid			reloid;
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			if (prev_lsn != InvalidXLogRecPtr)
+				Assert(prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* we're going to stream this change */
+			streamed++;
+
+			switch (change->action)
+			{
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+
+					/*
+					 * Confirmation for speculative insertion arrived. Simply
+					 * use as a normal record. It'll be cleaned up at the end
+					 * of INSERT processing.
+					 */
+					Assert(specinsert->data.tp.oldtuple == NULL);
+					change = specinsert;
+					change->action = REORDER_BUFFER_CHANGE_INSERT;
+
+					/* intentionally fall through */
+				case REORDER_BUFFER_CHANGE_INSERT:
+				case REORDER_BUFFER_CHANGE_UPDATE:
+				case REORDER_BUFFER_CHANGE_DELETE:
+					Assert(snapshot_now);
+
+					reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
+												change->data.tp.relnode.relNode);
+
+					/*
+					 * Catalog tuple without data, emitted while catalog was
+					 * in the process of being rewritten.
+					 */
+					if (reloid == InvalidOid &&
+						change->data.tp.newtuple == NULL &&
+						change->data.tp.oldtuple == NULL)
+						goto change_done;
+					else if (reloid == InvalidOid)
+						elog(ERROR, "could not map filenode \"%s\" to relation OID",
+							 relpathperm(change->data.tp.relnode,
+										 MAIN_FORKNUM));
+
+					relation = RelationIdGetRelation(reloid);
+
+					if (relation == NULL)
+						elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
+							 reloid,
+							 relpathperm(change->data.tp.relnode,
+										 MAIN_FORKNUM));
+
+					if (!RelationIsLogicallyLogged(relation))
+						goto change_done;
+
+					/*
+					 * For now ignore sequence changes entirely. Most of the
+					 * time they don't log changes using records we
+					 * understand, so it doesn't make sense to handle the few
+					 * cases we do.
+					 */
+					if (relation->rd_rel->relkind == RELKIND_SEQUENCE)
+						goto change_done;
+
+					/* user-triggered change */
+					if (!IsToastRelation(relation))
+					{
+						ReorderBufferToastReplace(rb, txn, relation, change);
+						rb->stream_change(rb, txn, relation, change);
+
+						/*
+						 * Only clear reassembled toast chunks if we're sure
+						 * they're not required anymore. The creator of the
+						 * tuple tells us.
+						 */
+						if (change->data.tp.clear_toast_afterwards)
+							ReorderBufferToastReset(rb, txn);
+					}
+					/* we're not interested in toast deletions */
+					else if (change->action == REORDER_BUFFER_CHANGE_INSERT)
+					{
+						/*
+						 * Need to reassemble the full toasted Datum in
+						 * memory, to ensure the chunks don't get reused till
+						 * we're done. Remove it from the list of this
+						 * transaction's changes. Otherwise it will get
+						 * freed/reused while restoring spooled data from
+						 * disk.
+						 */
+						dlist_delete(&change->node);
+						ReorderBufferToastAppendChunk(rb, txn, relation,
+													  change);
+					}
+
+			change_done:
+
+					/*
+					 * Either speculative insertion was confirmed, or it was
+					 * unsuccessful and the record isn't needed anymore.
+					 */
+					if (specinsert != NULL)
+					{
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+
+					if (relation != NULL)
+					{
+						RelationClose(relation);
+						relation = NULL;
+					}
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+
+					/*
+					 * Speculative insertions are dealt with by delaying the
+					 * processing of the insert until the confirmation record
+					 * arrives. For that we simply unlink the record from the
+					 * chain, so it does not get freed/reused while restoring
+					 * spooled data from disk.
+					 *
+					 * This is safe in the face of concurrent catalog changes
+					 * because the relevant relation can't be changed between
+					 * speculative insertion and confirmation due to
+					 * CheckTableNotInUse() and locking.
+					 */
+
+					/* clear out a pending (and thus failed) speculation */
+					if (specinsert != NULL)
+					{
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+
+					/* and memorize the pending insertion */
+					dlist_delete(&change->node);
+					specinsert = change;
+					break;
+
+				case REORDER_BUFFER_CHANGE_TRUNCATE:
+					{
+						int			i;
+						int			nrelids = change->data.truncate.nrelids;
+						int			nrelations = 0;
+						Relation   *relations;
+
+						relations = palloc0(nrelids * sizeof(Relation));
+						for (i = 0; i < nrelids; i++)
+						{
+							Oid			relid = change->data.truncate.relids[i];
+							Relation	relation;
+
+							relation = RelationIdGetRelation(relid);
+
+							if (relation == NULL)
+								elog(ERROR, "could not open relation with OID %u", relid);
+
+							if (!RelationIsLogicallyLogged(relation))
+								continue;
+
+							relations[nrelations++] = relation;
+						}
+
+						rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+						for (i = 0; i < nrelations; i++)
+							RelationClose(relations[i]);
+
+						break;
+					}
+
+				case REORDER_BUFFER_CHANGE_MESSAGE:
+
+					rb->stream_message(rb, txn, change->lsn, true,
+									   change->data.msg.prefix,
+									   change->data.msg.message_size,
+									   change->data.msg.message);
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+					/* get rid of the old */
+					TeardownHistoricSnapshot(false);
+
+					if (snapshot_now->copied)
+					{
+						ReorderBufferFreeSnap(rb, snapshot_now);
+						snapshot_now =
+							ReorderBufferCopySnap(rb, change->data.snapshot,
+												  txn, command_id);
+					}
+
+					/*
+					 * Restored from disk, need to be careful not to double
+					 * free. We could introduce refcounting for that, but for
+					 * now this seems infrequent enough not to care.
+					 */
+					else if (change->data.snapshot->copied)
+					{
+						snapshot_now =
+							ReorderBufferCopySnap(rb, change->data.snapshot,
+												  txn, command_id);
+					}
+					else
+					{
+						snapshot_now = change->data.snapshot;
+					}
+
+					/*
+					 * TOCHECK: The snapshot changed, so invalidate the
+					 * current schema to reflect possible catalog changes.
+					 */
+					txn->is_schema_sent = false;
+
+					/* and continue with the new one */
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+					Assert(change->data.command_id != InvalidCommandId);
+
+					if (command_id < change->data.command_id)
+					{
+						command_id = change->data.command_id;
+
+						if (!snapshot_now->copied)
+						{
+							/* we don't use the global one anymore */
+							snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+																 txn, command_id);
+						}
+
+						snapshot_now->curcid = command_id;
+
+						TeardownHistoricSnapshot(false);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
+					}
+
+					break;
+
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					txn->is_schema_sent = false;
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+					elog(ERROR, "tuplecid value in changequeue");
+					break;
+			}
+		}
+
+		/*
+		 * There's a speculative insertion remaining; just clean it up, it
+		 * can't have been successful, otherwise we'd have gotten a
+		 * confirmation record.
+		 */
+		if (specinsert)
+		{
+			ReorderBufferReturnChange(rb, specinsert);
+			specinsert = NULL;
+		}
+
+		/* clean up the iterator */
+		ReorderBufferStreamIterTXNFinish(rb, iterstate);
+		iterstate = NULL;
+
+		/* call stream_stop callback */
+		rb->stream_stop(rb, txn);
+
+		/* this is just a sanity check against bad output plugin behaviour */
+		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
+			elog(ERROR, "output plugin used XID %u",
+				 GetCurrentTransactionId());
+
+		/* remember the command ID and snapshot for the next streaming run */
+		txn->command_id = command_id;
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+
+		/* cleanup */
+		TeardownHistoricSnapshot(false);
+
+		/*
+		 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+		 * any memory. We could also keep the hash table and update it with
+		 * new ctid values, but this seems simpler and good enough for now.
+		 */
+		ReorderBufferDestroyTupleCidHash(rb, txn);
+
+		/*
+		 * Aborting the current (sub-)transaction as a whole has the right
+		 * semantics. We want all locks acquired in here to be released, not
+		 * reassigned to the parent and we do not want any database access
+		 * reassigned to the parent, and we do not want any database access
+		 * to have persistent effects.
+		AbortCurrentTransaction();
+
+		/* make sure there's no cache pollution */
+		ReorderBufferExecuteInvalidations(rb, txn);
+
+		if (using_subtxn)
+			RollbackAndReleaseCurrentSubTransaction();
+	}
+	PG_CATCH();
+	{
+		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+		if (iterstate)
+			ReorderBufferStreamIterTXNFinish(rb, iterstate);
+
+		TeardownHistoricSnapshot(true);
+
+		/*
+		 * Force cache invalidation to happen outside of a valid transaction
+		 * to prevent catalog access as we just caught an error.
+		 */
+		AbortCurrentTransaction();
+
+		/* make sure there's no cache pollution */
+		ReorderBufferExecuteInvalidations(rb, txn);
+
+		if (using_subtxn)
+			RollbackAndReleaseCurrentSubTransaction();
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/*
+	 * Discard the changes that we just streamed, and mark the transactions
+	 * as streamed (if they contained changes).
+	 */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 19c7bac..7d08e2f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* does the txn have catalog changes */
 #define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -187,6 +188,20 @@ typedef struct ReorderBufferChange
  */
 #define rbtxn_is_serialized(txn)       (txn->txn_flags & RBTXN_IS_SERIALIZED)
 
+/*
+ * Has this transaction been streamed to the downstream? Similarly to
+ * spilling to disk, it's not trivial to deduce this from nentries and
+ * nentries_mem. For example, all changes may be in subtransactions, in
+ * which case we'd have nentries == 0 for the toplevel one, which says
+ * nothing about streaming. So we maintain this flag, but only for the
+ * toplevel transaction.
+ *
+ * Note: We never stream and serialize a transaction at the same time (we
+ * only spill to disk when streaming is not supported by the plugin),
+ * so only one of those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn)         (txn->txn_flags & RBTXN_IS_STREAMED)
+
 typedef struct ReorderBufferTXN
 {
 	int     txn_flags;
@@ -222,6 +237,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Has the schema for this transaction already been sent downstream?
+	 * Reset on invalidations, so that the output plugin resends it.
+	 */
+	bool		is_schema_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -252,6 +277,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
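
To make the new callbacks concrete, here is a minimal sketch of the
streaming hooks as the reorder buffer invokes them above. The signatures
are inferred from the call sites in this patch, and the demo_* names are
made up; a real plugin wires these up through the output plugin API
rather than assigning the rb->stream_* fields directly (stream_message
and stream_truncate follow the same pattern):

    #include "replication/reorderbuffer.h"
    #include "utils/rel.h"

    /* illustrative sketch only, not part of the patch */
    static void
    demo_stream_start(ReorderBuffer *rb, ReorderBufferTXN *txn)
    {
        /* open a block of streamed changes for txn->xid downstream */
    }

    static void
    demo_stream_change(ReorderBuffer *rb, ReorderBufferTXN *txn,
                       Relation relation, ReorderBufferChange *change)
    {
        /* send one change; the downstream buffers it under txn->xid */
    }

    static void
    demo_stream_stop(ReorderBuffer *rb, ReorderBufferTXN *txn)
    {
        /* close the current block of streamed changes */
    }

    static void
    demo_stream_abort(ReorderBuffer *rb, ReorderBufferTXN *txn,
                      XLogRecPtr abort_lsn)
    {
        /* discard everything buffered for the aborted (sub)xact */
    }

    static void
    demo_stream_commit(ReorderBuffer *rb, ReorderBufferTXN *txn,
                       XLogRecPtr commit_lsn)
    {
        /* apply everything buffered for txn->xid, then clean up */
    }

Each stream_start/stream_stop pair brackets one chunk of a single
toplevel transaction, matching how ReorderBufferStreamTXN calls them.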

Attachment: v2-0007-Support-logical_decoding_work_mem-set-from-create.patch (application/octet-stream)
From 8f7065a4c1347fcf27b89650ab4a6cc6e9d6d012 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:51:04 +0530
Subject: [PATCH v2 07/13] Support logical_decoding_work_mem set from create
 subscription command

---
 doc/src/sgml/config.sgml                           | 21 +++++++++++
 doc/src/sgml/ref/create_subscription.sgml          | 12 ++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/commands/subscriptioncmds.c            | 44 +++++++++++++++++++---
 .../libpqwalreceiver/libpqwalreceiver.c            |  3 ++
 src/backend/replication/logical/worker.c           |  1 +
 src/backend/replication/pgoutput/pgoutput.c        | 30 ++++++++++++++-
 src/include/catalog/pg_subscription.h              |  3 ++
 src/include/replication/walreceiver.h              |  1 +
 9 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4ec13f3..3c32686 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1753,6 +1753,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..91790b0 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 68d88ff..2a27648 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5408edc..fbb4473 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +690,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +721,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +778,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +815,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 545d2fc..0ab6855 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ced0d19..ebb976c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1742,6 +1742,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 3483c1b..cf6e03b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -18,6 +18,7 @@
 #include "replication/logicalproto.h"
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
+#include "utils/guc.h"
 #include "utils/int8.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
@@ -87,11 +88,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +139,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -171,7 +196,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3cb13d8..10ea113 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 41714ea..1db706a 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -162,6 +162,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
1.8.3.1
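
As a usage illustration (the value is made up, not from the patch): a
subscription created with "CREATE SUBSCRIPTION sub ... WITH (work_mem =
65536)" stores the value in pg_subscription.subworkmem, the apply worker
copies it into WalRcvStreamOptions, libpqwalreceiver forwards it as the
work_mem option of START_REPLICATION, and pgoutput validates it
(64 .. PG_INT32_MAX) before overriding logical_decoding_work_mem for that
walsender. The units are presumably kilobytes, matching the GUC and its
lower bound of 64.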

Attachment: v2-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch (application/octet-stream)
From e468e357589eb43acb2b1117f9b3ada7965c86e1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH v2 11/13] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 83eb4df..410da36 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1544,6 +1544,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
+		/*
+		 * Make sure the subtxn has a final_lsn set. It may be unset if we
+		 * never decoded a commit or abort record for it, so fall back to
+		 * the toplevel transaction's final_lsn.
+		 */
+		if (subtxn->final_lsn == InvalidXLogRecPtr)
+			subtxn->final_lsn = txn->final_lsn;
+
 		/*
 		 * Subtransactions are always associated with the toplevel TXN, even
 		 * if they were originally started inside another subtxn, so we won't
-- 
1.8.3.1

Attachment: v2-0012-Add-TAP-test-for-streaming-vs.-DDL.patch (application/octet-stream)
From 24b2c427831ba1e268f834ea368f796b38a81ad1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v2 12/13] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of a large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: v2-0013-Extend-handling-of-concurrent-aborts-for-streamin.patch (application/octet-stream)
From a0b6c483c0180771e9fd0a21f13aa3ce924fb0db Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 22 Nov 2019 12:43:38 +0530
Subject: [PATCH v2 13/13] Extend handling of concurrent aborts for streaming
 transaction

---
 src/backend/replication/logical/reorderbuffer.c | 38 +++++++++++++++++++++++--
 src/include/replication/reorderbuffer.h         |  5 ++++
 2 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 410da36..710a22f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2350,9 +2350,9 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 
 	/*
 	 * When the (sub)transaction was streamed, notify the remote node
-	 * about the abort.
+	 * about the abort only if we have sent any data for this transaction.
 	 */
-	if (rbtxn_is_streamed(txn))
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
 		rb->stream_abort(rb, txn, lsn);
 
 	/* cosmetic... */
@@ -3281,6 +3281,7 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	volatile CommandId command_id;
 	bool		using_subtxn;
 	Size		streamed = 0;
+	MemoryContext ccxt = CurrentMemoryContext;
 	ReorderBufferStreamIterTXNState *volatile iterstate = NULL;
 
 	/*
@@ -3411,6 +3412,13 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			/* we're going to stream this change */
 			streamed++;
 
+			/*
+			 * Set the CheckXidAlive to the current (sub)xid for which this
+			 * change belongs to so that we can detect the abort while we are
+			 * decoding.
+			 */
+			CheckXidAlive = change->txn->xid;
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -3472,6 +3480,10 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 						ReorderBufferToastReplace(rb, txn, relation, change);
 						rb->stream_change(rb, txn, relation, change);
 
+						/* Remember that we have sent some data for this txn.*/
+						if (!change->txn->any_data_sent)
+							change->txn->any_data_sent = true;
+
 						/*
 						 * Only clear reassembled toast chunks if we're sure
 						 * they're not required anymore. The creator of the
@@ -3654,6 +3666,8 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 					 * about the message itself?
 					 */
 					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+
+					/* Invalidate current schema as well */
 					txn->is_schema_sent = false;
 					break;
 
@@ -3717,6 +3731,9 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferStreamIterTXNFinish(rb, iterstate);
@@ -3735,7 +3752,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		PG_RE_THROW();
+		/* re-throw only if it's not an abort */
+		if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+		else
+		{
+			/* remember the command ID and snapshot for the streaming run */
+			txn->command_id = command_id;
+			txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+													  txn, command_id);
+			rb->stream_stop(rb, txn);
+
+			FlushErrorState();
+		}
 	}
 	PG_END_TRY();
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index c183fce..1db6da6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -242,6 +242,11 @@ typedef struct ReorderBufferTXN
 	bool		is_schema_sent;
 
 	/*
+	 * Have we sent any changes for this transaction in output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
 	 * Toplevel transaction for this subxact (NULL for top-level).
 	 */
 	struct ReorderBufferTXN *toptxn;
-- 
1.8.3.1

#148Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#147)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Dec 2, 2019 at 2:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:

I have rebased the patch on the latest head and also fix the issue of
"concurrent abort handling of the (sub)transaction." and attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set. I have added the version number so that we
can track the changes.

The patch has rotten a bit and does not apply anymore. Could you
please send a rebased version? I have moved it to next CF, waiting on
author.

I have rebased the patch set on the latest head.

I have reviewed the patch set and here are a few comments/questions.

1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

Should we show the tuple in the streamed change, like we do for
pg_decode_change?

2. pg_logical_slot_get_changes_guts
It recreates the decoding slot [ctx =
CreateDecodingContext(InvalidXLogRecPtr)] but doesn't set streaming
to false. Should we pass a parameter to
pg_logical_slot_get_changes_guts saying whether we want streamed results or not?

3.
+ XLogRecPtr prev_lsn = InvalidXLogRecPtr;
ReorderBufferChange *change;
ReorderBufferChange *specinsert = NULL;

@@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
Relation relation = NULL;
Oid reloid;

+ /*
+ * Enforce correct ordering of changes, merged from multiple
+ * subtransactions. The changes may have the same LSN due to
+ * MULTI_INSERT xlog records.
+ */
+ if (prev_lsn != InvalidXLogRecPtr)
+ Assert(prev_lsn <= change->lsn);
+
+ prev_lsn = change->lsn;
I did not understand how this change is relevant to this patch.
4.
+ /*
+ * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+ * information about subtransactions, which could arrive after streaming start.
+ */
+ if (!txn->is_schema_sent)
+ snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+ txn, command_id);

In which case will txn->is_schema_sent be true? At the end of
the stream, in ReorderBufferExecuteInvalidations, we always set
it to false,
so while sending the next stream it will always be false. That means we
never require the snapshot_now variable in ReorderBufferTXN.

5.
@@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

  txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * We read catalog changes from WAL, which are not yet sent, so
+ * invalidate current schema in order output plugin can resend
+ * schema again.
+ */
+ txn->is_schema_sent = false;

Same as point 4, during decode time it will never be true.

6.
+ /* send fields */
+ pq_sendint64(out, commit_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);

Commit_time and end_lsn are used in standby_feedback

7.
+ /* FIXME optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
We cannot roll back an intermediate subtransaction without rolling back the
latest subtransaction, so why do we need
to search the array?  It will always be the last subxact, no?
8.
+ /*
+ * send feedback to upstream
+ *
+ * XXX Probably should send a valid LSN. But which one?
+ */
+ send_feedback(InvalidXLogRecPtr, false, false);

Why is feedback sent for every change?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#149Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#147)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:

I have rebased the patch on the latest head and also fix the issue of
"concurrent abort handling of the (sub)transaction." and attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set. I have added the version number so that we
can track the changes.

The patch has rotten a bit and does not apply anymore. Could you
please send a rebased version? I have moved it to next CF, waiting on
author.

I have rebased the patch set on the latest head.

Apart from this, there is one issue reported by my colleague Vignesh.
The issue is that if we use more than two relations in a transaction
then there is an error on standby (no relation map entry for remote
relation ID 16390). After analyzing I have found that for the
streaming transaction an "is_schema_sent" flag is kept in
ReorderBufferTXN. And, I think that is done so that we can send the
schema for each transaction stream so that if any subtransaction gets
aborted we don't lose the logical WAL for that schema. But, this
solution has induced a very basic issue: if a transaction operates
on more than one relation, then after sending the schema for the first
relation it will mark the flag true, and the schema for the subsequent
relations will never be sent.

How about keeping a list of top-level xids in each RelationSyncEntry?
Basically, whenever we send the schema for any transaction, we note
that in RelationSyncEntry and at abort time we can remove xid from the
list. Now, whenever, we check whether to send schema for any
operation in a transaction, we will check if our xid is present in
that list for a particular RelationSyncEntry and take an action based
on that (if xid is present, then we won't send the schema, otherwise,
send it).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#150Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#149)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:

I have rebased the patch on the latest head and also fix the issue of
"concurrent abort handling of the (sub)transaction." and attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set. I have added the version number so that we
can track the changes.

The patch has rotten a bit and does not apply anymore. Could you
please send a rebased version? I have moved it to next CF, waiting on
author.

I have rebased the patch set on the latest head.

Apart from this, there is one issue reported by my colleague Vignesh.
The issue is that if we use more than two relations in a transaction
then there is an error on standby (no relation map entry for remote
relation ID 16390). After analyzing I have found that for the
streaming transaction an "is_schema_sent" flag is kept in
ReorderBufferTXN. And, I think that is done so that we can send the
schema for each transaction stream so that if any subtransaction gets
aborted we don't lose the logical WAL for that schema. But, this
solution has induced a very basic issue: if a transaction operates
on more than one relation, then after sending the schema for the first
relation it will mark the flag true, and the schema for the subsequent
relations will never be sent.

How about keeping a list of top-level xids in each RelationSyncEntry?
Basically, whenever we send the schema for any transaction, we note
that in RelationSyncEntry and at abort time we can remove xid from the
list. Now, whenever, we check whether to send schema for any
operation in a transaction, we will check if our xid is present in
that list for a particular RelationSyncEntry and take an action based
on that (if xid is present, then we won't send the schema, otherwise,
send it).

The idea makes sense to me. I will try to write a patch for this and test it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#151Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#148)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have reviewed the patch set and here are a few comments/questions.

1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

Should we show the tuple in the streamed change, like we do for
pg_decode_change?

I think so. The patch shows the message in
pg_decode_stream_message(), so why prohibit showing the tuple here?
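
A rough sketch of what that could look like, reusing test_decoding's
internal tuple_to_stringinfo() helper the way pg_decode_change does
(insert case only; just an illustration, not the patch's code):

static void
pg_decode_stream_change(LogicalDecodingContext *ctx,
						ReorderBufferTXN *txn,
						Relation relation,
						ReorderBufferChange *change)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);

	/* dump the new tuple, as pg_decode_change does for inserts */
	if (change->action == REORDER_BUFFER_CHANGE_INSERT &&
		change->data.tp.newtuple != NULL)
	{
		appendStringInfoString(ctx->out, " new-tuple:");
		tuple_to_stringinfo(ctx->out, RelationGetDescr(relation),
							&change->data.tp.newtuple->tuple, false);
	}

	OutputPluginWrite(ctx, true);
}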

2. pg_logical_slot_get_changes_guts
It recreates the decoding slot [ctx =
CreateDecodingContext(InvalidXLogRecPtr)] but doesn't set streaming
to false. Should we pass a parameter to
pg_logical_slot_get_changes_guts saying whether we want streamed results or not?

CreateDecodingContext internally calls StartupDecodingContext which
sets the value of streaming based on if the plugin has provided
callbacks for streaming functions. Isn't that sufficient? Why do we
need additional parameters here?

3.
+ XLogRecPtr prev_lsn = InvalidXLogRecPtr;
ReorderBufferChange *change;
ReorderBufferChange *specinsert = NULL;

@@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
Relation relation = NULL;
Oid reloid;

+ /*
+ * Enforce correct ordering of changes, merged from multiple
+ * subtransactions. The changes may have the same LSN due to
+ * MULTI_INSERT xlog records.
+ */
+ if (prev_lsn != InvalidXLogRecPtr)
+ Assert(prev_lsn <= change->lsn);
+
+ prev_lsn = change->lsn;
I did not understand how this change is relevant to this patch.

This is just to ensure that changes are in LSN order. I think as we
are merging the changes before commit for streaming, it is good to
have such an Assertion for ReorderBufferStreamTXN. And, if we want
to have it in ReorderBufferStreamTXN, then there is no harm in keeping
it in ReorderBufferCommit() at least to keep the code consistent. Do
you see any problem with this?

4.
+ /*
+ * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+ * information about subtransactions, which could arrive after streaming start.
+ */
+ if (!txn->is_schema_sent)
+ snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+ txn, command_id);

In which case will txn->is_schema_sent be true? At the end of
the stream, in ReorderBufferExecuteInvalidations, we always set
it to false,
so while sending the next stream it will always be false. That means we
never require the snapshot_now variable in ReorderBufferTXN.

You are probably right, but as discussed we need to change this part
of design/code (when to send schema changes) due to the issues
discovered. So, I think this part will anyway change when we fix that
problem.

5.
@@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * We read catalog changes from WAL, which are not yet sent, so
+ * invalidate current schema in order output plugin can resend
+ * schema again.
+ */
+ txn->is_schema_sent = false;

Same as point 4, during decode time it will never be true.

Sure, my previous point's reply applies here as well.

6.
+ /* send fields */
+ pq_sendint64(out, commit_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);

Commit_time and end_lsn are used in standby_feedback

I don't understand what you mean by this. Can you be a bit more clear?

7.
+ /* FIXME optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
We cannot roll back an intermediate subtransaction without rolling back the
latest subtransaction, so why do we need
to search the array?  It will always be the last subxact, no?

The same thing is already mentioned in the comments above this code
("XXX Or perhaps we can rely on the aborts to arrive in the reverse
order, i.e. from the inner-most subxact (when nested)? In which case
we could simply check the last element."). I think what you are
saying is probably right, but we can leave this as it is for now
because this is a minor optimization which can be done later as well
if required. However, if you see any correctness issue, then we can
discuss.

8.
+ /*
+ * send feedback to upstream
+ *
+ * XXX Probably should send a valid LSN. But which one?
+ */
+ send_feedback(InvalidXLogRecPtr, false, false);

Why is feedback sent for every change?

I will study this part of the patch and let you know my opinion.

Few comments on this patch series:

0001-Immediately-WAL-log-assignments:
------------------------------------------------------------

The commit message still refers to the old design for this patch. I
think you need to modify the commit message as per the latest patch.

0002-Issue-individual-invalidations-with-wal_level-log
----------------------------------------------------------------------------
1.
xact_desc_invalidations(StringInfo buf,
{
..
+ else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+ appendStringInfo(buf, " snapshot %u", msg->sn.relId);

You have removed logging for the above cache but forgot to remove its
reference from one of the places. Also, I think you need to add a
comment somewhere in inval.c to say why you are writing WAL for
some types of invalidations and not for others?

0003-Extend-the-output-plugin-API-with-stream-methods
--------------------------------------------------------------------------------
1.
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_message_cb</function> are optional.

stream_message_cb is mentioned twice. It seems the second one is for truncate.

2.
size of the transaction size and network bandwidth, the transfer time
+ may significantly increase the apply lag.

/size of the transaction size/size of the transaction

no need to mention size twice.

3.
+    Similarly to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress
transactions)
+    exceeds limit defined by <varname>logical_work_mem</varname> setting.

The guc name used is wrong. /Similarly to/Similar to/

4.
stream_start_cb_wrapper()
{
..
+ /* state.report_location = apply_lsn; */
..
+ /* FIXME ctx->write_location = apply_lsn; */
..
}

See if we can fix these, and similar ones in the callback for the stop. I
think we don't have final_lsn till we commit/abort. Can we compute it
before calling these APIs?

0005-Gracefully-handle-concurrent-aborts-of-uncommitte
----------------------------------------------------------------------------------
1.
@@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_CATCH();
 	{
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);

Spurious line change.

2. The commit message of this patch refers to Prepared transactions.
I think that needs to be changed.

0006-Implement-streaming-mode-in-ReorderBuffer
-------------------------------------------------------------------------
1.
+
+/* iterator for streaming (only get data from memory) */
+static ReorderBufferStreamIterTXNState *ReorderBufferStreamIterTXNInit(ReorderBuffer *rb,
+                                                                       ReorderBufferTXN *txn);
+
+static ReorderBufferChange *ReorderBufferStreamIterTXNNext(ReorderBuffer *rb,
+                                                           ReorderBufferStreamIterTXNState *state);
+
+static void ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
+                                             ReorderBufferStreamIterTXNState *state);

Do we really need to introduce new APIs for iterating over changes
from streamed transactions? Why can't we reuse the same APIs as we
use for committed xacts?

2.
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)

Please write some comments atop ReorderBufferStreamCommit.

3.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
..
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+
+			if (subtxn->base_snapshot != NULL &&
+				(txn->base_snapshot == NULL ||
+				 txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
+			{
+				txn->base_snapshot = subtxn->base_snapshot;

The logic here seems to be correct, but I am not sure why it doesn't
purge the base snapshot before assigning the subtxn's snapshot and,
similarly, why we don't purge the subtxn's snapshot once we
are done with it. I think we can use
ReorderBufferTransferSnapToParent to replace part of the logic here.
Do you see any reason for doing things differently here?

4. In ReorderBufferStreamTXN, why do you need to use
ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now.

5. I see a lot of code similarity in ReorderBufferStreamTXN and
existing ReorderBufferCommit. I understand that there are some subtle
differences due to which we need to write this new function but can't
we encapsulate the specific parts of code in functions and then call
from both places. I am talking about code in different cases for
change->action.

6. + * Note: We never stream and serialize a transaction at the same time (e
/(e/(we

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#152Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#147)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have rebased the patch set on the latest head.

0001 looks like a clever approach, but are you sure it doesn't hurt
performance when many small XLOG records are being inserted? I think
XLogRecordAssemble() can get pretty hot in some workloads.

With regard to 0002, logging a separate WAL record for each
invalidation seems painful; I think most operations that generate
invalidations generate a bunch of them all at once. Perhaps you could
just queue up invalidations as they happen, and then force anything
that's been queued up to be emitted into WAL just before you emit any
WAL record that might need to be decoded.

Regarding 0005, it seems to me that this is no good:

+ errmsg("improper heap_getnext call")));

I think we should be using elog() rather than ereport() here, because
this should only happen if there's a bug in a logical decoding plugin.
At first, I thought maybe this should just be an Assert(), but since
there are third-party logical decoding plugins available, checking
this even in non-assert builds seems like a good idea. However, I
think making it translatable is overkill; users should never see this,
only developers.

I also think that the message is really bad, because it just tells you
that you did something bad. It gives no inkling as to why it was bad.

0006 contains lots of XXX comments that look like real issues. I guess
those need to be fixed. Also, why don't we do the thing that the
commit message for 0006 says we could "theoretically" do? I don't
understand why we need the k-way merge at all.

+ if (prev_lsn != InvalidXLogRecPtr)
+ Assert(prev_lsn <= change->lsn);

There is no reason to ever write an if statement that contains only an
Assert, and it's bad style. Write Assert(prev_lsn == InvalidXLogRecPtr
|| prev_lsn <= change->lsn), or better yet, use XLogRecPtrIsInvalid.
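
That is, roughly:

	Assert(XLogRecPtrIsInvalid(prev_lsn) || prev_lsn <= change->lsn);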

The purpose and mechanism of the is_schema_sent flag is not clear to
me. The word "schema" here seems to be being used to mean "snapshot,"
which is rather confusing.

I'm also somewhat unclear on what's happening here with invalidations.
Perhaps that's as much a defect in my understanding as it is
reflective of any problem with the patch, but I also don't see any
comments either in 0002 or later patches explaining the theory of
operation. If I've missed some, please point me in the right
direction. Hypothetically speaking, it seems to me that if you just
did InvalidateSystemCaches() every time the snapshot changed, you
wouldn't need anything else (unless we're concerned with
non-transactional invalidation messages like smgr and relmapper
invalidations; not quite sure how those are handled). And, on the
other hand, if we don't do InvalidateSystemCaches() every time the
snapshot changes, then I don't understand why this works now, even
without streaming.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#153Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#151)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have reviewed the patch set and here are a few comments/questions.

1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

Should we show the tuple in the streamed change, like we do for
pg_decode_change?

I think so. The patch shows the message in
pg_decode_stream_message(), so why prohibit showing the tuple here?

2. pg_logical_slot_get_changes_guts
It recreates the decoding slot [ctx =
CreateDecodingContext(InvalidXLogRecPtr)] but doesn't set streaming
to false. Should we pass a parameter to
pg_logical_slot_get_changes_guts saying whether we want streamed results or not?

CreateDecodingContext internally calls StartupDecodingContext which
sets the value of streaming based on if the plugin has provided
callbacks for streaming functions. Isn't that sufficient? Why do we
need additional parameters here?

I don't think we should stream just because the plugin provides
streaming functions. For example, the pgoutput plugin provides streaming
functions, but we only stream if streaming is enabled in the CREATE
SUBSCRIPTION command. So I feel that should be true for any plugin.

3.
+ XLogRecPtr prev_lsn = InvalidXLogRecPtr;
ReorderBufferChange *change;
ReorderBufferChange *specinsert = NULL;

@@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
Relation relation = NULL;
Oid reloid;

+ /*
+ * Enforce correct ordering of changes, merged from multiple
+ * subtransactions. The changes may have the same LSN due to
+ * MULTI_INSERT xlog records.
+ */
+ if (prev_lsn != InvalidXLogRecPtr)
+ Assert(prev_lsn <= change->lsn);
+
+ prev_lsn = change->lsn;
I did not understand how this change is relevant to this patch.

This is just to ensure that changes are in LSN order. I think as we
are merging the changes before commit for streaming, it is good to
have such an Assertion for ReorderBufferStreamTXN. And, if we want
to have it in ReorderBufferStreamTXN, then there is no harm in keeping
it in ReorderBufferCommit() at least to keep the code consistent. Do
you see any problem with this?

I am fine with this.

4.
+ /*
+ * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+ * information about subtransactions, which could arrive after streaming start.
+ */
+ if (!txn->is_schema_sent)
+ snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+ txn, command_id);

In which case will txn->is_schema_sent be true? At the end of
the stream, in ReorderBufferExecuteInvalidations, we always set
it to false,
so while sending the next stream it will always be false. That means we
never require the snapshot_now variable in ReorderBufferTXN.

You are probably right, but as discussed we need to change this part
of design/code (when to send schema changes) due to the issues
discovered. So, I think this part will anyway change when we fix that
problem.

Makes sense.

5.
@@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * We read catalog changes from WAL, which are not yet sent, so
+ * invalidate current schema in order output plugin can resend
+ * schema again.
+ */
+ txn->is_schema_sent = false;

Same as point 4, during decode time it will never be true.

Sure, my previous point's reply applies here as well.

ok

6.
+ /* send fields */
+ pq_sendint64(out, commit_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);

Commit_time and end_lsn are used in standby_feedback

I don't understand what you mean by this. Can you be a bit more clear?

I think I paste it here by mistake. just ignore it.

7.
+ /* FIXME optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
We cannot roll back an intermediate subtransaction without rolling back the
latest subtransaction, so why do we need
to search the array?  It will always be the last subxact, no?

The same thing is already mentioned in the comments above this code
("XXX Or perhaps we can rely on the aborts to arrive in the reverse
order, i.e. from the inner-most subxact (when nested)? In which case
we could simply check the last element."). I think what you are
saying is probably right, but we can leave this as it is for now
because this is a minor optimization which can be done later as well
if required. However, if you see any correctness issue, then we can
discuss.

I think more than an optimization, the question here is whether this
loop is required at all. By optimizing it we are not adding
complexity; in fact it becomes simpler. I think we need more analysis
of whether we need to traverse the array at all, so maybe for the time
being we can leave this as it is.
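
For reference, the simpler form being discussed would be just this
(a sketch, relying on aborts arriving innermost-first):

	/* the aborted subxact must be the most recently assigned one */
	Assert(nsubxacts > 0 && subxacts[nsubxacts - 1].xid == subxid);
	subidx = nsubxacts - 1;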

8.
+ /*
+ * send feedback to upstream
+ *
+ * XXX Probably should send a valid LSN. But which one?
+ */
+ send_feedback(InvalidXLogRecPtr, false, false);

Why is feedback sent for every change?

I will study this part of the patch and let you know my opinion.

Sure.

Few comments on this patch series:

0001-Immediately-WAL-log-assignments:
------------------------------------------------------------

The commit message still refers to the old design for this patch. I
think you need to modify the commit message as per the latest patch.

0002-Issue-individual-invalidations-with-wal_level-log
----------------------------------------------------------------------------
1.
xact_desc_invalidations(StringInfo buf,
{
..
+ else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+ appendStringInfo(buf, " snapshot %u", msg->sn.relId);

You have removed logging for the above cache but forgot to remove its
reference from one of the places. Also, I think you need to add a
comment somewhere in inval.c to say why you are writing WAL for
some types of invalidations and not for others?

0003-Extend-the-output-plugin-API-with-stream-methods
--------------------------------------------------------------------------------
1.
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_message_cb</function> are optional.

stream_message_cb is mentioned twice. It seems the second one is for truncate.

2.
size of the transaction size and network bandwidth, the transfer time
+ may significantly increase the apply lag.

/size of the transaction size/size of the transaction

no need to mention size twice.

3.
+    Similarly to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress
transactions)
+    exceeds limit defined by <varname>logical_work_mem</varname> setting.

The guc name used is wrong. /Similarly to/Similar to/

4.
stream_start_cb_wrapper()
{
..
+ /* state.report_location = apply_lsn; */
..
+ /* FIXME ctx->write_location = apply_lsn; */
..
}

See if we can fix these, and similar ones in the callback for the stop. I
think we don't have final_lsn till we commit/abort. Can we compute it
before calling these APIs?

0005-Gracefully-handle-concurrent-aborts-of-uncommitte
----------------------------------------------------------------------------------
1.
@@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_CATCH();
 	{
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);

Spurious line change.

2. The commit message of this patch refers to Prepared transactions.
I think that needs to be changed.

0006-Implement-streaming-mode-in-ReorderBuffer
-------------------------------------------------------------------------
1.
+
+/* iterator for streaming (only get data from memory) */
+static ReorderBufferStreamIterTXNState *ReorderBufferStreamIterTXNInit(ReorderBuffer *rb,
+                                                                       ReorderBufferTXN *txn);
+
+static ReorderBufferChange *ReorderBufferStreamIterTXNNext(ReorderBuffer *rb,
+                                                           ReorderBufferStreamIterTXNState *state);
+
+static void ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
+                                             ReorderBufferStreamIterTXNState *state);

Do we really need to introduce new APIs for iterating over changes
from streamed transactions? Why can't we reuse the same APIs as we
use for committed xacts?

2.
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)

Please write some comments atop ReorderBufferStreamCommit.

3.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
..
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+
+			if (subtxn->base_snapshot != NULL &&
+				(txn->base_snapshot == NULL ||
+				 txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
+			{
+				txn->base_snapshot = subtxn->base_snapshot;

The logic here seems to be correct, but I am not sure why it doesn't
purge the base snapshot before assigning the subtxn's snapshot and,
similarly, why we don't purge the subtxn's snapshot once we
are done with it. I think we can use
ReorderBufferTransferSnapToParent to replace part of the logic here.
Do you see any reason for doing things differently here?

4. In ReorderBufferStreamTXN, why do you need to use
ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now.

5. I see a lot of code similarity in ReorderBufferStreamTXN and
existing ReorderBufferCommit. I understand that there are some subtle
differences due to which we need to write this new function but can't
we encapsulate the specific parts of code in functions and then call
from both places. I am talking about code in different cases for
change->action.

6. + * Note: We never stream and serialize a transaction at the same time (e
/(e/(we

I will look into these comments and reply separately.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#154Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#152)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have rebased the patch set on the latest head.

0001 looks like a clever approach, but are you sure it doesn't hurt
performance when many small XLOG records are being inserted? I think
XLogRecordAssemble() can get pretty hot in some workloads.

I don't think we have evaluated it yet, but we should do it. The
point to note is that it is only for the case when wal_level is
'logical' (see IsSubTransactionAssignmentPending) in which case we
already log more WAL, so this might not impact much. I guess that it
might be better to have that check in XLogRecordAssemble for the sake
of clarity.

Regarding 0005, it seems to me that this is no good:

+ errmsg("improper heap_getnext call")));

I think we should be using elog() rather than ereport() here, because
this should only happen if there's a bug in a logical decoding plugin.
At first, I thought maybe this should just be an Assert(), but since
there are third-party logical decoding plugins available, checking
this even in non-assert builds seems like a good idea. However, I
think making it translatable is overkill; users should never see this,
only developers.

makes sense. I think we should change it.

+ if (prev_lsn != InvalidXLogRecPtr)
+ Assert(prev_lsn <= change->lsn);

There is no reason to ever write an if statement that contains only an
Assert, and it's bad style. Write Assert(prev_lsn == InvalidXLogRecPtr
|| prev_lsn <= change->lsn), or better yet, use XLogRecPtrIsInvalid.

Agreed.

The purpose and mechanism of the is_schema_sent flag is not clear to
me. The word "schema" here seems to be being used to mean "snapshot,"
which is rather confusing.

I have explained this flag below along with invalidations as both are
slightly related.

I'm also somewhat unclear on what's happening here with invalidations.
Perhaps that's as much a defect in my understanding as it is
reflective of any problem with the patch, but I also don't see any
comments either in 0002 or later patches explaining the theory of
operation. If I've missed some, please point me in the right
direction. Hypothetically speaking, it seems to me that if you just
did InvalidateSystemCaches() every time the snapshot changed, you
wouldn't need anything else (unless we're concerned with
non-transactional invalidation messages like smgr and relmapper
invalidations; not quite sure how those are handled). And, on the
other hand, if we don't do InvalidateSystemCaches() every time the
snapshot changes, then I don't understand why this works now, even
without streaming.

I think the way invalidations work for logical replication is that
normally, we always start a new transaction before decoding each
commit which allows us to accept the invalidations (via
AtStart_Cache). However, if there are catalog changes within the
transaction being decoded, we need to reflect those before trying to
decode the WAL of the operation that happened after that catalog change.
As we are not logging the WAL for each invalidation, we need to
execute all the invalidation messages for this transaction at each
catalog change. We are able to do that now as we decode the entire WAL
for a transaction only once we get the commit's WAL which contains all
the invalidation messages. So, we queue them up and execute them at
each catalog change, which we identify by the WAL record
XLOG_HEAP2_NEW_CID.

The second related concept is that before sending each change to
downstream (via pgoutput), we check whether we need to send the
schema. This we decide based on the local map entry
(RelationSyncEntry) which indicates whether the schema for the
relation is already sent or not. Once the schema of the relation is
sent, the entry for that relation in the map will indicate it. At
invalidation processing time we also blow away this map, so it always
reflects the correct state.

Now, to decode an in-progress transaction, we need to ensure that we
have received the WAL for all the invalidations before decoding the
WAL of action that happened immediately after that catalog change.
This is the reason we started WAL logging individual Invalidations.
So, with this change we don't need to execute all the invalidations
for each catalog change, rather execute them as and when their WAL is
being decoded.

The current mechanism to send schema changes won't work for streaming
transactions, because after sending the change the subtransaction might
abort. On subtransaction abort, the downstream will simply discard
the changes, in which case we would lose the previous schema change sent. There
is no such problem currently because we process all the aborts before
sending any change. So, the current idea of having a schema_sent flag
in each map entry (RelationSyncEntry) won't work for streaming
transactions. To solve this problem, the patch initially kept a flag
'is_schema_sent' for each top-level transaction (in ReorderBufferTXN)
so that we can always send a schema for each (sub)transaction for
streaming transactions, but that won't work if we access multiple
relations in the same subtransaction. To solve this problem, we are
thinking of keeping a list/array of top-level xids in each
RelationSyncEntry. Basically, whenever we send the schema for any
transaction, we note that in RelationSyncEntry and at abort/commit
time we can remove xid from the list. Now, whenever, we check whether
to send schema for any operation in a transaction, we will check if
our xid is present in that list for a particular RelationSyncEntry and
take an action based on that (if xid is present, then we won't send
the schema, otherwise, send it). I think during decoding we should not
have that many open transactions, so the search in the array should be
cheap enough, but we can consider some other data structure like a hash
as well.
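
To make the idea concrete, a minimal sketch of the pgoutput side (the
streamed_txns field and the in_streaming and send_relation_and_attrs
names are hypothetical, not part of the current patch):

typedef struct RelationSyncEntry
{
	Oid			relid;			/* relation oid */
	bool		schema_sent;	/* for non-streamed transactions */
	List	   *streamed_txns;	/* hypothetical: top-level xids for which
								 * the schema has already been sent */
} RelationSyncEntry;

/* when about to send a change for 'relation' */
if (in_streaming)
	send_schema = !list_member_int(entry->streamed_txns, (int) topxid);
else
	send_schema = !entry->schema_sent;

if (send_schema)
{
	send_relation_and_attrs(relation, ctx);
	if (in_streaming)
		entry->streamed_txns = lappend_int(entry->streamed_txns,
										   (int) topxid);
	else
		entry->schema_sent = true;
}

On stream abort (and at commit) the xid would then be removed from
streamed_txns in each entry, so a later stream of the same transaction
resends the schema only when needed.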

I will think some more and respond to your remaining comments/suggestions.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#155Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#152)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have rebased the patch set on the latest head.

0001 looks like a clever approach, but are you sure it doesn't hurt
performance when many small XLOG records are being inserted? I think
XLogRecordAssemble() can get pretty hot in some workloads.

With regard to 0002, logging a separate WAL record for each
invalidation seems painful; I think most operations that generate
invalidations generate a bunch of them all at once. Perhaps you could
just queue up invalidations as they happen, and then force anything
that's been queued up to be emitted into WAL just before you emit any
WAL record that might need to be decoded.

I feel we can log the invalidations of the entire command at one go if
we log at CommandEndInvalidationMessages. We already have all the
invalidations of current command in
transInvalInfo->CurrentCmdInvalidMsgs. This can save us the effort of
maintaining a new separate list/queue for invalidations and to a good
extent, it will ameliorate your concern of logging each invalidation
separately.
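
A rough sketch of that idea, assuming the patch's
LogLogicalInvalidations() is reworked to consume
transInvalInfo->CurrentCmdInvalidMsgs (the placement of the call is the
proposal here):

void
CommandEndInvalidationMessages(void)
{
	if (transInvalInfo == NULL)
		return;

	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
								LocalExecuteInvalidationMessage);

	/* WAL-log the whole command's invalidations in one record */
	if (XLogLogicalInfoActive())
		LogLogicalInvalidations();

	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
							   &transInvalInfo->CurrentCmdInvalidMsgs);
}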

0006 contains lots of XXX comments that look like real issues. I guess
those need to be fixed. Also, why don't we do the thing that the
commit message for 0006 says we could "theoretically" do? I don't
understand why we need the k-way merge at all,

I think we can do what is written in the commit message, but then we
need to maintain two paths (one for streaming contexts and another for
non-streaming contexts), unless we want to entirely get rid of storing
subtransaction changes separately, which seems like a more fundamental
change. To some extent such duplication is there even now, but I
have already given a comment to minimize it. Having said that, I
think we can go either way. I think the original intention was to
avoid doing more stuff unless it is really required, as this is already
a big patchset, but maybe Tomas has a different idea about this.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#156Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#153)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Dec 12, 2019 at 9:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have reviewed the patch set and here are a few comments/questions.

1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

Should we show the tuple in the streamed change, like we do for
pg_decode_change?

I think so. The patch shows the message in
pg_decode_stream_message(), so why prohibit showing the tuple here?

2. pg_logical_slot_get_changes_guts
It recreates the decoding slot [ctx =
CreateDecodingContext(InvalidXLogRecPtr)] but doesn't set streaming
to false. Should we pass a parameter to
pg_logical_slot_get_changes_guts saying whether we want streamed results or not?

CreateDecodingContext internally calls StartupDecodingContext which
sets the value of streaming based on if the plugin has provided
callbacks for streaming functions. Isn't that sufficient? Why do we
need additional parameters here?

I don't think we should stream just because the plugin provides
streaming functions. For example, the pgoutput plugin provides streaming
functions, but we only stream if streaming is enabled in the CREATE
SUBSCRIPTION command. So I feel that should be true for any plugin.

How about adding a new boolean parameter (streaming) to
pg_create_logical_replication_slot()?
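
Usage would then look something like this (the fourth argument being
the hypothetical streaming flag, after slot_name, plugin and
temporary):

SELECT * FROM pg_create_logical_replication_slot('test_slot', 'test_decoding', false, true);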

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#157Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Dilip Kumar (#147)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:

I have rebased the patch on the latest head and also fix the issue of
"concurrent abort handling of the (sub)transaction." and attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set. I have added the version number so that we
can track the changes.

The patch has rotten a bit and does not apply anymore. Could you
please send a rebased version? I have moved it to next CF, waiting on
author.

I have rebased the patch set on the latest head.

Thank you for working on this.

This might have already been discussed, but I have a question about the
changes to the logical replication worker. In the current logical
replication there is a problem that the response time is doubled when
using synchronous replication, because walsenders send changes only after
commit. It's especially bad when a transaction makes a lot of
changes. So I expected this feature to reduce the response time by
sending changes even while the transaction is in progress, but it
doesn't seem to do so. The logical replication worker writes changes to
temporary files and applies these changes when the worker receives the
commit record (STREAM COMMIT). Since the worker sends the LSN of the
commit record as the flush LSN to the publisher after applying all
changes, the publisher must wait until all changes are applied on the
subscriber. Another problem would be that the worker doesn't receive
changes while applying the changes of other transactions. These things
make me think it's better to have a new worker dedicated to applying
changes, like we have the walreceiver process and the startup process.
Maybe we can have two workers (receiver and applier) per subscription.
Any thoughts?

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#158Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#155)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hello.

At Fri, 13 Dec 2019 14:46:20 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have rebased the patch set on the latest head.

0001 looks like a clever approach, but are you sure it doesn't hurt
performance when many small XLOG records are being inserted? I think
XLogRecordAssemble() can get pretty hot in some workloads.

With regard to 0002, logging a separate WAL record for each
invalidation seems painful; I think most operations that generate
invalidations generate a bunch of them all at once. Perhaps you could
just queue up invalidations as they happen, and then force anything
that's been queued up to be emitted into WAL just before you emit any
WAL record that might need to be decoded.

I feel we can log the invalidations of the entire command at one go if
we log at CommandEndInvalidationMessages. We already have all the
invalidations of current command in
transInvalInfo->CurrentCmdInvalidMsgs. This can save us the effort of
maintaining a new separate list/queue for invalidations and to a good
extent, it will ameliorate your concern of logging each invalidation
separately.

I have a question on this. Does that mean that the current logical
decoder (or reorderbuffer) may emit incorrect results if a
catalog change was made during the transaction being decoded? If so,
this is not a feature but a bug fix.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#159Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#157)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Dec 20, 2019 at 11:47 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:

I have rebased the patch on the latest head and also fix the issue of
"concurrent abort handling of the (sub)transaction." and attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set. I have added the version number so that we
can track the changes.

The patch has rotten a bit and does not apply anymore. Could you
please send a rebased version? I have moved it to next CF, waiting on
author.

I have rebased the patch set on the latest head.

Thank you for working on this.

This might have already been discussed, but I have a question about the
changes to the logical replication worker. In the current logical
replication there is a problem that the response time is doubled when
using synchronous replication, because walsenders send changes only after
commit. It's especially bad when a transaction makes a lot of
changes. So I expected this feature to reduce the response time by
sending changes even while the transaction is in progress, but it
doesn't seem to do so. The logical replication worker writes changes to
temporary files and applies these changes when the worker receives the
commit record (STREAM COMMIT). Since the worker sends the LSN of the
commit record as the flush LSN to the publisher after applying all
changes, the publisher must wait until all changes are applied on the
subscriber.

The main aim of this feature is to reduce apply lag. If we
send all the changes together, network delay can delay their apply,
whereas if most of the changes are already sent, then
we save the effort of sending the entire data at commit time.
This in itself gives us decent benefits. Sure, we can further improve
it by having separate workers (dedicated to applying the changes) as you
are suggesting, and in fact there is a patch for that as well (see the
performance results and bgworker patch at [1]), but if we try to shove
all the things in at one go, then it will be difficult to get this patch
committed (there are already enough things here, and the patch is big
enough that getting it right takes a lot of energy). So, the plan is
something like this: first we get the basic feature in, and then try to
improve it by having dedicated workers or things like that. Does this
make sense to you?

[1]: /messages/by-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17@postgrespro.ru

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#160Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#158)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Dec 20, 2019 at 2:00 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Hello.

At Fri, 13 Dec 2019 14:46:20 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have rebased the patch set on the latest head.

0001 looks like a clever approach, but are you sure it doesn't hurt
performance when many small XLOG records are being inserted? I think
XLogRecordAssemble() can get pretty hot in some workloads.

With regard to 0002, logging a separate WAL record for each
invalidation seems painful; I think most operations that generate
invalidations generate a bunch of them all at once. Perhaps you could
just queue up invalidations as they happen, and then force anything
that's been queued up to be emitted into WAL just before you emit any
WAL record that might need to be decoded.

I feel we can log the invalidations of the entire command at one go if
we log at CommandEndInvalidationMessages. We already have all the
invalidations of current command in
transInvalInfo->CurrentCmdInvalidMsgs. This can save us the effort of
maintaining a new separate list/queue for invalidations and to a good
extent, it will ameliorate your concern of logging each invalidation
separately.

I have a question on this. Does that mean that the current logical
decoder (or reorderbuffer)

What does "current" refer to here? Is it about HEAD or about the
patch? Without the patch, we decode only at commit time, and by that
time we have all invalidations (logged with the commit WAL record), so we
just execute them at each catalog change (see the actions for
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID). The patch has to
WAL-log each invalidation separately because we can decode the
intermediate changes, so we can't wait till commit. The above is just
an optimization for the patch. AFAIK, there is no correctness issue
here, but let me know if you see any.
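
For reference, the relevant piece of ReorderBufferCommit() on HEAD
does roughly this at each catalog change:

			case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
				Assert(change->data.command_id != InvalidCommandId);

				if (command_id < change->data.command_id)
				{
					command_id = change->data.command_id;

					if (!snapshot_now->copied)
						snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
															 txn, command_id);

					snapshot_now->curcid = command_id;

					TeardownHistoricSnapshot(false);
					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);

					/* replay all of the transaction's invalidations */
					ReorderBufferExecuteInvalidations(rb, txn);
				}
				break;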

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#161vignesh C
vignesh21@gmail.com
In reply to: Dilip Kumar (#147)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:

I have rebased the patch on the latest head and also fixed the issue of
"concurrent abort handling of the (sub)transaction." and attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set. I have added the version number so that we
can track the changes.

The patch has rotten a bit and does not apply anymore. Could you
please send a rebased version? I have moved it to next CF, waiting on
author.

I have rebased the patch set on the latest head.

A few comments:
The assert variable should be within #ifdef USE_ASSERT_CHECKING in patch
v2-0008-Add-support-for-streaming-to-built-in-replication.patch:
+               int64           subidx;
+               bool            found = false;
+               char            path[MAXPGPATH];
+
+               subidx = -1;
+               subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+               /* FIXME optimize the search by bsearch on sorted data */
+               for (i = nsubxacts; i > 0; i--)
+               {
+                       if (subxacts[i - 1].xid == subxid)
+                       {
+                               subidx = (i - 1);
+                               found = true;
+                               break;
+                       }
+               }
+
+               /* We should not receive aborts for unknown subtransactions. */
+               Assert(found);

Add typedefs like the ones below to typedefs.list, common across the patches:
xl_xact_invalidations, ReorderBufferStreamIterTXNEntry,
ReorderBufferStreamIterTXNState, SubXactInfo

"are written" appears twice in commit message of
v2-0002-Issue-individual-invalidations-with-wal_level-log.patch:
The individual invalidations are written are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

v2-0002-Issue-individual-invalidations-with-wal_level-log.patch patch
does not compile by itself:
reorderbuffer.c:1822:9: error: ‘ReorderBufferTXN’ has no member named
‘is_schema_sent’
+
LocalExecuteInvalidationMessage(&change->data.inval.msg);
+                                       txn->is_schema_sent = false;
+                                       break;
Should we include printing of the id here, like in the earlier cases, in
v2-0002-Issue-individual-invalidations-with-wal_level-log.patch:
+                       appendStringInfo(buf, " relcache %u", msg->rc.relId);
+               /* not expected, but print something anyway */
+               else if (msg->id == SHAREDINVALSMGR_ID)
+                       appendStringInfoString(buf, " smgr");
+               /* not expected, but print something anyway */
+               else if (msg->id == SHAREDINVALRELMAP_ID)
+                       appendStringInfo(buf, " relmap db %u", msg->rm.dbId);

There is some code duplication in stream_change_cb_wrapper,
stream_truncate_cb_wrapper, stream_message_cb_wrapper,
stream_abort_cb_wrapper, stream_commit_cb_wrapper,
stream_start_cb_wrapper and stream_stop_cb_wrapper functions in
v2-0003-Extend-the-output-plugin-API-with-stream-methods.patch patch.
Should we have a separate function for common code?

Should we add a function header for AssertChangeLsnOrder in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static void
+AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
This "Assert(txn->first_lsn != InvalidXLogRecPtr)"can be before the
loop, can be checked only once:
+       dlist_foreach(iter, &txn->changes)
+       {
+               ReorderBufferChange *cur_change;
+
+               cur_change = dlist_container(ReorderBufferChange,
node, iter.cur);
+
+               Assert(txn->first_lsn != InvalidXLogRecPtr);
+               Assert(cur_change->lsn != InvalidXLogRecPtr);
+               Assert(txn->first_lsn <= cur_change->lsn);
Should we add a function header for ReorderBufferDestroyTupleCidHash in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static void
+ReorderBufferDestroyTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+       if (txn->tuplecid_hash != NULL)
+       {
+               hash_destroy(txn->tuplecid_hash);
+               txn->tuplecid_hash = NULL;
+       }
+}
+
Should we add a function header for ReorderBufferStreamCommit in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+       /* we should only call this for previously streamed transactions */
+       Assert(rbtxn_is_streamed(txn));
+
+       ReorderBufferStreamTXN(rb, txn);
+
+       rb->stream_commit(rb, txn, txn->final_lsn);
+
+       ReorderBufferCleanupTXN(rb, txn);
+}
+
Should we add a function header for ReorderBufferCanStream in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+       LogicalDecodingContext *ctx = rb->private_data;
+
+       return ctx->streaming;
+}

patch v2-0008-Add-support-for-streaming-to-built-in-replication.patch
does not apply:
Hunk #18 FAILED at 2035.
Hunk #19 succeeded at 2199 (offset -16 lines).
1 out of 19 hunks FAILED -- saving rejects to file
src/backend/replication/logical/worker.c.rej

These header inclusions may not be required in patch
v2-0008-Add-support-for-streaming-to-built-in-replication.patch:
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

#162Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#161)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Dec 22, 2019 at 5:04 PM vignesh C <vignesh21@gmail.com> wrote:

Few comments:
The assert variable should be within #ifdef USE_ASSERT_CHECKING in patch
v2-0008-Add-support-for-streaming-to-built-in-replication.patch:
+               int64           subidx;
+               bool            found = false;
+               char            path[MAXPGPATH];
+
+               subidx = -1;
+               subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+               /* FIXME optimize the search by bsearch on sorted data */
+               for (i = nsubxacts; i > 0; i--)
+               {
+                       if (subxacts[i - 1].xid == subxid)
+                       {
+                               subidx = (i - 1);
+                               found = true;
+                               break;
+                       }
+               }
+
+               /* We should not receive aborts for unknown subtransactions. */
+               Assert(found);

We can use PG_USED_FOR_ASSERTS_ONLY for that variable.
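For example, something like this against the quoted hunk
(PG_USED_FOR_ASSERTS_ONLY is the existing macro from c.h):

		bool		found PG_USED_FOR_ASSERTS_ONLY = false;

That keeps the variable in non-assert builds but silences the
compiler's unused-variable warning, without wrapping the whole block
in #ifdef USE_ASSERT_CHECKING.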

Should we include printing of the id here, like in the earlier cases, in
v2-0002-Issue-individual-invalidations-with-wal_level-log.patch:
+                       appendStringInfo(buf, " relcache %u", msg->rc.relId);
+               /* not expected, but print something anyway */
+               else if (msg->id == SHAREDINVALSMGR_ID)
+                       appendStringInfoString(buf, " smgr");
+               /* not expected, but print something anyway */
+               else if (msg->id == SHAREDINVALRELMAP_ID)
+                       appendStringInfo(buf, " relmap db %u", msg->rm.dbId);

I am not sure this patch ever logs these invalidation types, so I am
not sure it makes sense to print more ids in the cases you are
referring to. However, if we change it to log all invalidations at
command end, as being discussed in this thread, then it might be
better to do what you are suggesting.

Should we add a function header for AssertChangeLsnOrder in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static void
+AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
This "Assert(txn->first_lsn != InvalidXLogRecPtr)"can be before the
loop, can be checked only once:
+       dlist_foreach(iter, &txn->changes)
+       {
+               ReorderBufferChange *cur_change;
+
+               cur_change = dlist_container(ReorderBufferChange,
node, iter.cur);
+
+               Assert(txn->first_lsn != InvalidXLogRecPtr);
+               Assert(cur_change->lsn != InvalidXLogRecPtr);
+               Assert(txn->first_lsn <= cur_change->lsn);

This makes sense to me. Another thing about this function: do we
really need the "ReorderBuffer *rb" parameter at all?
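Putting both suggestions together, the function might end up looking
something like this (a sketch based on the quoted hunk, not the actual
patch):

static void
AssertChangeLsnOrder(ReorderBufferTXN *txn)
{
#ifdef USE_ASSERT_CHECKING
	dlist_iter	iter;

	/* hoisted out of the loop; it does not depend on the iteration */
	Assert(txn->first_lsn != InvalidXLogRecPtr);

	dlist_foreach(iter, &txn->changes)
	{
		ReorderBufferChange *cur_change;

		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);

		Assert(cur_change->lsn != InvalidXLogRecPtr);
		Assert(txn->first_lsn <= cur_change->lsn);
	}
#endif
}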

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#163Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#154)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Dec 12, 2019 at 3:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I don't think we have evaluated it yet, but we should do it. The
point to note is that it is only for the case when wal_level is
'logical' (see IsSubTransactionAssignmentPending) in which case we
already log more WAL, so this might not impact much. I guess that it
might be better to have that check in XLogRecordAssemble for the sake
of clarity.

I don't think that this is really a valid argument. Just because we
have some overhead now doesn't mean that adding more won't hurt. Even
testing the wal_level costs a little something.

I think the way invalidations work for logical replication is that
normally, we always start a new transaction before decoding each
commit which allows us to accept the invalidations (via
AtStart_Cache). However, if there are catalog changes within the
transaction being decoded, we need to reflect those before trying to
decode the WAL of operation which happened after that catalog change.
As we are not logging the WAL for each invalidation, we need to
execute all the invalidation messages for this transaction at each
catalog change. We are able to do that now as we decode the entire WAL
for a transaction only once we get the commit's WAL which contains all
the invalidation messages. So, we queue them up and execute them for
each catalog change which we identify by WAL record
XLOG_HEAP2_NEW_CID.

Thanks for the explanation. That makes sense. But, it's still true,
AFAICS, that instead of doing this stuff with logging invalidations
you could just InvalidateSystemCaches() in the cases where you are
currently applying all of the transaction's invalidations. That
approach might be worse than changing the way invalidations are
logged, but the two approaches deserve to be compared. One approach
has more CPU overhead and the other has more WAL overhead, so it's a
little hard to compare them, but it seems worth mulling over.
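For the record, the CPU-heavy alternative is essentially a one-liner
in the replay path (an illustrative sketch only; InvalidateSystemCaches()
is the existing function from inval.c, and has_catalog_changes is the
existing ReorderBufferTXN flag):

	/*
	 * Instead of replaying the transaction's own invalidation
	 * messages at each catalog change, drop all caches.  No extra
	 * WAL is needed, but every cache has to be rebuilt afterwards.
	 */
	if (txn->has_catalog_changes)
		InvalidateSystemCaches();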

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#164Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Amit Kapila (#159)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Dec 20, 2019 at 11:47 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:

I have rebased the patch on the latest head and also fixed the issue of
"concurrent abort handling of the (sub)transaction." and attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set. I have added the version number so that we
can track the changes.

The patch has rotten a bit and does not apply anymore. Could you
please send a rebased version? I have moved it to next CF, waiting on
author.

I have rebased the patch set on the latest head.

Thank you for working on this.

This might have already been discussed, but I have a question about
the changes to the logical replication worker. In the current logical
replication there is a problem that the response time is doubled when
using synchronous replication, because wal senders send changes only
after commit. It's especially bad when a transaction makes a lot of
changes. So I expected this feature to reduce the response time by
sending changes even while the transaction is in progress, but it
doesn't seem to. The logical replication worker writes changes to
temporary files and applies them only when it receives the commit
record (STREAM COMMIT). Since the worker sends the LSN of the commit
record as the flush LSN to the publisher after applying all changes,
the publisher must wait until all changes are applied to the
subscriber.

The main aim of this feature is to reduce apply lag. If we send all
the changes together at commit, their apply can be delayed by the
network transfer, whereas if most of the changes are already sent, we
save the effort of shipping the entire data at commit time. This in
itself gives us decent benefits. Sure, we can further improve it by
having separate workers (dedicated to applying the changes) as you
are suggesting, and in fact there is a patch for that as well (see the
performance results and bgworker patch at [1]), but if we try to shove
all the things in at one go, it will be difficult to get this patch
committed (there are already enough things in it, and the patch is big
enough that getting it right takes a lot of energy). So the plan is
something like this: first we get the basic feature in, and then we
try to improve it by having dedicated workers or things like that.
Does this make sense to you?

Thank you for the explanation. The plan makes sense. But I think it's
a problem in the current design that the logical replication worker
doesn't receive changes (and doesn't check interrupts) while applying
committed changes, even if we don't have a worker dedicated to
applying. I think the worker should continue to receive changes and
save them to temporary files even while applying changes. Otherwise
the buffer could easily fill up and replication would get stuck.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#165Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#164)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Dec 24, 2019 at 11:17 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit.kapila16@gmail.com> wrote:

The main aim of this feature is to reduce apply lag. If we send all
the changes together at commit, their apply can be delayed by the
network transfer, whereas if most of the changes are already sent, we
save the effort of shipping the entire data at commit time. This in
itself gives us decent benefits. Sure, we can further improve it by
having separate workers (dedicated to applying the changes) as you
are suggesting, and in fact there is a patch for that as well (see the
performance results and bgworker patch at [1]), but if we try to shove
all the things in at one go, it will be difficult to get this patch
committed (there are already enough things in it, and the patch is big
enough that getting it right takes a lot of energy). So the plan is
something like this: first we get the basic feature in, and then we
try to improve it by having dedicated workers or things like that.
Does this make sense to you?

Thank you for the explanation. The plan makes sense. But I think it's
a problem in the current design that the logical replication worker
doesn't receive changes (and doesn't check interrupts) while applying
committed changes, even if we don't have a worker dedicated to
applying. I think the worker should continue to receive changes and
save them to temporary files even while applying changes.

Won't it defeat the purpose of this feature, which is to reduce the
apply lag? Basically, it can happen that while applying a commit, the
worker constantly gets changes of other transactions, which will delay
the apply of the current transaction. Also, won't it create further
work to identify the order of commits? Say while applying commit-1, it
receives 5 other commits that are written to separate temporary files.
How will we later identify which transaction's WAL we need to apply
first? We might deduce it from LSNs, but I think that could be tricky.
Another thing is that I think it could lead to some design
complications as well, because while applying a commit you need some
sort of callback or something like that to receive and flush totally
unrelated changes. It could lead to another kind of failure mode,
wherein while applying a commit the worker tries to receive another
transaction's data and some failure happens while writing that data. I
am not sure it is a good idea to try something like that.

Otherwise
the buffer could easily fill up and replication would get stuck.

Are you talking about the network buffer? I think the best way, as
discussed, is to launch new workers for streamed transactions, but we
can do that as an additional feature. Anyway, as proposed, users can
choose the streaming mode per subscription, so there is an option to
turn this on selectively.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#166Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Amit Kapila (#165)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, 24 Dec 2019 at 17:21, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 24, 2019 at 11:17 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit.kapila16@gmail.com> wrote:

The main aim of this feature is to reduce apply lag. If we send all
the changes together at commit, their apply can be delayed by the
network transfer, whereas if most of the changes are already sent, we
save the effort of shipping the entire data at commit time. This in
itself gives us decent benefits. Sure, we can further improve it by
having separate workers (dedicated to applying the changes) as you
are suggesting, and in fact there is a patch for that as well (see the
performance results and bgworker patch at [1]), but if we try to shove
all the things in at one go, it will be difficult to get this patch
committed (there are already enough things in it, and the patch is big
enough that getting it right takes a lot of energy). So the plan is
something like this: first we get the basic feature in, and then we
try to improve it by having dedicated workers or things like that.
Does this make sense to you?

Thank you for the explanation. The plan makes sense. But I think it's
a problem in the current design that the logical replication worker
doesn't receive changes (and doesn't check interrupts) while applying
committed changes, even if we don't have a worker dedicated to
applying. I think the worker should continue to receive changes and
save them to temporary files even while applying changes.

Won't it defeat the purpose of this feature, which is to reduce the
apply lag? Basically, it can happen that while applying a commit, the
worker constantly gets changes of other transactions, which will delay
the apply of the current transaction.

You're right. But it seems to me that it optimizes the apply lag only
for the transaction that made many changes. On the other hand, while
such a transaction is being applied, the apply of subsequent changes
is delayed.

Also, won't it create further
work to identify the order of commits? Say while applying commit-1, it
receives 5 other commits that are written to separate temporary files.
How will we later identify which transaction's WAL we need to apply
first? We might deduce it from LSNs, but I think that could be tricky.
Another thing is that I think it could lead to some design
complications as well, because while applying a commit you need some
sort of callback or something like that to receive and flush totally
unrelated changes. It could lead to another kind of failure mode,
wherein while applying a commit the worker tries to receive another
transaction's data and some failure happens while writing that data. I
am not sure it is a good idea to try something like that.

It's just an idea, but we might want to have new workers dedicated to
applying changes first, and then add the streaming option later. That
way we can reduce the flush lag depending on the use case. The commit
order can be determined by the receiver and shared with the applier
via shared memory. Once we have separated the workers, the streaming
option can be introduced without such a downside.

Otherwise
the buffer could easily fill up and replication would get stuck.

Are you talking about the network buffer?

Yes.

I think the best way, as
discussed, is to launch new workers for streamed transactions, but we
can do that as an additional feature. Anyway, as proposed, users can
choose the streaming mode per subscription, so there is an option to
turn this on selectively.

Yes. But a user who wants to use this feature would want to replicate
many changes, and I guess the side effect is quite big. I think that
at least we need to make logical replication tolerate such a
situation.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#167Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dilip Kumar (#150)
17 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Dec 10, 2019 at 10:23:19AM +0530, Dilip Kumar wrote:

On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:

I have rebased the patch on the latest head and also fixed the issue of
"concurrent abort handling of the (sub)transaction." and attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set. I have added the version number so that we
can track the changes.

The patch has rotten a bit and does not apply anymore. Could you
please send a rebased version? I have moved it to next CF, waiting on
author.

I have rebased the patch set on the latest head.

Apart from this, there is one issue reported by my colleague Vignesh.
The issue is that if we use more than two relations in a transaction,
then there is an error on the standby (no relation map entry for
remote relation ID 16390). After analyzing it, I have found that for a
streaming transaction an "is_schema_sent" flag is kept in
ReorderBufferTXN. I think that is done so that we can send the schema
for each transaction stream, so that if any subtransaction gets
aborted we don't lose the logical WAL for that schema. But this
solution has introduced a very basic issue: if a transaction operates
on more than one relation, then after sending the schema for the first
relation it will mark the flag true, and the schema for the subsequent
relations will never be sent.

How about keeping a list of top-level xids in each RelationSyncEntry?
Basically, whenever we send the schema for any transaction, we note
that in the RelationSyncEntry, and at abort time we remove the xid
from the list. Then, whenever we check whether to send the schema for
any operation in a transaction, we check whether our xid is present in
that list for the particular RelationSyncEntry and act accordingly (if
the xid is present, we don't send the schema; otherwise, we send it).

The idea makes sense to me. I will try to write a patch for this and test it.

Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it
needs to be in the RelationSyncEntry. In fact, I already have code for
that in my private repository - I thought the patches I sent here do
include this, but apparently I forgot to include this bit :-(

Attached is a rebased patch series, fixing this. It's essentially v2
with a couple of patches (0003, 0008, 0009 and 0012) replacing the
is_schema_sent with correct handling.

0003 - removes an is_schema_sent reference added prematurely (it's added
by a later patch, causing compile failure)

0008 - adds the is_schema_sent back (essentially reverting 0003)

0009 - removes is_schema_sent entirely

0012 - adds the correct handling of schema flags in pgoutput

I don't know what other changes you've made since v2, so this way it
should be possible to just take 0003, 0008, 0009 and 0012 and slip them
in with minimal hassle.

FWIW thanks to everyone (and Amit and Dilip in particular) working on
this patch series. There's been a lot of great reviews and improvements
since I abandoned this thread for a while. I expect to be able to spend
more time working on this in January.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Immediately-WAL-log-assignments-v3.patch (text/plain; charset=us-ascii)
From ae907baaaf7401a4ac906de6127f4318241ac3a5 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 09:53:08 +0530
Subject: [PATCH 01/17] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So instead we write the assignment info into WAL immediately, as
part of the next WAL record (to minimize overhead).
---
 src/backend/access/transam/xact.c        | 45 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 39 ++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5353b6ab0b..708e5233f4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -5096,6 +5097,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6002,3 +6004,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index aa9dca0036..a8a8084713 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 67418b05f1..4435c636bc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1165,6 +1165,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1203,6 +1204,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index bc532d027b..897b755eb4 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 record->toplevel_xid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 9d2899dea1..5b9740c5c3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 3fea1993bc..b1976ac653 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 0193611b7f..a676151561 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -147,6 +147,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -280,6 +282,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 9375e54195..bcfba0a101 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.21.0

0002-Issue-individual-invalidations-with-wal_level-log-v3.patch (text/plain; charset=us-ascii)
From 7b2a948ef7ca6fa43f94ccf11f11f8edfb3fe028 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH 02/17] Issue individual invalidations with wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations was accumulating all the invalidations in
memory, and then only wrote them once at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c        | 52 ++++++++++++++
 src/backend/access/transam/xact.c             |  7 ++
 src/backend/replication/logical/decode.c      | 23 +++++++
 .../replication/logical/reorderbuffer.c       | 56 +++++++++++++--
 src/backend/utils/cache/inval.c               | 69 +++++++++++++++++++
 src/include/access/xact.h                     | 18 ++++-
 src/include/replication/reorderbuffer.h       | 14 ++++
 7 files changed, 231 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 4c411c5322..6cfd6af24e 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,11 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -397,6 +402,14 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs,
+								xlrec->dbId, xlrec->tsId,
+								xlrec->relcacheInitFileInval);
+	}
 }
 
 const char *
@@ -424,7 +437,46 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval)
+{
+	int			i;
+
+	if (relcacheInitFileInval)
+		appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
+						 dbId, tsId);
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+			appendStringInfo(buf, " snapshot %u", msg->sn.relId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 708e5233f4..da15556357 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6001,6 +6001,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 897b755eb4..9bcefb6e6d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,29 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/* XXX for now we're issuing invalidations one by one */
+				Assert(invals->nmsgs == 1);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->dbId, invals->tsId,
+											 invals->relcacheInitFileInval,
+											 invals->msgs[0]);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 53affeb877..b1feff3e71 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -464,6 +464,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -1804,17 +1805,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					txn->is_schema_sent = false;
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2207,6 +2214,38 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn,
+							 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+							 SharedInvalidationMessage msg)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	Assert(xid != InvalidTransactionId);
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.dbId = dbId;
+	change->data.inval.tsId = tsId;
+	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
+	change->data.inval.msg = msg;
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2656,6 +2695,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -2752,6 +2792,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -3027,6 +3068,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index f09e3a9aff..0682c55b51 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -104,6 +104,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +211,9 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -489,6 +493,18 @@ RegisterCatcacheInvalidation(int cacheId,
 {
 	AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   cacheId, hashValue, dbId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cc.id = (int8) cacheId;
+		msg.cc.dbId = dbId;
+		msg.cc.hashValue = hashValue;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -501,6 +517,18 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 {
 	AddCatalogInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								  dbId, catId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cat.id = SHAREDINVALCATALOG_ID;
+		msg.cat.dbId = dbId;
+		msg.cat.catId = catId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -511,6 +539,8 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 static void
 RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 {
+	bool		RelcacheInitFileInval = false;
+
 	AddRelcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
 
@@ -529,7 +559,22 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 	 * as well.  Also zap when we are invalidating whole relcache.
 	 */
 	if (relId == InvalidOid || RelationIdIsInInitFile(relId))
+	{
 		transInvalInfo->RelcacheInitFileInval = true;
+		RelcacheInitFileInval = true;
+	}
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.rc.id = SHAREDINVALRELCACHE_ID;
+		msg.rc.dbId = dbId;
+		msg.rc.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, RelcacheInitFileInval);
+	}
 }
 
 /*
@@ -1501,3 +1546,27 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval)
+{
+	xl_xact_invalidations xlrec;
+
+	/* prepare record */
+	memset(&xlrec, 0, sizeof(xlrec));
+	xlrec.dbId = MyDatabaseId;
+	xlrec.tsId = MyDatabaseTableSpace;
+	xlrec.relcacheInitFileInval = relcacheInitFileInval;
+	xlrec.nmsgs = nmsgs;
+
+	/* perform insertion */
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+	XLogRegisterData((char *) msgs,
+					 nmsgs * sizeof(SharedInvalidationMessage));
+	XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b9740c5c3..82d49428c2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,22 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ *
+ * XXX Currently nmsgs=1 but that might change in the future.
+ */
+typedef struct xl_xact_invalidations
+{
+	Oid			dbId;			/* MyDatabaseId */
+	Oid			tsId;			/* MyDatabaseTableSpace */
+	bool		relcacheInitFileInval;	/* invalidate relcache init file */
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0867ee9e63..6a7187bbec 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,16 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			Oid			dbId;	/* MyDatabaseId */
+			Oid			tsId;	/* MyDatabaseTableSpace */
+			bool		relcacheInitFileInval;	/* invalidate relcache init
+												 * file */
+			SharedInvalidationMessage msg;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -448,6 +459,9 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  Oid dbId, Oid tsId, bool relcacheInitFileInval,
+								  SharedInvalidationMessage msg);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.21.0

0003-fixup-is_schema_sent-set-too-early-v3.patch (text/plain; charset=us-ascii)
From 473be553d5d798eac72700a5352d099295344cd6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 27 Dec 2019 22:50:55 +0100
Subject: [PATCH 03/17] fixup: is_schema_sent set too early

---
 src/backend/replication/logical/reorderbuffer.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b1feff3e71..c0b97251e2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1819,7 +1819,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * about the message itself?
 					 */
 					LocalExecuteInvalidationMessage(&change->data.inval.msg);
-					txn->is_schema_sent = false;
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-- 
2.21.0

0004-Extend-the-output-plugin-API-with-stream-methods-v3.patch (text/plain; charset=us-ascii)
From c6f6ee18707a3dea988b98c6c5351d816a67be70 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH 04/17] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 +++++
 src/include/replication/reorderbuffer.h   |  57 ++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 6c33c4bded..9c77791dd5 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -542,3 +571,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db968641e..fc4ad65eae 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and the network bandwidth, the transfer time
+    may significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similarly to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_work_mem</varname>
+    setting. At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 7e06615864..b88b58505a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require change/commit/abort callbacks. The
+	 * message callback is optional, similarly to regular output plugins. We
+	 * do, however, enable streaming when at least one of the methods is
+	 * set, so that missing methods can be identified easily.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -863,6 +911,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	/* FIXME ctx->write_location = apply_lsn; */
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	/* FIXME ctx->write_location = apply_lsn; */
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 6879a2e6d2..1e934d25e6 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index d4ce54f26d..a30546250a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if the transaction is streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from an in-progress
+ * transaction to a remote node (may be called repeatedly, if the
+ * transaction is streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6a7187bbec..5b4be2bf88 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -345,6 +345,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -383,6 +429,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.21.0

0005-Cleaning-up-of-flags-in-ReorderBufferTXN-structur-v3.patchtext/plain; charset=us-asciiDownload
From 8452286c8256ba59163f98309c7c59cefbe64845 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 18:08:37 +0200
Subject: [PATCH 05/17] Cleaning up of flags in ReorderBufferTXN structure

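In short, the patch folds the assorted boolean fields of
ReorderBufferTXN into a single txn_flags word queried through rbtxn_*
macros. A minimal standalone sketch of the pattern (the flag names
mirror the patch; the struct and accessor are trimmed-down demo
versions, not the patch code itself):

    #include <stdbool.h>

    #define RBTXN_HAS_CATALOG_CHANGES 0x0001
    #define RBTXN_IS_SUBXACT          0x0002
    #define RBTXN_IS_SERIALIZED       0x0004

    typedef struct DemoTXN
    {
        int         txn_flags;      /* bitmask of RBTXN_* flags */
    } DemoTXN;

    /* Parenthesize the argument so any expression can be passed. */
    #define demo_is_serialized(txn) \
        (((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0)

    static void
    mark_serialized(DemoTXN *txn)
    {
        txn->txn_flags |= RBTXN_IS_SERIALIZED;
    }

    static bool
    needs_restore_from_disk(DemoTXN *txn)
    {
        return demo_is_serialized(txn);
    }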
---
 .../replication/logical/reorderbuffer.c       | 36 +++++++++----------
 src/include/replication/reorderbuffer.h       | 33 ++++++++++-------
 2 files changed, 38 insertions(+), 31 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c0b97251e2..f74c1996d0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -732,7 +732,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_first_lsn < cur_txn->first_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_first_lsn = cur_txn->first_lsn;
 	}
@@ -752,7 +752,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
 	}
@@ -775,7 +775,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
 
 	txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
 
-	Assert(!txn->is_known_as_subxact);
+	Assert(!rbtxn_is_known_subxact(txn));
 	Assert(txn->first_lsn != InvalidXLogRecPtr);
 	return txn;
 }
@@ -835,7 +835,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 
 	if (!new_sub)
 	{
-		if (subtxn->is_known_as_subxact)
+		if (rbtxn_is_known_subxact(subtxn))
 		{
 			/* already associated, nothing to do */
 			return;
@@ -851,7 +851,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 		}
 	}
 
-	subtxn->is_known_as_subxact = true;
+	subtxn->txn_flags |= RBTXN_IS_SUBXACT;
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
@@ -1061,7 +1061,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	{
 		ReorderBufferChange *cur_change;
 
-		if (txn->serialized)
+		if (rbtxn_is_serialized(txn))
 		{
 			/* serialize remaining changes */
 			ReorderBufferSerializeTXN(rb, txn);
@@ -1090,7 +1090,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		{
 			ReorderBufferChange *cur_change;
 
-			if (cur_txn->serialized)
+			if (rbtxn_is_serialized(cur_txn))
 			{
 				/* serialize remaining changes */
 				ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1256,7 +1256,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * they originally were happening inside another subtxn, so we won't
 		 * ever recurse more than one level deep here.
 		 */
-		Assert(subtxn->is_known_as_subxact);
+		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
 		ReorderBufferCleanupTXN(rb, subtxn);
@@ -1304,7 +1304,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/*
 	 * Remove TXN from its containing list.
 	 *
-	 * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
 	 * from the LSN-ordered list of toplevel TXNs.
@@ -1319,7 +1319,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	Assert(found);
 
 	/* remove entries spilled to disk */
-	if (txn->serialized)
+	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
 	/* deallocate */
@@ -1336,7 +1336,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
 		return;
 
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1969,7 +1969,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 			 * final_lsn to that of their last change; this causes
 			 * ReorderBufferRestoreCleanup to do the right thing.
 			 */
-			if (txn->serialized && txn->final_lsn == 0)
+			if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
 			{
 				ReorderBufferChange *last =
 				dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2117,7 +2117,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
 	 * operate on its top-level transaction instead.
 	 */
 	txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 	Assert(txn->base_snapshot == NULL);
@@ -2296,7 +2296,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->has_catalog_changes = true;
+	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2313,7 +2313,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
 	if (txn == NULL)
 		return false;
 
-	return txn->has_catalog_changes;
+	return rbtxn_has_catalog_changes(txn);
 }
 
 /*
@@ -2333,7 +2333,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
 		return false;
 
 	/* a known subtxn? operate on top-level txn instead */
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 
@@ -2521,12 +2521,12 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	rb->spillBytes += size;
 
 	/* Don't consider already serialized transaction. */
-	rb->spillTxns += txn->serialized ? 0 : 1;
+	rb->spillTxns += rbtxn_is_serialized(txn) ? 0 : 1;
 
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
-	txn->serialized = true;
+	txn->txn_flags |= RBTXN_IS_SERIALIZED;
 
 	if (fd != -1)
 		CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5b4be2bf88..19c7bac8c0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -169,18 +169,34 @@ typedef struct ReorderBufferChange
 	dlist_node	node;
 } ReorderBufferChange;
 
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT          0x0002
+#define RBTXN_IS_SERIALIZED       0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES) != 0)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn)    (((txn)->txn_flags & RBTXN_IS_SUBXACT) != 0)
+/*
+ * Has this transaction been spilled to disk?  It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn)       (((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0)
+
 typedef struct ReorderBufferTXN
 {
+	int     txn_flags;
+
 	/*
 	 * The transactions transaction id, can be a toplevel or sub xid.
 	 */
 	TransactionId xid;
 
-	/* did the TX have catalog changes */
-	bool		has_catalog_changes;
-
 	/* Do we know this is a subxact?  Xid of top-level txn if so */
-	bool		is_known_as_subxact;
 	TransactionId toplevel_xid;
 
 	/*
@@ -248,15 +264,6 @@ typedef struct ReorderBufferTXN
 	 */
 	uint64		nentries_mem;
 
-	/*
-	 * Has this transaction been spilled to disk?  It's not always possible to
-	 * deduce that fact by comparing nentries with nentries_mem, because e.g.
-	 * subtransactions of a large transaction might get serialized together
-	 * with the parent - if they're restored to memory they'd have
-	 * nentries_mem == nentries.
-	 */
-	bool		serialized;
-
 	/*
 	 * List of ReorderBufferChange structs, including new Snapshots and new
 	 * CommandIds
-- 
2.21.0

0006-Gracefully-handle-concurrent-aborts-of-uncommitte-v3.patchtext/plain; charset=us-asciiDownload
From 3656702c911785df27422ecd177e506723b0e33f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:32:34 +0530
Subject: [PATCH 06/17] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not on COMMIT PREPARED as before) - this
may cause failures when the output plugin consults catalogs (both
system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of this sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
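
On the decoding side, that error code is handled with the usual
PG_TRY/PG_CATCH idiom. A condensed sketch of the pattern (illustrative
only; it assumes oldcontext holds the memory context saved before
entering PG_TRY, and the concrete handling lands in the reorderbuffer
code in later patches of this series):

    PG_TRY();
    {
        /* ... decode changes of the possibly-aborting transaction ... */
    }
    PG_CATCH();
    {
        ErrorData  *errdata;

        /* leave ErrorContext before copying the error data */
        MemoryContextSwitchTo(oldcontext);
        errdata = CopyErrorData();

        if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
        {
            /* concurrent abort detected: clean up and return quietly */
            FlushErrorState();
            FreeErrorData(errdata);
        }
        else
            PG_RE_THROW();
    }
    PG_END_TRY();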
---
 doc/src/sgml/logicaldecoding.sgml             |  5 +-
 src/backend/access/heap/heapam.c              | 51 +++++++++++++++++++
 src/backend/access/index/genam.c              | 34 +++++++++++++
 .../replication/logical/reorderbuffer.c       |  9 ++--
 src/backend/utils/time/snapmgr.c              | 25 ++++++++-
 src/include/utils/snapmgr.h                   |  4 +-
 6 files changed, 120 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index fc4ad65eae..da6a6f3233 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb34ef..2a60a7380a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1303,6 +1303,17 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_getnext call")));
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1421,6 +1432,16 @@ heap_fetch(Relation relation,
 	OffsetNumber offnum;
 	bool		valid;
 
+	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_fetch call")));
+
 	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
@@ -1535,6 +1556,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_hot_search_buffer call")));
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1682,6 +1713,16 @@ heap_get_latest_tid(TableScanDesc sscan,
 	 */
 	Assert(ItemPointerIsValid(tid));
 
+	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_get_latest_tid call")));
+
 	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
@@ -5481,6 +5522,16 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_finish_speculative call")));
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..201acfbbf2 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,17 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +525,17 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +662,17 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f74c1996d0..76d2701233 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -683,7 +683,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1533,7 +1533,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1784,7 +1784,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1804,7 +1804,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
@@ -1876,6 +1876,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_CATCH();
 	{
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
 
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 47b0517596..9fa1e43347 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,13 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check whether it is uncommitted and track
+ * it in CheckXidAlive.  This allows the XID status to be re-checked during
+ * catalog access.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the transaction is not committed yet. We
+	 * don't check here whether the xid aborted; that happens during
+	 * catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 67b07df48c..9a8f9ceba3 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
2.21.0

0007-Implement-streaming-mode-in-ReorderBuffer-v3.patchtext/plain; charset=us-asciiDownload
From 4dfb923fb4eea66cffb58de3175ae44fa56350b3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:42:31 +0530
Subject: [PATCH 07/17] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
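
In outline, the streaming path looks roughly like this (heavily
condensed; the function names match the patch, but the relation lookup
and the per-change-type dispatch are elided):

    /* Simplified outline of ReorderBufferStreamTXN(), not the literal code. */
    static void
    stream_txn_outline(ReorderBuffer *rb, ReorderBufferTXN *txn)
    {
        ReorderBufferStreamIterTXNState *state;
        ReorderBufferChange *change;

        rb->stream_start(rb, txn);  /* open a block of changes */

        /* k-way merge over the in-memory changes of txn and its subxacts */
        state = ReorderBufferStreamIterTXNInit(rb, txn);
        while ((change = ReorderBufferStreamIterTXNNext(rb, state)) != NULL)
        {
            /* dispatch to rb->stream_change / rb->stream_message / ... */
        }
        ReorderBufferStreamIterTXNFinish(rb, state);

        rb->stream_stop(rb, txn);   /* close the block */

        /* discard the changes just streamed from memory */
        ReorderBufferTruncateTXN(rb, txn);
    }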
---
 src/backend/access/heap/heapam_visibility.c   |   38 +-
 .../replication/logical/reorderbuffer.c       | 1075 ++++++++++++++++-
 src/include/replication/reorderbuffer.h       |   32 +
 3 files changed, 1112 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 3e3646716f..cf10dd041d 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 76d2701233..f02c47238a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -149,6 +149,28 @@ typedef struct ReorderBufferIterTXNState
 	ReorderBufferIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
 } ReorderBufferIterTXNState;
 
+/*
+ * k-way in-order change iteration support structures
+ *
+ * This is a simplified version for streaming, which does not require
+ * serialization to files and only reads changes that are currently in
+ * memory.
+ */
+typedef struct ReorderBufferStreamIterTXNEntry
+{
+	XLogRecPtr	lsn;
+	ReorderBufferChange *change;
+	ReorderBufferTXN *txn;
+}			ReorderBufferStreamIterTXNEntry;
+
+typedef struct ReorderBufferStreamIterTXNState
+{
+	binaryheap *heap;
+	Size		nr_txns;
+	dlist_head	old_change;
+	ReorderBufferStreamIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
+}			ReorderBufferStreamIterTXNState;
+
 /* toast datastructures */
 typedef struct ReorderBufferToastEnt
 {
@@ -213,6 +235,20 @@ static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
 static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
 
+
+/* iterator for streaming (only get data from memory) */
+static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit(
+																		ReorderBuffer *rb,
+																		ReorderBufferTXN *txn);
+
+static ReorderBufferChange *ReorderBufferStreamIterTXNNext(
+							   ReorderBuffer *rb,
+							   ReorderBufferStreamIterTXNState * state);
+
+static void ReorderBufferStreamIterTXNFinish(
+								 ReorderBuffer *rb,
+								 ReorderBufferStreamIterTXNState * state);
+
 /*
  * ---------------------------------------
  * Disk serialization support functions
@@ -227,6 +263,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -235,6 +272,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -362,6 +408,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -759,6 +808,33 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+static void
+AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -855,6 +931,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -978,7 +1057,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1006,6 +1085,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	cur_txn_i;
 	int32		off;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(rb, txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1020,6 +1102,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(rb, cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1234,6 +1319,210 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 	pfree(state);
 }
 
+/*
+ * Binary heap comparison function (streaming iterator).
+ */
+static int
+ReorderBufferStreamIterCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferStreamIterTXNState *state = (ReorderBufferStreamIterTXNState *) arg;
+	XLogRecPtr	pos_a = state->entries[DatumGetInt32(a)].lsn;
+	XLogRecPtr	pos_b = state->entries[DatumGetInt32(b)].lsn;
+
+	if (pos_a < pos_b)
+		return 1;
+	else if (pos_a == pos_b)
+		return 0;
+	return -1;
+}
+
+/*
+ * Allocate & initialize an iterator which iterates in lsn order over a
+ * transaction and all its subtransactions. This version is meant for
+ * streaming of incomplete transactions.
+ */
+static ReorderBufferStreamIterTXNState *
+ReorderBufferStreamIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Size		nr_txns = 0;
+	ReorderBufferStreamIterTXNState *state;
+	dlist_iter	cur_txn_i;
+	int32		off;
+
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(rb, txn);
+
+	/*
+	 * Calculate the size of our heap: one element for every transaction that
+	 * contains changes.  (Besides the transactions already in the reorder
+	 * buffer, we count the one we were directly passed.)
+	 */
+	if (txn->nentries > 0)
+		nr_txns++;
+
+	dlist_foreach(cur_txn_i, &txn->subtxns)
+	{
+		ReorderBufferTXN *cur_txn;
+
+		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(rb, cur_txn);
+
+		if (cur_txn->nentries > 0)
+			nr_txns++;
+	}
+
+	/*
+	 * TODO: Consider adding fastpath for the rather common nr_txns=1 case, no
+	 * need to allocate/build a heap then.
+	 */
+
+	/* allocate iteration state */
+	state = (ReorderBufferStreamIterTXNState *)
+		MemoryContextAllocZero(rb->context,
+							   sizeof(ReorderBufferStreamIterTXNState) +
+							   sizeof(ReorderBufferStreamIterTXNEntry) * nr_txns);
+
+	state->nr_txns = nr_txns;
+	dlist_init(&state->old_change);
+
+	/* allocate heap */
+	state->heap = binaryheap_allocate(state->nr_txns,
+									  ReorderBufferStreamIterCompare,
+									  state);
+
+	/*
+	 * Now insert items into the binary heap, in an unordered fashion.  (We
+	 * will run a heap assembly step at the end; this is more efficient.)
+	 */
+
+	off = 0;
+
+	/* add toplevel transaction if it contains changes */
+	if (txn->nentries > 0)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_head_element(ReorderBufferChange, node,
+										&txn->changes);
+
+		state->entries[off].lsn = cur_change->lsn;
+		state->entries[off].change = cur_change;
+		state->entries[off].txn = txn;
+
+		binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+	}
+
+	/* add subtransactions if they contain changes */
+	dlist_foreach(cur_txn_i, &txn->subtxns)
+	{
+		ReorderBufferTXN *cur_txn;
+
+		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+		if (cur_txn->nentries > 0)
+		{
+			ReorderBufferChange *cur_change;
+
+			cur_change = dlist_head_element(ReorderBufferChange, node,
+											&cur_txn->changes);
+
+			state->entries[off].lsn = cur_change->lsn;
+			state->entries[off].change = cur_change;
+			state->entries[off].txn = cur_txn;
+
+			binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+		}
+	}
+
+	Assert(off == nr_txns);
+
+	/* assemble a valid binary heap */
+	binaryheap_build(state->heap);
+
+	return state;
+}
+
+/*
+ * Return the next change when iterating over a transaction and its
+ * subtransactions.
+ *
+ * Returns NULL when no further changes exist.
+ */
+static ReorderBufferChange *
+ReorderBufferStreamIterTXNNext(ReorderBuffer *rb, ReorderBufferStreamIterTXNState * state)
+{
+	ReorderBufferChange *change;
+	ReorderBufferStreamIterTXNEntry *entry;
+	int32		off;
+
+	/* nothing there anymore */
+	if (state->heap->bh_size == 0)
+		return NULL;
+
+	off = DatumGetInt32(binaryheap_first(state->heap));
+	entry = &state->entries[off];
+
+	/* free memory we might have "leaked" in the previous *Next call */
+	if (!dlist_is_empty(&state->old_change))
+	{
+		change = dlist_container(ReorderBufferChange, node,
+								 dlist_pop_head_node(&state->old_change));
+		ReorderBufferReturnChange(rb, change);
+		Assert(dlist_is_empty(&state->old_change));
+	}
+
+	change = entry->change;
+
+	/*
+	 * update heap with information about which transaction has the next
+	 * relevant change in LSN order
+	 */
+
+	/* there are in-memory changes */
+	if (dlist_has_next(&entry->txn->changes, &entry->change->node))
+	{
+		dlist_node *next = dlist_next_node(&entry->txn->changes, &change->node);
+		ReorderBufferChange *next_change =
+		dlist_container(ReorderBufferChange, node, next);
+
+		/* txn stays the same */
+		state->entries[off].lsn = next_change->lsn;
+		state->entries[off].change = next_change;
+
+		binaryheap_replace_first(state->heap, Int32GetDatum(off));
+		return change;
+	}
+
+	/* ok, no changes there anymore, remove */
+	binaryheap_remove_first(state->heap);
+
+	return change;
+}
+
+/*
+ * Deallocate the iterator
+ */
+static void
+ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
+								 ReorderBufferStreamIterTXNState * state)
+{
+	/* free memory we might have "leaked" in the last *Next call */
+	if (!dlist_is_empty(&state->old_change))
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node,
+								 dlist_pop_head_node(&state->old_change));
+		ReorderBufferReturnChange(rb, change);
+		Assert(dlist_is_empty(&state->old_change));
+	}
+
+	binaryheap_free(state->heap);
+	pfree(state);
+}
+
 /*
  * Cleanup the contents of a transaction, usually after the transaction
  * committed or aborted.
@@ -1327,33 +1616,104 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
  */
 static void
-ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	dlist_iter	iter;
-	HASHCTL		hash_ctl;
+	dlist_mutable_iter iter;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
-	hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
-	hash_ctl.hcxt = rb->context;
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
 
-	/*
-	 * create the hash with the exact number of to-be-stored tuplecids from
-	 * the start
-	 */
-	txn->tuplecid_hash =
-		hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
-					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
 
-	dlist_foreach(iter, &txn->tuplecids)
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when they
+	 * contain changes.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
+ * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples whose CID we have not decoded yet. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
+ */
+static void
+ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_iter	iter;
+	HASHCTL		hash_ctl;
+
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+
+	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
+	hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
+	hash_ctl.hcxt = rb->context;
+
+	/*
+	 * create the hash with the exact number of to-be-stored tuplecids from
+	 * the start
+	 */
+	txn->tuplecid_hash =
+		hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	dlist_foreach(iter, &txn->tuplecids)
 	{
 		ReorderBufferTupleCidKey key;
 		ReorderBufferTupleCidEnt *ent;
@@ -1403,6 +1763,16 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 }
 
+static void
+ReorderBufferDestroyTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+}
+
 /*
  * Copy a provided snapshot so we can modify it privately. This is needed so
  * that catalog modifying transactions can look into intermediate catalog
@@ -1476,6 +1846,19 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 		SnapBuildSnapDecRefcount(snap);
 }
 
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
+
+	ReorderBufferStreamTXN(rb, txn);
+
+	rb->stream_commit(rb, txn, txn->final_lsn);
+
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
 /*
  * Perform the replay of a transaction and its non-aborted subtransactions.
  *
@@ -1514,6 +1897,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
 	/*
 	 * If this transaction has no snapshot, it didn't make any changes to the
 	 * database, so there's nothing to decode.  Note that
@@ -1549,6 +1948,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
+		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
@@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			if (prev_lsn != InvalidXLogRecPtr)
+				Assert(prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1929,6 +2339,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2013,6 +2430,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2148,8 +2572,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit; the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2157,6 +2590,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2168,19 +2602,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2209,6 +2652,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2284,6 +2728,9 @@ ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	for (i = 0; i < txn->ninvalidations; i++)
 		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+
+	/* Invalidate current schema as well */
+	txn->is_schema_sent = false;
 }
 
 /*
@@ -2298,6 +2745,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * We read catalog changes from WAL that have not been sent yet, so
+	 * invalidate the current schema in order for the output plugin to
+	 * resend it.
+	 */
+	txn->is_schema_sent = false;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+	{
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		txn->toptxn->is_schema_sent = false;
+	}
 }
 
 /*
@@ -2401,6 +2865,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (when streaming, we don't update the
+ * memory accounting for subtransactions, so their size is always 0). But
+ * we can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
@@ -2421,15 +2917,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2722,6 +3249,498 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes to stream
+ * (it might have been streamed just before the commit, in which case the
+ * commit would attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+	bool		using_subtxn;
+	Size		streamed = 0;
+	ReorderBufferStreamIterTXNState *volatile iterstate = NULL;
+
+	/*
+	 * If this is a subxact, we need to stream the top-level transaction
+	 * instead.
+	 */
+	if (txn->toptxn)
+	{
+		ReorderBufferStreamTXN(rb, txn->toptxn);
+		return;
+	}
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+
+			if (subtxn->base_snapshot != NULL &&
+				(txn->base_snapshot == NULL ||
+				 txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
+			{
+				txn->base_snapshot = subtxn->base_snapshot;
+				txn->base_snapshot_lsn = subtxn->base_snapshot_lsn;
+				subtxn->base_snapshot = NULL;
+				subtxn->base_snapshot_lsn = InvalidXLogRecPtr;
+			}
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * TOCHECK: We have to rebuild the historic snapshot to be sure it includes
+		 * all information about subtransactions that may have arrived after streaming started.
+		 */
+		if (!txn->is_schema_sent)
+			snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+												 txn, command_id);
+	}
+
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
+	ReorderBufferBuildTupleCidHash(rb, txn);
+
+	/* setup the initial snapshot */
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
+
+	/*
+	 * Decoding needs access to syscaches et al., which in turn use
+	 * heavyweight locks and such. Thus we need to have enough state around to
+	 * keep track of those.  The easiest way is to simply use a transaction
+	 * internally.  That also allows us to easily enforce that nothing writes
+	 * to the database by checking for xid assignments.
+	 *
+	 * When we're called via the SQL SRF there's already a transaction
+	 * started, so start an explicit subtransaction there.
+	 */
+	using_subtxn = IsTransactionOrTransactionBlock();
+
+	PG_TRY();
+	{
+		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+		ReorderBufferChange *change;
+		ReorderBufferChange *specinsert = NULL;
+
+		if (using_subtxn)
+			BeginInternalSubTransaction("stream");
+		else
+			StartTransactionCommand();
+
+		/* start streaming this chunk of transaction */
+		rb->stream_start(rb, txn);
+
+		iterstate = ReorderBufferStreamIterTXNInit(rb, txn);
+		while ((change = ReorderBufferStreamIterTXNNext(rb, iterstate)) != NULL)
+		{
+			Relation	relation = NULL;
+			Oid			reloid;
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			if (prev_lsn != InvalidXLogRecPtr)
+				Assert(prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* we're going to stream this change */
+			streamed++;
+
+			switch (change->action)
+			{
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+
+					/*
+					 * Confirmation for speculative insertion arrived. Simply
+					 * use as a normal record. It'll be cleaned up at the end
+					 * of INSERT processing.
+					 */
+					Assert(specinsert->data.tp.oldtuple == NULL);
+					change = specinsert;
+					change->action = REORDER_BUFFER_CHANGE_INSERT;
+
+					/* intentionally fall through */
+				case REORDER_BUFFER_CHANGE_INSERT:
+				case REORDER_BUFFER_CHANGE_UPDATE:
+				case REORDER_BUFFER_CHANGE_DELETE:
+					Assert(snapshot_now);
+
+					reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
+												change->data.tp.relnode.relNode);
+
+					/*
+					 * Catalog tuple without data, emitted while catalog was
+					 * in the process of being rewritten.
+					 */
+					if (reloid == InvalidOid &&
+						change->data.tp.newtuple == NULL &&
+						change->data.tp.oldtuple == NULL)
+						goto change_done;
+					else if (reloid == InvalidOid)
+						elog(ERROR, "could not map filenode \"%s\" to relation OID",
+							 relpathperm(change->data.tp.relnode,
+										 MAIN_FORKNUM));
+
+					relation = RelationIdGetRelation(reloid);
+
+					if (relation == NULL)
+						elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
+							 reloid,
+							 relpathperm(change->data.tp.relnode,
+										 MAIN_FORKNUM));
+
+					if (!RelationIsLogicallyLogged(relation))
+						goto change_done;
+
+					/*
+					 * For now ignore sequence changes entirely. Most of the
+					 * time they don't log changes using records we
+					 * understand, so it doesn't make sense to handle the few
+					 * cases we do.
+					 */
+					if (relation->rd_rel->relkind == RELKIND_SEQUENCE)
+						goto change_done;
+
+					/* user-triggered change */
+					if (!IsToastRelation(relation))
+					{
+						ReorderBufferToastReplace(rb, txn, relation, change);
+						rb->stream_change(rb, txn, relation, change);
+
+						/*
+						 * Only clear reassembled toast chunks if we're sure
+						 * they're not required anymore. The creator of the
+						 * tuple tells us.
+						 */
+						if (change->data.tp.clear_toast_afterwards)
+							ReorderBufferToastReset(rb, txn);
+					}
+					/* we're not interested in toast deletions */
+					else if (change->action == REORDER_BUFFER_CHANGE_INSERT)
+					{
+						/*
+						 * Need to reassemble the full toasted Datum in
+						 * memory, to ensure the chunks don't get reused till
+						 * we're done; remove it from the list of this
+						 * transaction's changes. Otherwise it will get
+						 * freed/reused while restoring spooled data from
+						 * disk.
+						 */
+						dlist_delete(&change->node);
+						ReorderBufferToastAppendChunk(rb, txn, relation,
+													  change);
+					}
+
+			change_done:
+
+					/*
+					 * Either speculative insertion was confirmed, or it was
+					 * unsuccessful and the record isn't needed anymore.
+					 */
+					if (specinsert != NULL)
+					{
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+
+					if (relation != NULL)
+					{
+						RelationClose(relation);
+						relation = NULL;
+					}
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+
+					/*
+					 * Speculative insertions are dealt with by delaying the
+					 * processing of the insert until the confirmation record
+					 * arrives. For that we simply unlink the record from the
+					 * chain, so it does not get freed/reused while restoring
+					 * spooled data from disk.
+					 *
+					 * This is safe in the face of concurrent catalog changes
+					 * because the relevant relation can't be changed between
+					 * speculative insertion and confirmation due to
+					 * CheckTableNotInUse() and locking.
+					 */
+
+					/* clear out a pending (and thus failed) speculation */
+					if (specinsert != NULL)
+					{
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+
+					/* and memorize the pending insertion */
+					dlist_delete(&change->node);
+					specinsert = change;
+					break;
+
+				case REORDER_BUFFER_CHANGE_TRUNCATE:
+					{
+						int			i;
+						int			nrelids = change->data.truncate.nrelids;
+						int			nrelations = 0;
+						Relation   *relations;
+
+						relations = palloc0(nrelids * sizeof(Relation));
+						for (i = 0; i < nrelids; i++)
+						{
+							Oid			relid = change->data.truncate.relids[i];
+							Relation	relation;
+
+							relation = RelationIdGetRelation(relid);
+
+							if (relation == NULL)
+								elog(ERROR, "could not open relation with OID %u", relid);
+
+							if (!RelationIsLogicallyLogged(relation))
+								continue;
+
+							relations[nrelations++] = relation;
+						}
+
+						rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+						for (i = 0; i < nrelations; i++)
+							RelationClose(relations[i]);
+
+						break;
+					}
+
+				case REORDER_BUFFER_CHANGE_MESSAGE:
+
+					rb->stream_message(rb, txn, change->lsn, true,
+									   change->data.msg.prefix,
+									   change->data.msg.message_size,
+									   change->data.msg.message);
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+					/* get rid of the old */
+					TeardownHistoricSnapshot(false);
+
+					if (snapshot_now->copied)
+					{
+						ReorderBufferFreeSnap(rb, snapshot_now);
+						snapshot_now =
+							ReorderBufferCopySnap(rb, change->data.snapshot,
+												  txn, command_id);
+					}
+
+					/*
+					 * Restored from disk, need to be careful not to double
+					 * free. We could introduce refcounting for that, but for
+					 * now this seems infrequent enough not to care.
+					 */
+					else if (change->data.snapshot->copied)
+					{
+						snapshot_now =
+							ReorderBufferCopySnap(rb, change->data.snapshot,
+												  txn, command_id);
+					}
+					else
+					{
+						snapshot_now = change->data.snapshot;
+					}
+
+					/*
+					 * TOCHECK: Snapshot changed, then invalidate current schema to reflect
+					 * possible catalog changes.
+					 */
+					txn->is_schema_sent = false;
+
+					/* and continue with the new one */
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+					Assert(change->data.command_id != InvalidCommandId);
+
+					if (command_id < change->data.command_id)
+					{
+						command_id = change->data.command_id;
+
+						if (!snapshot_now->copied)
+						{
+							/* we don't use the global one anymore */
+							snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+																 txn, command_id);
+						}
+
+						snapshot_now->curcid = command_id;
+
+						TeardownHistoricSnapshot(false);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
+					}
+
+					break;
+
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					txn->is_schema_sent = false;
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+					elog(ERROR, "tuplecid value in changequeue");
+					break;
+			}
+		}
+
+		/*
+		 * There's a speculative insertion remaining; just clean it up, as it
+		 * can't have been successful, otherwise we'd have gotten a confirmation
+		 * record.
+		 */
+		if (specinsert)
+		{
+			ReorderBufferReturnChange(rb, specinsert);
+			specinsert = NULL;
+		}
+
+		/* clean up the iterator */
+		ReorderBufferStreamIterTXNFinish(rb, iterstate);
+		iterstate = NULL;
+
+		/* call stream_stop callback */
+		rb->stream_stop(rb, txn);
+
+		/* this is just a sanity check against bad output plugin behaviour */
+		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
+			elog(ERROR, "output plugin used XID %u",
+				 GetCurrentTransactionId());
+
+		/* remember the command ID and snapshot for the streaming run */
+		txn->command_id = command_id;
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+
+		/* cleanup */
+		TeardownHistoricSnapshot(false);
+
+		/*
+		 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+		 * any memory. We could also keep the hash table and update it with
+		 * new ctid values, but this seems simpler and good enough for now.
+		 */
+		ReorderBufferDestroyTupleCidHash(rb, txn);
+
+		/*
+		 * Aborting the current (sub-)transaction as a whole has the right
+		 * semantics. We want all locks acquired in here to be released, not
+		 * reassigned to the parent and we do not want any database access
+		 * to have persistent effects.
+		 */
+		AbortCurrentTransaction();
+
+		/* make sure there's no cache pollution */
+		ReorderBufferExecuteInvalidations(rb, txn);
+
+		if (using_subtxn)
+			RollbackAndReleaseCurrentSubTransaction();
+	}
+	PG_CATCH();
+	{
+		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+		if (iterstate)
+			ReorderBufferStreamIterTXNFinish(rb, iterstate);
+
+		TeardownHistoricSnapshot(true);
+
+		/*
+		 * Force cache invalidation to happen outside of a valid transaction
+		 * to prevent catalog access as we just caught an error.
+		 */
+		AbortCurrentTransaction();
+
+		/* make sure there's no cache pollution */
+		ReorderBufferExecuteInvalidations(rb, txn);
+
+		if (using_subtxn)
+			RollbackAndReleaseCurrentSubTransaction();
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/*
+	 * Discard the changes that we just streamed, and mark the transactions
+	 * as streamed (if they contained changes).
+	 */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 19c7bac8c0..7d08e2fd39 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* does the txn have catalog changes */
 #define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -187,6 +188,20 @@ typedef struct ReorderBufferChange
  */
 #define rbtxn_is_serialized(txn)       (txn->txn_flags & RBTXN_IS_SERIALIZED)
 
+/*
+ * Has this transaction been streamed to downstream? Similarly to spilling
+ * to disk, it's not trivial to deduce this from nentries and nentries_mem,
+ * for various reasons. For example, all changes may be in subtransactions
+ * in which case we'd have nentries==0 for the toplevel one, and it'd say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.
+ *
+ * Note: We never stream and serialize a transaction at the same time (we
+ * only spill to disk when streaming is not supported by the plugin),
+ * so only one of those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn)         (txn->txn_flags & RBTXN_IS_STREAMED)
+
 typedef struct ReorderBufferTXN
 {
 	int     txn_flags;
@@ -221,6 +236,16 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Do we need to send schema for this transaction in output plugin?
+	 */
+	bool		is_schema_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -251,6 +276,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.21.0

0008-fixup-add-is_schema_sent-back-v3.patch
From 6e512a0622a5a532aaa788b049ba028aa8fc0115 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 27 Dec 2019 23:02:38 +0100
Subject: [PATCH 08/17] fixup: add is_schema_sent back

---
 src/backend/replication/logical/reorderbuffer.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f02c47238a..0ab319182a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2229,6 +2229,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * about the message itself?
 					 */
 					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					txn->is_schema_sent = false;
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-- 
2.21.0

0009-fixup-get-rid-of-is_schema_sent-entirely-v3.patch
From 2a7baf771f161a165907b7f8038aed6d12e3cc36 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 27 Dec 2019 23:46:15 +0100
Subject: [PATCH 09/17] fixup: get rid of is_schema_sent entirely

We'll do this in the pgoutput.c code directly, not in reorderbuffer.
---
 .../replication/logical/reorderbuffer.c       | 26 ++-----------------
 src/include/replication/reorderbuffer.h       |  5 ----
 2 files changed, 2 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0ab319182a..85db15ea3b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2229,7 +2229,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * about the message itself?
 					 */
 					LocalExecuteInvalidationMessage(&change->data.inval.msg);
-					txn->is_schema_sent = false;
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
@@ -2729,9 +2728,6 @@ ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	for (i = 0; i < txn->ninvalidations; i++)
 		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
-
-	/* Invalidate current schema as well */
-	txn->is_schema_sent = false;
 }
 
 /*
@@ -2747,22 +2743,12 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 
-	/*
-	 * We read catalog changes from WAL that have not been sent yet, so
-	 * invalidate the current schema in order for the output plugin to
-	 * resend it.
-	 */
-	txn->is_schema_sent = false;
-
 	/*
 	 * TOCHECK: Mark toplevel transaction as having catalog changes too
 	 * if one of its children has.
 	 */
 	if (txn->toptxn != NULL)
-	{
 		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
-		txn->toptxn->is_schema_sent = false;
-	}
 }
 
 /*
@@ -3345,9 +3331,8 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
 		 * information about subtransactions, which could arrive after streaming start.
 		 */
-		if (!txn->is_schema_sent)
-			snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
-												 txn, command_id);
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
 	}
 
 	/*
@@ -3602,12 +3587,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 						snapshot_now = change->data.snapshot;
 					}
 
-					/*
-					 * TOCHECK: Snapshot changed, then invalidate current schema to reflect
-					 * possible catalog changes.
-					 */
-					txn->is_schema_sent = false;
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
 										  txn->xid);
@@ -3646,7 +3625,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 					 * about the message itself?
 					 */
 					LocalExecuteInvalidationMessage(&change->data.inval.msg);
-					txn->is_schema_sent = false;
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 7d08e2fd39..e2b8db0ff1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -236,11 +236,6 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
-	/*
-	 * Do we need to send schema for this transaction in output plugin?
-	 */
-	bool		is_schema_sent;
-
 	/*
 	 * Toplevel transaction for this subxact (NULL for top-level).
 	 */
-- 
2.21.0
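
For context, the pgoutput-side replacement could look roughly like the
sketch below. It is illustrative only (modeled on pgoutput's existing
per-relation RelationSyncCache, not the actual follow-up code): the
schema-sent flag lives in the relation cache entry and is cleared from
the relcache invalidation callback, instead of in ReorderBufferTXN.

#include "postgres.h"
#include "utils/hsearch.h"

static HTAB *RelationSyncCache;	/* hash of RelationSyncEntry, by relid */

typedef struct RelationSyncEntry
{
	Oid			relid;			/* hash key */
	bool		schema_sent;	/* schema already sent for this stream? */
} RelationSyncEntry;

static void
rel_sync_cache_relation_cb(Datum arg, Oid relid)
{
	RelationSyncEntry *entry;

	entry = (RelationSyncEntry *) hash_search(RelationSyncCache, &relid,
											  HASH_FIND, NULL);

	/* force the schema to be sent again for this relation */
	if (entry != NULL)
		entry->schema_sent = false;
}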

0010-Support-logical_decoding_work_mem-set-from-create-v3.patch
From 129547a473edd42fa66f1490d36f8aaede529867 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:51:04 +0530
Subject: [PATCH 10/17] Support logical_decoding_work_mem set from create
 subscription command

---
 doc/src/sgml/config.sgml                      | 21 +++++++++
 doc/src/sgml/ref/create_subscription.sgml     | 12 +++++
 src/backend/catalog/pg_subscription.c         |  1 +
 src/backend/commands/subscriptioncmds.c       | 44 ++++++++++++++++---
 .../libpqwalreceiver/libpqwalreceiver.c       |  3 ++
 src/backend/replication/logical/worker.c      |  1 +
 src/backend/replication/pgoutput/pgoutput.c   | 30 ++++++++++++-
 src/include/catalog/pg_subscription.h         |  3 ++
 src/include/replication/walreceiver.h         |  1 +
 9 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f..8b1923c9de 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..91790b0c95 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 68d88ff499..2a276482c1 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5408edcfc2..fbb447379f 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +690,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +721,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +778,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +815,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 545d2fcd05..0ab6855ad8 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 63ba0ae234..c80acd3eb0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1729,6 +1729,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 3483c1b877..cf6e03b9a7 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -18,6 +18,7 @@
 #include "replication/logicalproto.h"
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
+#include "utils/guc.h"
 #include "utils/int8.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
@@ -87,11 +88,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +139,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -171,7 +196,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3cb13d897e..10ea113e4d 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 41714eaf0c..1db706af54 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -162,6 +162,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
2.21.0

0011-Add-support-for-streaming-to-built-in-replication-v3.patch
From 83135d7f8db59c80c2ab3b1a5c5542d9b0400fdb Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 27 Dec 2019 23:05:20 +0100
Subject: [PATCH 11/17] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information
(e.g. the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere to
send the data anyway.
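
For reviewers, the protocol additions boil down to a handful of new
message types. Here is an illustrative C summary (the enum and its
name are not part of the patch; the action bytes and payloads match
the logicalrep_write_stream_* functions in proto.c below):

    /*
     * Illustrative summary of the streaming-related actions added to
     * the logical replication protocol (payloads use the usual pq
     * encoding, i.e. big-endian integers).
     */
    typedef enum StreamAction
    {
        STREAM_ACTION_START = 'S',  /* int32 xid, int32 first_segment */
        STREAM_ACTION_STOP = 'E',   /* int32 xid */
        STREAM_ACTION_COMMIT = 'c', /* int32 xid, uint8 flags,
                                     * int64 commit_lsn, int64 end_lsn,
                                     * int64 commit_time */
        STREAM_ACTION_ABORT = 'A'   /* int32 xid, int32 subxid
                                     * (subxid == xid for a toplevel
                                     * abort) */
    } StreamAction;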
---
 doc/src/sgml/ref/alter_subscription.sgml      |    5 +-
 doc/src/sgml/ref/create_subscription.sgml     |   12 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   60 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    8 +-
 src/backend/replication/logical/launcher.c    |    2 +
 src/backend/replication/logical/logical.c     |    4 +-
 src/backend/replication/logical/proto.c       |  157 ++-
 src/backend/replication/logical/worker.c      | 1031 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  263 ++++-
 src/backend/replication/slotfuncs.c           |    7 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 22 files changed, 2027 insertions(+), 42 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 6dfb2e4d3e..e1fb9075e1 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 91790b0c95..d9abf5e64c 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -218,6 +218,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 2a276482c1..15a6f5a8b3 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->workmem = subform->subworkmem;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index fbb447379f..b2b93d6234 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
 						   bool *refresh, int *logical_wm,
-						   bool *logical_wm_given)
+						   bool *logical_wm_given, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -92,6 +93,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*refresh = true;
 	if (logical_wm)
 		*logical_wm_given = false;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -186,6 +189,26 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 
 			*logical_wm_given = true;
 			*logical_wm = defGetInt32(defel);
+
+			/*
+			 * Check that the value is not smaller than 64kB (which is
+			 * the minimum value for logical_work_mem).
+			 */
+			if (*logical_wm < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("%d is outside the valid range for parameter \"work_mem\" (64 .. 2147483647)",
+								*logical_wm)));
+		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
 		}
 		else
 			ereport(ERROR,
@@ -332,6 +355,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	int			logical_wm;
 	bool		logical_wm_given;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -348,7 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL, &logical_wm, &logical_wm_given);
+							   NULL, &logical_wm, &logical_wm_given,
+							   &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -430,7 +456,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 		values[Anum_pg_subscription_subworkmem - 1] =
 			Int32GetDatum(logical_wm);
 	else
-		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(-1);
+
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
 
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
@@ -692,11 +726,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				int			logical_wm;
 				bool		logical_wm_given;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
 										   NULL, &synchronous_commit, NULL,
-										   &logical_wm, &logical_wm_given);
+										   &logical_wm, &logical_wm_given,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -728,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subworkmem - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -740,7 +784,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
 										   NULL, NULL, NULL, NULL, NULL,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -778,7 +822,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh, NULL, NULL);
+										   NULL, &refresh, NULL, NULL,
+										   NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -815,7 +860,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7410b2ff5e..a479ce9329 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4102,6 +4102,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 0ab6855ad8..9970170e47 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,8 +406,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
-		appendStringInfo(&cmd, ", work_mem '%d'",
-						 options->proto.logical.work_mem);
+		if (options->proto.logical.work_mem != -1)
+			appendStringInfo(&cmd, ", work_mem '%d'",
+							 options->proto.logical.work_mem);
+
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
 
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
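
For example, with work_mem = 65536 and streaming enabled, the option
list appended to the START_REPLICATION command would come out roughly
as follows (the publication name is illustrative, and the proto_version
value assumes the streaming protocol bump to 2 described later in the
patch):

    proto_version '2', work_mem '65536', streaming 'on', publication_names '"mypub"'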
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index c57b578b48..0a013ed220 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>
 
 #include "postgres.h"
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index b88b58505a..ad43ab365e 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1149,7 +1149,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index e7df47de3e..5a379fb6bc 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,7 +139,8 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
@@ -147,6 +148,10 @@ logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -182,8 +187,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -191,6 +196,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -252,13 +261,18 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
+	pq_sendbyte(out, 'D');		/* action DELETE */
+
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
-	pq_sendbyte(out, 'D');		/* action DELETE */
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
 
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
@@ -300,6 +314,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -309,6 +324,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -351,12 +370,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -401,7 +424,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -409,6 +432,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -689,3 +716,119 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgint(in, 4) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're in a stream block, so it must be valid) */
+	pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+	TransactionId xid;
+
+	xid = pq_getmsgint(in, 4);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (must be valid for a streamed transaction) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction IDs (must be valid for a streamed transaction) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
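
One consequence of the changes above worth spelling out: inside a
stream block, every data message ('R', 'Y', 'I', 'U', 'D', 'T') carries
an extra int32 subxact XID right after the action byte, while the
non-streamed layout stays unchanged. A minimal standalone sketch of
that framing, assuming the big-endian encoding used by pq_sendint32
(the function is hypothetical, not from the patch):

    #include <stdint.h>

    /*
     * Decode the action byte and, when inside a stream block, the
     * subxact XID that streaming prepends to each data message.
     * Returns a pointer to the start of the remaining payload.
     */
    static const unsigned char *
    read_streamed_header(const unsigned char *p, int in_stream_block,
                         char *action, uint32_t *xid)
    {
        *action = (char) *p++;
        *xid = 0;
        if (in_stream_block)
        {
            *xid = ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
                   ((uint32_t) p[2] << 8) | (uint32_t) p[3];
            p += 4;
        }
        return p;
    }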
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c80acd3eb0..cf053e948b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * also has to deal with aborts of both the toplevel transaction and of
+ * individual subtransactions. This is achieved by tracking the offset of
+ * each subtransaction's first change, which is then used to truncate the
+ * file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory of the default
+ * tablespace, and the filenames include both the XID of the toplevel
+ * transaction and the OID of the subscription. This is necessary so that
+ * different workers processing a remote transaction with the same XID
+ * don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -60,6 +82,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -67,6 +90,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -106,12 +130,59 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -163,6 +234,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the message to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -528,6 +635,318 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the serialized subxact info
+	 * for this transaction.
+	 *
+	 * XXX Note that for the first segment the cleanup of any previous
+	 * files is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're most
+		 * likely aborting one of the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* FIXME optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/* We should not receive aborts for unknown subtransactions. */
+		Assert(found);
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -541,6 +960,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -556,6 +978,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -591,6 +1016,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -694,6 +1122,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -814,6 +1245,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -913,6 +1347,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1004,6 +1441,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1100,6 +1553,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+}
+
 /*
  * Apply main loop.
  */
@@ -1116,6 +1585,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1564,6 +2036,564 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main
+ * file. The file is always overwritten as a whole, and we also include a
+ * CRC32C checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forewer.
+	 *
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so we can simply ignore it (any further change for that
+	 * subxact necessarily comes later than its first change, whose offset we
+	 * have already recorded).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the subxact array. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Clean up the XID from the array - find the XID in the array and
+	 * remove it by moving the last element into its place. The array is
+	 * bound to be fairly small (the maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions), so we simply loop
+	 * through it to find the index of the XID.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect a few
+	 * of them in progress (max_connections + max_prepared_xacts) so a
+	 * linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. a sorted array,
+		 * to speed up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing a stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (not counting the
+ * length field itself), an action code (identifying the message type) and
+ * the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -1730,6 +2760,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.work_mem = MySubscription->workmem;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
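
Since the on-disk layout of the .changes spool only appears in comments
above, here is a small standalone reader sketch, under the same
assumptions as stream_write_change (a native-endian int32 length
followed by that many bytes, the first of which is the action
character); the tool itself is hypothetical, not part of the patch:

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Walk a "logical-<subid>-<xid>.changes" spool file and count the
     * serialized records per action byte. Each record is: int32 len,
     * then len bytes (the action character followed by the message
     * payload, with the subxact XID already stripped).
     */
    int
    main(int argc, char **argv)
    {
        FILE       *f;
        int         len;
        long        counts[256] = {0};

        if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
            return 1;

        while (fread(&len, sizeof(len), 1, f) == 1)
        {
            unsigned char *buf;

            if (len <= 0)
                break;          /* corrupt length */
            buf = malloc(len);
            if (buf == NULL || fread(buf, 1, len, f) != (size_t) len)
                break;          /* truncated or malformed record */
            counts[buf[0]]++;   /* first byte is the action */
            free(buf);
        }
        fclose(f);

        for (int a = 0; a < 256; a++)
            if (counts[a] > 0)
                printf("action '%c': %ld\n", a, counts[a]);
        return 0;
    }

The companion .subxacts file is simpler: a uint32 CRC-32C checksum, a
uint32 count, and then that many (xid, offset) pairs, rewritten as a
whole by subxact_info_write() at each stream stop.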
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index cf6e03b9a7..8490ea4717 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -45,16 +45,42 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order in which the transactions are sent. So streamed transactions
+ * are handled separately, using the schema_sent flag in ReorderBufferTXN.
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -64,6 +90,7 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
@@ -84,16 +111,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, int *logical_decoding_work_mem)
+						List **publication_names, int *logical_decoding_work_mem,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		work_mem_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -162,6 +199,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*logical_decoding_work_mem = (int)parsed;
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -174,6 +228,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -197,7 +252,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&logical_decoding_work_mem);
+								&logical_decoding_work_mem,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -217,6 +273,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient protocol version, and only when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -284,9 +361,42 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't
+	 * care whether it's a top-level transaction or a subtransaction (we
+	 * have already sent that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then), and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change,
+		 * which may occur after streaming has already started, so we have
+		 * to track new catalog changes somehow.
+		 */
+		schema_sent = txn->is_schema_sent;
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -312,19 +422,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			txn->is_schema_sent = true;
+		else
+			relentry->schema_sent = true;
 	}
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -333,6 +450,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -361,14 +482,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -378,7 +499,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -387,7 +508,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -413,6 +534,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -433,13 +558,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -512,6 +638,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -622,6 +833,34 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+	}
+
+}
+
 /*
  * Relcache invalidation callback
  */
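
With the checks in pgoutput_startup above, a client that requests streaming
while still speaking the old protocol version is rejected up front. A sketch
of such a session (placeholder slot/publication names):

    START_REPLICATION SLOT "s1" LOGICAL 0/0
        (proto_version '1', publication_names '"pub1"', streaming 'on')
    ERROR:  requested proto_version=1 does not support streaming, need 2 or higher
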
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index ba08ad405f..8eb3160041 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,13 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1f23665432..2e0743ac8f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -968,6 +968,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 10ea113e4d..8793676258 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -50,6 +50,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	int32		subworkmem;		/* Memory to use to decode changes. */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -76,6 +78,7 @@ typedef struct Subscription
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	int			workmem;		/* Memory to decode changes. */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
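
The new substream flag is meant to be driven by a subscription option, the
same way as the other fields here; the TAP test changes later in the series
exercise it like this (the connection string is a placeholder):

    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION tap_pub
        WITH (streaming = on);
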
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f2e873d048..c522703d8c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -945,7 +945,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 3fc430af01..bf02cbc19d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out, TransactionId xid);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
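
Taken together, these read/write functions define the expected message flow
for a single large transaction - a sequence of streamed blocks followed by a
final commit or abort, roughly:

    stream_start (xid, first_segment)
        ... DML / relation / type messages, each tagged with the XID ...
    stream_stop (xid)
    ... more start/stop blocks, possibly interleaved with other transactions ...
    stream_commit (xid, commit_lsn)    -- or stream_abort (xid, subxid)
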
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1db706af54..3d19b5d88e 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -163,6 +163,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			int			work_mem;	/* Memory limit to use for decoding */
+			bool		streaming;	/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
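
As a rough sizing check for the test above: each inserted row carries an int
key plus a 32-character md5() string, so the 4998 inserts alone amount to
roughly 200kB of decoded changes even before the per-change overhead, which
comfortably exceeds the 64kB logical_decoding_work_mem and forces the
transaction onto the streaming path.
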
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of a large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check data replicated correctly after mid-transaction DDL');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of a large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rollbacks to savepoints were replicated correctly');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check streamed transaction with DDL and rollbacks was applied correctly');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.21.0

0012-fixup-add-proper-schema-tracking-v3.patch
From 7a45713c4161e4efdf82148f56af3328283b20e9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 27 Dec 2019 23:56:04 +0100
Subject: [PATCH 12/17] fixup: add proper schema tracking

---
 src/backend/replication/pgoutput/pgoutput.c | 45 ++++++++++++++++++++-
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 8490ea4717..0148f4c01e 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -82,6 +82,8 @@ typedef struct RelationSyncEntry
 	Oid			relid;			/* relation oid */
 	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
+	List	   *streamed_txns;	/* streamed toplevel transactions with
+								 * this schema */
 	bool		replicate_valid;
 	PublicationActions pubactions;
 } RelationSyncEntry;
@@ -96,6 +98,11 @@ static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -366,6 +373,7 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 {
 	bool	schema_sent;
 	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
 
 	/*
 	 * Remember the XID of the (sub)transaction for the change. We don't
 	 * care whether it's a top-level transaction or a subtransaction (we
 	if (in_streaming)
 		xid = change->txn->xid;
 
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
 	/*
 	 * Do we need to send the schema? We do track streamed transactions
 	 * separately, because those may not be applied later (and the regular
@@ -391,7 +404,7 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		 * which may occur after streaming has already started, so we have
 		 * to track new catalog changes somehow.
 		 */
-		schema_sent = txn->is_schema_sent;
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
 	}
 	else
 		schema_sent = relentry->schema_sent;
@@ -432,7 +445,7 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->xid = change->txn->xid;
 
 		if (in_streaming)
-			txn->is_schema_sent = true;
+			set_schema_sent_in_streamed_txn(relentry, topxid);
 		else
 			relentry->schema_sent = true;
 	}
@@ -759,6 +772,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  */
-- 
2.21.0

0013-Track-statistics-for-streaming-v3.patch
From 0b9a94e41e81ef0ce58ada49da55cb10e2cc5a1f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 2 Dec 2019 09:58:50 +0530
Subject: [PATCH 13/17] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 25 +++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 13 ++++++++
 src/backend/replication/walsender.c           | 32 ++++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dcb58115af..180ea880a4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1996,6 +1996,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber
+      after the memory used by logical decoding exceeds
+      <literal>logical_decoding_work_mem</literal>. Streaming only works with
+      toplevel transactions (subtransactions can't be streamed independently),
+      so the counter does not get incremented for subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the subscriber.
+      Transactions may get streamed repeatedly, and this counter gets incremented
+      on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f7800f01a6..58976118db 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -779,7 +779,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 85db15ea3b..65f876d0f0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -358,6 +358,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3709,6 +3713,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 	PG_END_TRY();
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count transactions that have already been streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	/*
 	 * Discard the changes that we just streamed, and mark the transactions
 	 * as streamed (if they contained changes).
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2e0743ac8f..9f93f11d8a 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1293,7 +1293,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1314,7 +1314,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that were spilled to disk or
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2357,6 +2358,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3185,7 +3189,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3242,6 +3246,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3265,6 +3272,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3351,6 +3361,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3598,12 +3613,19 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillTxns = rb->spillTxns;
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
+
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ac8f64b219..3b897a5d70 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5166,9 +5166,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e2b8db0ff1..e132c3c5ea 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -506,15 +506,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index a6b32051ac..7efc332319 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80a07825b9..5ab21a80d0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1955,9 +1955,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
2.21.0
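
With 0013 applied, the new counters can be inspected on the publisher just
like the existing spill statistics, e.g.:

    SELECT application_name, stream_txns, stream_count,
           pg_size_pretty(stream_bytes) AS stream_bytes
    FROM pg_stat_replication;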

0014-Enable-streaming-for-all-subscription-TAP-tests-v3.patch
From 3b857c0da517c7aed243253e8411767a0e973d63 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH 14/17] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 77a1560b23..8cd1993393 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -65,7 +65,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 81547f65fa..8dfeafc772 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7332..0c9c6b3dd4 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.21.0

0015-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v3.patch
From 016b2a43955bb56f3a439d7a003fe9d0d6412125 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH 15/17] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 65f876d0f0..5883d14a02 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1544,6 +1544,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
+		/* make sure subtxn has final_lsn */
+		if (subtxn->final_lsn == InvalidXLogRecPtr)
+			subtxn->final_lsn = txn->final_lsn;
+
 		/*
 		 * Subtransactions are always associated to the toplevel TXN, even if
 		 * they originally were happening inside another subtxn, so we won't
-- 
2.21.0

0016-Add-TAP-test-for-streaming-vs.-DDL-v3.patch
From 9fd14e817ba569654dc32ef75aadde7300ad1b0f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH 16/17] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.21.0

0017-Extend-handling-of-concurrent-aborts-for-streamin-v3.patch
From c9d0c5da2641d0746657d465893bb65eb9d73b7e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 22 Nov 2019 12:43:38 +0530
Subject: [PATCH 17/17] Extend handling of concurrent aborts for streaming
 transaction

---
 .../replication/logical/reorderbuffer.c       | 36 +++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  5 +++
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5883d14a02..0987032208 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2349,9 +2349,9 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 
 	/*
 	 * When the (sub)transaction was streamed, notify the remote node
-	 * about the abort.
+	 * about the abort only if we have sent any data for this transaction.
 	 */
-	if (rbtxn_is_streamed(txn))
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
 		rb->stream_abort(rb, txn, lsn);
 
 	/* cosmetic... */
@@ -3267,6 +3267,7 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	volatile CommandId command_id;
 	bool		using_subtxn;
 	Size		streamed = 0;
+	MemoryContext ccxt = CurrentMemoryContext;
 	ReorderBufferStreamIterTXNState *volatile iterstate = NULL;
 
 	/*
@@ -3396,6 +3397,13 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			/* we're going to stream this change */
 			streamed++;
 
+			/*
+			 * Set CheckXidAlive to the current (sub)xid to which this
+			 * change belongs, so that we can detect the abort while we are
+			 * decoding.
+			 */
+			CheckXidAlive = change->txn->xid;
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -3457,6 +3465,10 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 						ReorderBufferToastReplace(rb, txn, relation, change);
 						rb->stream_change(rb, txn, relation, change);
 
+						/* Remember that we have sent some data for this txn.*/
+						if (!change->txn->any_data_sent)
+							change->txn->any_data_sent = true;
+
 						/*
 						 * Only clear reassembled toast chunks if we're sure
 						 * they're not required anymore. The creator of the
@@ -3695,6 +3707,9 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferStreamIterTXNFinish(rb, iterstate);
@@ -3713,7 +3728,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		PG_RE_THROW();
+		/* re-throw only if it's not an abort */
+		if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+		else
+		{
+			/* remember the command ID and snapshot for the streaming run */
+			txn->command_id = command_id;
+			txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+													  txn, command_id);
+			rb->stream_stop(rb, txn);
+
+			FlushErrorState();
+		}
 	}
 	PG_END_TRY();
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e132c3c5ea..6186465a85 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -236,6 +236,11 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Have we sent any changes for this transaction in output plugin?
+	 */
+	bool		any_data_sent;
+
 	/*
 	 * Toplevel transaction for this subxact (NULL for top-level).
 	 */
-- 
2.21.0

#168Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#167)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Dec 10, 2019 at 10:23:19AM +0530, Dilip Kumar wrote:

On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:

I have rebased the patch on the latest head and also fix the issue of
"concurrent abort handling of the (sub)transaction." and attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set. I have added the version number so that we
can track the changes.

The patch has rotten a bit and does not apply anymore. Could you
please send a rebased version? I have moved it to next CF, waiting on
author.

I have rebased the patch set on the latest head.

Apart from this, there is one issue reported by my colleague Vignesh.
The issue is that if we use more than two relations in a transaction
then there is an error on standby (no relation map entry for remote
relation ID 16390). After analyzing this I have found that for a
streaming transaction an "is_schema_sent" flag is kept in
ReorderBufferTXN. I think that is done so that we can send the
schema for each transaction stream, so that if any subtransaction gets
aborted we don't lose the logical WAL for that schema. But this
solution has introduced a very basic issue: if a transaction operates
on more than one relation, then after sending the schema for the first
relation it marks the flag true, and the schema for the subsequent
relations is never sent.

How about keeping a list of top-level xids in each RelationSyncEntry?
Basically, whenever we send the schema for any transaction, we note
that in the RelationSyncEntry, and at abort time we can remove the xid
from the list. Now, whenever we check whether to send the schema for
any operation in a transaction, we will check if our xid is present in
that list for a particular RelationSyncEntry and take an action based
on that (if the xid is present, then we won't send the schema;
otherwise, send it).

The idea makes sense to me. I will try to write a patch for this and test it.
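For concreteness, the bookkeeping described above could look roughly
like this, keeping a list of streamed top-level xids per relation
entry (the struct layout and function name here are illustrative, not
actual patch code):

typedef struct RelationSyncEntry
{
    Oid         relid;          /* relation OID (hash key) */
    bool        schema_sent;    /* schema sent for non-streamed txns? */
    List       *streamed_txns;  /* top-level xids of streamed txns that
                                 * have already received the schema */
} RelationSyncEntry;

/* has the schema already been sent within this streamed transaction? */
static bool
get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    return list_member_int(entry->streamed_txns, xid);
}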

Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it
needs to be in the RelationSyncEntry. In fact, I already have code for
that in my private repository - I thought the patches I sent here do
include this, but apparently I forgot to include this bit :-(

Attached is a rebased patch series, fixing this. It's essentially v2
with a couple of patches (0003, 0008, 0009 and 0012) replacing the
is_schema_sent with correct handling.

0003 - removes an is_schema_sent reference added prematurely (it's added
by a later patch, causing compile failure)

0008 - adds the is_schema_sent back (essentially reverting 0003)

0009 - removes is_schema_sent entirely

0012 - adds the correct handling of schema flags in pgoutput

I don't know what other changes you've made since v2, so this way it
should be possible to just take 0003, 0008, 0009 and 0012 and slip them
in with minimal hassle.

FWIW thanks to everyone (and Amit and Dilip in particular) working on
this patch series. There's been a lot of great reviews and improvements
since I abandoned this thread for a while. I expect to be able to spend
more time working on this in January.

+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+ MemoryContextSwitchTo(oldctx);
+}
I was looking into the schema tracking solution and I have one
question: shouldn't we remove the topxid from the list if the
(sub)transaction is aborted? Because once it is aborted we need to
resend the schema.  I think we can remove the xid from the list in the
cleanup_rel_sync_cache function?
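For illustration, the cleanup counterpart could be as simple as the
following sketch (whether cleanup_rel_sync_cache is the right caller
is exactly the open question above):

static void
cleanup_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    /* forget the xid, so the schema is re-sent after the abort */
    entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
}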

I have observed some more issues

1. Currently, in ReorderBufferCommit it is always expected that
whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, and in
SPEC_CONFIRM we send the tuple we got in SPEC_INSERT. But now those
two messages can be in different streams, so we need to find a way to
handle this. Maybe once we get SPEC_INSERT we can remember the
tuple, and then if we get the SPEC_CONFIRM in the next stream we can
send that tuple?

2. At commit time, in DecodeCommit, we check whether we need to skip
the changes of the transaction by calling SnapBuildXactNeedsSkip. But
since we now support streaming, it's possible that before we decode
the commit WAL we have already sent the changes to the output plugin,
even though we could have skipped those changes. So my question is:
instead of checking at commit time, can't we check before adding the
changes to the ReorderBuffer itself, or truncate the changes if
SnapBuildXactNeedsSkip is true whenever the logical_decoding_work_mem
limit is reached? Am I missing something here?
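To sketch the truncate idea: at the point where the memory limit
triggers, the choice could hypothetically look like this (assuming the
reorderbuffer had access to the snapshot builder, which it currently
does not, and using helper names from this patch series):

    /* memory limit reached, pick the largest transaction */
    txn = ReorderBufferLargestTXN(rb);

    if (SnapBuildXactNeedsSkip(builder, txn->first_lsn))
        ReorderBufferTruncateTXN(rb, txn);  /* changes would be skipped anyway */
    else
        ReorderBufferStreamTXN(rb, txn);    /* stream to the output plugin */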

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#169Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#153)
19 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yesterday, Tomas posted the latest version of the patch set, which
contains the fix for the schema-send part. Meanwhile, I was working on
a few review comments/bugfixes and refactoring. I have tried to merge
those changes with the latest patch set, except for the refactoring
related to the "0006-Implement-streaming-mode-in-ReorderBuffer" patch,
because Tomas has also made some changes in the same patch. I have
created a separate patch for those so that we can review the changes
and then merge them into the main patch.

On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have review the patch set and here are few comments/questions

1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

Should we show the tuple in the streamed change like we do for the
pg_decode_change?

I think so. The patch shows the message in
pg_decode_stream_message(), so why prohibit showing the tuple here?

Yeah, we can do that. One option is to directly register the
"pg_decode_change" function as the stream_change_cb callback, and that
will show the tuple; another option is to write a function similar to
pg_decode_change but change the message to include the text "STREAM",
so that the user can distinguish between a tuple from a committed
transaction and one from an in-progress transaction.
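The second option might look roughly like this in test_decoding (a
sketch; the tuple-printing part is abbreviated):

static void
pg_decode_stream_change(LogicalDecodingContext *ctx,
                        ReorderBufferTXN *txn,
                        Relation relation,
                        ReorderBufferChange *change)
{
    OutputPluginPrepareWrite(ctx, true);
    appendStringInfo(ctx->out, "STREAM change for TXN %u: ", txn->xid);
    /* ... print the relation and tuple data just as pg_decode_change does ... */
    OutputPluginWrite(ctx, true);
}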

While analyzing this solution I have encountered one more issue. The
problem is that currently, at commit time in DecodeCommit, we check
whether we need to skip the changes of the transaction by calling
SnapBuildXactNeedsSkip. But since we now support streaming, it's
possible that before the commit WAL arrives we have already sent the
changes to the output plugin, even though we could have skipped those
changes. So my question is: instead of checking at commit time, can't
we check before adding to the ReorderBuffer itself, or truncate the
changes if SnapBuildXactNeedsSkip is true whenever the
logical_decoding_work_mem limit is reached?

Few comments on this patch series:

0001-Immediately-WAL-log-assignments:
------------------------------------------------------------

The commit message still refers to the old design for this patch. I
think you need to modify the commit message as per the latest patch.

Done

0002-Issue-individual-invalidations-with-wal_level-log
----------------------------------------------------------------------------
1.
xact_desc_invalidations(StringInfo buf,
{
..
+ else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+ appendStringInfo(buf, " snapshot %u", msg->sn.relId);

You have removed logging for the above cache but forgot to remove its
reference from one of the places. Also, I think you need to add a
comment somewhere in inval.c to say why you are writing WAL for
some types of invalidations and not for others?

Done

0003-Extend-the-output-plugin-API-with-stream-methods
--------------------------------------------------------------------------------
1.
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_message_cb</function> are optional.

stream_message_cb is mentioned twice. It seems the second one is for truncate.

Done

2.
size of the transaction size and network bandwidth, the transfer time
+ may significantly increase the apply lag.

/size of the transaction size/size of the transaction

no need to mention size twice.

Done

3.
+    Similarly to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress
transactions)
+    exceeds limit defined by <varname>logical_work_mem</varname> setting.

The guc name used is wrong. /Similarly to/Similar to/

Done

4.
stream_start_cb_wrapper()
{
..
+ /* state.report_location = apply_lsn; */
..
+ /* FIXME ctx->write_location = apply_lsn; */
..
}

See if we can fix these and similar ones in the callback for the stop.
I think we don't have final_lsn till we commit/abort. Can we compute
it before calling these APIs?

Done

0005-Gracefully-handle-concurrent-aborts-of-uncommitte
----------------------------------------------------------------------------------
1.
@@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_CATCH();
{
/* TODO: Encapsulate cleanup
from the PG_TRY and PG_CATCH blocks */
+
if (iterstate)
ReorderBufferIterTXNFinish(rb, iterstate);

Spurious line change.

Done

2. The commit message of this patch refers to Prepared transactions.
I think that needs to be changed.

0006-Implement-streaming-mode-in-ReorderBuffer
-------------------------------------------------------------------------
1.
+/* iterator for streaming (only get data from memory) */
+static ReorderBufferStreamIterTXNState *ReorderBufferStreamIterTXNInit(ReorderBuffer *rb,
+                                                                       ReorderBufferTXN *txn);
+
+static ReorderBufferChange *ReorderBufferStreamIterTXNNext(ReorderBuffer *rb,
+                                                           ReorderBufferStreamIterTXNState *state);
+
+static void ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
+                                             ReorderBufferStreamIterTXNState *state);

Do we really need to introduce new APIs for iterating over changes
from streamed transactions? Why can't we reuse the same APIs as we
use for committed xacts?

Done

2.
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)

Please write some comments atop ReorderBufferStreamCommit.

Done

3.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
..
+ if (txn->snapshot_now == NULL)
+ {
+     dlist_iter subxact_i;
+
+     /* make sure this transaction is streamed for the first time */
+     Assert(!rbtxn_is_streamed(txn));
+
+     /* at the beginning we should have invalid command ID */
+     Assert(txn->command_id == InvalidCommandId);
+
+     dlist_foreach(subxact_i, &txn->subtxns)
+     {
+         ReorderBufferTXN *subtxn;
+
+         subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+
+         if (subtxn->base_snapshot != NULL &&
+             (txn->base_snapshot == NULL ||
+              txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
+         {
+             txn->base_snapshot = subtxn->base_snapshot;

The logic here seems to be correct, but I am not sure why it does not
purge the base snapshot before assigning the subtxn's snapshot, and
similarly why we have not purged the snapshot for the subtxn once we
are done with it. I think we can use
ReorderBufferTransferSnapToParent to replace part of the logic here.
Do you see any reason for doing things differently here?

Done

4. In ReorderBufferStreamTXN, why do you need to use
ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now?

IMHO, instead of directly copying the base snapshot we are
modifying it by passing the command id, and that's the reason we are
copying it.
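In other words, the copy exists so the command id can be stamped onto
a private snapshot without modifying the shared base snapshot,
roughly:

    /* take a private copy so we can set curcid for this streaming run */
    snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
                                         txn, command_id);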

5. I see a lot of code similarity between ReorderBufferStreamTXN and
the existing ReorderBufferCommit. I understand that there are some
subtle differences due to which we need to write this new function,
but can't we encapsulate the specific parts of the code in functions
and then call them from both places? I am talking about the code in
the different cases for change->action.

Done
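One possible shape of that refactoring (purely illustrative) is a
shared per-change helper that both code paths call with a flag:

/*
 * Apply a single change, either via the regular apply callback (commit
 * path) or via the streaming callback (in-progress path).
 */
static void
ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
                         Relation relation, ReorderBufferChange *change,
                         bool streaming)
{
    if (streaming)
        rb->stream_change(rb, txn, relation, change);
    else
        rb->apply_change(rb, txn, relation, change);
}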

6. + * Note: We never stream and serialize a transaction at the same time (e
/(e/(we

Done

I have also found one bug in
"v3-0012-fixup-add-proper-schema-tracking.patch" due to which some of
the streaming test cases were failing; I have created a separate patch
to fix it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v4-0001-Immediately-WAL-log-assignments.patch
From 4fd022529dc2c4f5c4d96c0e69537ae5f40684e7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 09:53:08 +0530
Subject: [PATCH v4 01/19] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is
required to avoid overflowing the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 45 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 +++++++++++++--------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5353b6a..708e523 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -5096,6 +5097,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6002,3 +6004,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index aa9dca0..a8a8084 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 67418b0..4435c63 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1165,6 +1165,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1203,6 +1204,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index bc532d0..897b755 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 record->toplevel_xid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 9d2899d..5b9740c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 3fea199..b1976ac 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 0193611..a676151 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -147,6 +147,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -280,6 +282,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 9375e54..bcfba0a 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

v4-0002-Issue-individual-invalidations-with-wal_level-log.patch
From c067b209589fa6288c6aef48955a21bd8ee2732d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH v4 02/19] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations was accumulating all the invalidations in
memory, and then only wrote them once at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c          | 50 ++++++++++++++++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 23 +++++++++
 src/backend/replication/logical/reorderbuffer.c | 56 +++++++++++++++++---
 src/backend/utils/cache/inval.c                 | 69 +++++++++++++++++++++++++
 src/include/access/xact.h                       | 18 ++++++-
 src/include/replication/reorderbuffer.h         | 14 +++++
 7 files changed, 229 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 4c411c5..7ecdc0e 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,11 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -397,6 +402,14 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs,
+								xlrec->dbId, xlrec->tsId,
+								xlrec->relcacheInitFileInval);
+	}
 }
 
 const char *
@@ -424,7 +437,44 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval)
+{
+	int			i;
+
+	if (relcacheInitFileInval)
+		appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
+						 dbId, tsId);
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 708e523..da15556 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6001,6 +6001,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 897b755..9bcefb6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,29 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/* XXX for now we're issuing invalidations one by one */
+				Assert(invals->nmsgs == 1);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->dbId, invals->tsId,
+											 invals->relcacheInitFileInval,
+											 invals->msgs[0]);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 53affeb..b1feff3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -464,6 +464,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -1804,17 +1805,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					txn->is_schema_sent = false;
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2209,6 +2216,38 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn,
+							 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+							 SharedInvalidationMessage msg)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	Assert(xid != InvalidTransactionId);
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.dbId = dbId;
+	change->data.inval.tsId = tsId;
+	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
+	change->data.inval.msg = msg;
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2656,6 +2695,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -2752,6 +2792,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -3027,6 +3068,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index f09e3a9..0682c55 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -104,6 +104,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +211,9 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -489,6 +493,18 @@ RegisterCatcacheInvalidation(int cacheId,
 {
 	AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   cacheId, hashValue, dbId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cc.id = (int8) cacheId;
+		msg.cc.dbId = dbId;
+		msg.cc.hashValue = hashValue;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -501,6 +517,18 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 {
 	AddCatalogInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								  dbId, catId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cat.id = SHAREDINVALCATALOG_ID;
+		msg.cat.dbId = dbId;
+		msg.cat.catId = catId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -511,6 +539,8 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 static void
 RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 {
+	bool		RelcacheInitFileInval = false;
+
 	AddRelcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
 
@@ -529,7 +559,22 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 	 * as well.  Also zap when we are invalidating whole relcache.
 	 */
 	if (relId == InvalidOid || RelationIdIsInInitFile(relId))
+	{
 		transInvalInfo->RelcacheInitFileInval = true;
+		RelcacheInitFileInval = true;
+	}
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.rc.id = SHAREDINVALRELCACHE_ID;
+		msg.rc.dbId = dbId;
+		msg.rc.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, RelcacheInitFileInval);
+	}
 }
 
 /*
@@ -1501,3 +1546,27 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval)
+{
+	xl_xact_invalidations xlrec;
+
+	/* prepare record */
+	memset(&xlrec, 0, sizeof(xlrec));
+	xlrec.dbId = MyDatabaseId;
+	xlrec.tsId = MyDatabaseTableSpace;
+	xlrec.relcacheInitFileInval = relcacheInitFileInval;
+	xlrec.nmsgs = nmsgs;
+
+	/* perform insertion */
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+	XLogRegisterData((char *) msgs,
+					 nmsgs * sizeof(SharedInvalidationMessage));
+	XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b9740c..82d4942 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,22 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ *
+ * XXX Currently nmsgs=1 but that might change in the future.
+ */
+typedef struct xl_xact_invalidations
+{
+	Oid			dbId;			/* MyDatabaseId */
+	Oid			tsId;			/* MyDatabaseTableSpace */
+	bool		relcacheInitFileInval;	/* invalidate relcache init file */
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0867ee9..6a7187b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,16 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			Oid			dbId;	/* MyDatabaseId */
+			Oid			tsId;	/* MyDatabaseTableSpace */
+			bool		relcacheInitFileInval;	/* invalidate relcache init
+												 * file */
+			SharedInvalidationMessage msg;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -448,6 +459,9 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void		ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+										 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+										 SharedInvalidationMessage msg);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1
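
For illustration, the decode-side consumer of the new XLOG_XACT_INVALIDATIONS
records might look roughly like the sketch below. The function name and the
decode.c plumbing are assumptions (that hunk is elsewhere in 0002); only
xl_xact_invalidations and ReorderBufferAddInvalidation come from the patch:

static void
DecodeXactInvalidations(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
	int			i;
	XLogReaderState *r = buf->record;
	xl_xact_invalidations *invals;

	invals = (xl_xact_invalidations *) XLogRecGetData(r);

	/* queue each invalidation message into the transaction that issued it */
	for (i = 0; i < invals->nmsgs; i++)
		ReorderBufferAddInvalidation(ctx->reorder, XLogRecGetXid(r),
									 buf->origptr,
									 invals->dbId, invals->tsId,
									 invals->relcacheInitFileInval,
									 invals->msgs[i]);
}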

Attachment: v4-0003-fixup-is_schema_sent-set-too-early.patch (application/octet-stream)
From 5c284042e2a21f7f7ed7bdc84a5eccec77e4c05f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 27 Dec 2019 22:50:55 +0100
Subject: [PATCH v4 03/19] fixup: is_schema_sent set too early

---
 src/backend/replication/logical/reorderbuffer.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b1feff3..c0b9725 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1819,7 +1819,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * about the message itself?
 					 */
 					LocalExecuteInvalidationMessage(&change->data.inval.msg);
-					txn->is_schema_sent = false;
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-- 
1.8.3.1

Attachment: v4-0004-Extend-the-output-plugin-API-with-stream-methods.patch (application/octet-stream)
From 8a7ab4d4c83e2453ea926727c3c7985d4e0f7222 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v4 04/19] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 6c33c4b..9c77791 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -542,3 +571,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db9686..ace21ec 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are five required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and one optional callback
+    (<function>stream_message_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to the spill-to-disk behavior, streaming is triggered when the
+    total amount of changes decoded from the WAL (for all in-progress
+    transactions) exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting. At that point the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 7e06615..57edf54 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the change/start/stop/commit/abort
+	 * callbacks. The message and truncate callbacks are optional, similarly
+	 * to regular output plugins. We however consider streaming enabled when
+	 * at least one of the callbacks is defined, so that missing required
+	 * callbacks are easy to identify.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -863,6 +911,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 6879a2e..1e934d2 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index d4ce54f..a305462 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6a7187b..5b4be2b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -345,6 +345,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -384,6 +430,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
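
To make the required/optional split concrete: a minimal streaming-capable
plugin only has to wire up the five mandatory callbacks. Setting any one of
them marks the context as streaming-enabled, and the wrappers then raise an
ERROR if one of the required five was left NULL. A sketch (the my_* names
are made up; the signatures are the ones added by this patch):

static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* e.g. emit a "start of chunk for txn->xid" marker */
}

static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
}

static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)
{
}

static void
my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 XLogRecPtr commit_lsn)
{
}

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* regular begin/change/commit callbacks omitted for brevity */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_change_cb = my_stream_change;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_abort_cb = my_stream_abort;
	/* stream_message_cb and stream_truncate_cb are optional, left NULL */
}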

Attachment: v4-0005-Cleaning-up-of-flags-in-ReorderBufferTXN-structur.patch (application/octet-stream)
From 5ebe0327a929d10d4fa82f27e4828dc60c6e2701 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 18:08:37 +0200
Subject: [PATCH v4 05/19] Cleaning up of flags in ReorderBufferTXN structure

---
 src/backend/replication/logical/reorderbuffer.c | 36 ++++++++++++-------------
 src/include/replication/reorderbuffer.h         | 33 ++++++++++++++---------
 2 files changed, 38 insertions(+), 31 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c0b9725..f74c199 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -732,7 +732,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_first_lsn < cur_txn->first_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_first_lsn = cur_txn->first_lsn;
 	}
@@ -752,7 +752,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
 	}
@@ -775,7 +775,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
 
 	txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
 
-	Assert(!txn->is_known_as_subxact);
+	Assert(!rbtxn_is_known_subxact(txn));
 	Assert(txn->first_lsn != InvalidXLogRecPtr);
 	return txn;
 }
@@ -835,7 +835,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 
 	if (!new_sub)
 	{
-		if (subtxn->is_known_as_subxact)
+		if (rbtxn_is_known_subxact(subtxn))
 		{
 			/* already associated, nothing to do */
 			return;
@@ -851,7 +851,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 		}
 	}
 
-	subtxn->is_known_as_subxact = true;
+	subtxn->txn_flags |= RBTXN_IS_SUBXACT;
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
@@ -1061,7 +1061,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	{
 		ReorderBufferChange *cur_change;
 
-		if (txn->serialized)
+		if (rbtxn_is_serialized(txn))
 		{
 			/* serialize remaining changes */
 			ReorderBufferSerializeTXN(rb, txn);
@@ -1090,7 +1090,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		{
 			ReorderBufferChange *cur_change;
 
-			if (cur_txn->serialized)
+			if (rbtxn_is_serialized(cur_txn))
 			{
 				/* serialize remaining changes */
 				ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1256,7 +1256,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * they originally were happening inside another subtxn, so we won't
 		 * ever recurse more than one level deep here.
 		 */
-		Assert(subtxn->is_known_as_subxact);
+		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
 		ReorderBufferCleanupTXN(rb, subtxn);
@@ -1304,7 +1304,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/*
 	 * Remove TXN from its containing list.
 	 *
-	 * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
 	 * from the LSN-ordered list of toplevel TXNs.
@@ -1319,7 +1319,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	Assert(found);
 
 	/* remove entries spilled to disk */
-	if (txn->serialized)
+	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
 	/* deallocate */
@@ -1336,7 +1336,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
 		return;
 
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1969,7 +1969,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 			 * final_lsn to that of their last change; this causes
 			 * ReorderBufferRestoreCleanup to do the right thing.
 			 */
-			if (txn->serialized && txn->final_lsn == 0)
+			if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
 			{
 				ReorderBufferChange *last =
 				dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2117,7 +2117,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
 	 * operate on its top-level transaction instead.
 	 */
 	txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 	Assert(txn->base_snapshot == NULL);
@@ -2296,7 +2296,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->has_catalog_changes = true;
+	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2313,7 +2313,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
 	if (txn == NULL)
 		return false;
 
-	return txn->has_catalog_changes;
+	return rbtxn_has_catalog_changes(txn);
 }
 
 /*
@@ -2333,7 +2333,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
 		return false;
 
 	/* a known subtxn? operate on top-level txn instead */
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 
@@ -2521,12 +2521,12 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	rb->spillBytes += size;
 
 	/* Don't consider already serialized transaction. */
-	rb->spillTxns += txn->serialized ? 0 : 1;
+	rb->spillTxns += rbtxn_is_serialized(txn) ? 0 : 1;
 
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
-	txn->serialized = true;
+	txn->txn_flags |= RBTXN_IS_SERIALIZED;
 
 	if (fd != -1)
 		CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5b4be2b..19c7bac 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -169,18 +169,34 @@ typedef struct ReorderBufferChange
 	dlist_node	node;
 } ReorderBufferChange;
 
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT          0x0002
+#define RBTXN_IS_SERIALIZED       0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) ((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn)    ((txn)->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk?  It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn)       ((txn)->txn_flags & RBTXN_IS_SERIALIZED)
+
 typedef struct ReorderBufferTXN
 {
+	int     txn_flags;
+
 	/*
 	 * The transactions transaction id, can be a toplevel or sub xid.
 	 */
 	TransactionId xid;
 
-	/* did the TX have catalog changes */
-	bool		has_catalog_changes;
-
 	/* Do we know this is a subxact?  Xid of top-level txn if so */
-	bool		is_known_as_subxact;
 	TransactionId toplevel_xid;
 
 	/*
@@ -249,15 +265,6 @@ typedef struct ReorderBufferTXN
 	uint64		nentries_mem;
 
 	/*
-	 * Has this transaction been spilled to disk?  It's not always possible to
-	 * deduce that fact by comparing nentries with nentries_mem, because e.g.
-	 * subtransactions of a large transaction might get serialized together
-	 * with the parent - if they're restored to memory they'd have
-	 * nentries_mem == nentries.
-	 */
-	bool		serialized;
-
-	/*
 	 * List of ReorderBufferChange structs, including new Snapshots and new
 	 * CommandIds
 	 */
-- 
1.8.3.1
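
The effect of 0005 is mechanical: the three ad-hoc booleans in
ReorderBufferTXN become bits in a single txn_flags word, tested through
accessor macros, e.g.:

	/* before: separate bool fields */
	txn->serialized = true;
	if (txn->is_known_as_subxact)
		...

	/* after: one flags word plus accessor macros */
	txn->txn_flags |= RBTXN_IS_SERIALIZED;
	if (rbtxn_is_known_subxact(txn))
		...

This keeps the struct compact and presumably leaves room for the
streaming-related flags added later in the series.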

Attachment: v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte.patch (application/octet-stream)
From 8b0ad280cae050bae5e6229cb3485967db919bb3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:32:34 +0530
Subject: [PATCH v4 06/19] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. The decoding logic on the receipt
of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 41 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 34 ++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  8 ++---
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++++--
 src/include/utils/snapmgr.h                     |  4 ++-
 6 files changed, 109 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index ace21ec..319349a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb3..a27eac0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1303,6 +1303,15 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		elog(ERROR, "improper heap_getnext call");
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1422,6 +1431,14 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "improper heap_fetch call");
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1535,6 +1552,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "improper heap_hot_search_buffer call");
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1683,6 +1708,14 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "improper heap_get_latest_tid call");
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5481,6 +5514,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "improper heap_hot_search call");
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d..201acfb 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,17 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +525,17 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +662,17 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, check whether it has aborted. If so,
+	 * error out.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f74c199..3e87597 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -683,7 +683,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1533,7 +1533,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1784,7 +1784,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1804,7 +1804,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 47b0517..9fa1e43 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This is to re-check XID status while accessing catalog.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether it aborted here; that happens during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 67b07df..9a8f9ce 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1
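
A side note on the repeated check: the concurrent-abort test above is
open-coded at each catalog access point. A minimal sketch of how it could
be consolidated into a single helper (the helper name is invented here;
the patch itself does not add it):

/*
 * Error out if the in-progress transaction being decoded (CheckXidAlive)
 * turns out to have aborted. If the xid is neither in progress nor
 * committed, it must have aborted, and any catalog contents read on its
 * behalf cannot be trusted.
 */
static inline void
CheckForConcurrentAbort(void)
{
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}

Calling this at the end of systable_getnext(), systable_recheck_tuple()
and systable_getnext_ordered() would replace the three identical copies
above.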

Attachment: v4-0007-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From 4bc6427ef3439096a34d12dabb28ba60d09045d7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:42:31 +0530
Subject: [PATCH v4 07/19] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke new stream API methods. This
happens in ReorderBufferStreamTXN() using about the same logic as
in ReorderBufferCommit() logic.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds a ReorderBufferTXN pointer in two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c     |   38 +-
 src/backend/replication/logical/reorderbuffer.c | 1075 ++++++++++++++++++++++-
 src/include/replication/reorderbuffer.h         |   32 +
 3 files changed, 1112 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 3e36467..cf10dd0 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3e87597..232d9f4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -149,6 +149,28 @@ typedef struct ReorderBufferIterTXNState
 	ReorderBufferIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
 } ReorderBufferIterTXNState;
 
+/*
+ * k-way in-order change iteration support structures
+ *
+ * This is a simplified version for streaming, which does not require
+ * serialization to files and only reads changes that are currently in
+ * memory.
+ */
+typedef struct ReorderBufferStreamIterTXNEntry
+{
+	XLogRecPtr	lsn;
+	ReorderBufferChange *change;
+	ReorderBufferTXN *txn;
+}			ReorderBufferStreamIterTXNEntry;
+
+typedef struct ReorderBufferStreamIterTXNState
+{
+	binaryheap *heap;
+	Size		nr_txns;
+	dlist_head	old_change;
+	ReorderBufferStreamIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
+}			ReorderBufferStreamIterTXNState;
+
 /* toast datastructures */
 typedef struct ReorderBufferToastEnt
 {
@@ -213,6 +235,20 @@ static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
 static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
 
+
+/* iterator for streaming (only get data from memory) */
+static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit(
+																		ReorderBuffer *rb,
+																		ReorderBufferTXN *txn);
+
+static ReorderBufferChange *ReorderBufferStreamIterTXNNext(
+							   ReorderBuffer *rb,
+							   ReorderBufferStreamIterTXNState * state);
+
+static void ReorderBufferStreamIterTXNFinish(
+								 ReorderBuffer *rb,
+								 ReorderBufferStreamIterTXNState * state);
+
 /*
  * ---------------------------------------
  * Disk serialization support functions
@@ -227,6 +263,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -235,6 +272,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -362,6 +408,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -759,6 +808,33 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+static void
+AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -855,6 +931,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -978,7 +1057,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1006,6 +1085,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	cur_txn_i;
 	int32		off;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(rb, txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1020,6 +1102,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(rb, cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1235,6 +1320,210 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 }
 
 /*
+ * Binary heap comparison function (streaming iterator).
+ */
+static int
+ReorderBufferStreamIterCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferStreamIterTXNState *state = (ReorderBufferStreamIterTXNState *) arg;
+	XLogRecPtr	pos_a = state->entries[DatumGetInt32(a)].lsn;
+	XLogRecPtr	pos_b = state->entries[DatumGetInt32(b)].lsn;
+
+	if (pos_a < pos_b)
+		return 1;
+	else if (pos_a == pos_b)
+		return 0;
+	return -1;
+}
+
+/*
+ * Allocate & initialize an iterator which iterates in lsn order over a
+ * transaction and all its subtransactions. This version is meant for
+ * streaming of incomplete transactions.
+ */
+static ReorderBufferStreamIterTXNState *
+ReorderBufferStreamIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Size		nr_txns = 0;
+	ReorderBufferStreamIterTXNState *state;
+	dlist_iter	cur_txn_i;
+	int32		off;
+
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(rb, txn);
+
+	/*
+	 * Calculate the size of our heap: one element for every transaction that
+	 * contains changes.  (Besides the transactions already in the reorder
+	 * buffer, we count the one we were directly passed.)
+	 */
+	if (txn->nentries > 0)
+		nr_txns++;
+
+	dlist_foreach(cur_txn_i, &txn->subtxns)
+	{
+		ReorderBufferTXN *cur_txn;
+
+		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(rb, cur_txn);
+
+		if (cur_txn->nentries > 0)
+			nr_txns++;
+	}
+
+	/*
+	 * TODO: Consider adding fastpath for the rather common nr_txns=1 case, no
+	 * need to allocate/build a heap then.
+	 */
+
+	/* allocate iteration state */
+	state = (ReorderBufferStreamIterTXNState *)
+		MemoryContextAllocZero(rb->context,
+							   sizeof(ReorderBufferStreamIterTXNState) +
+							   sizeof(ReorderBufferStreamIterTXNEntry) * nr_txns);
+
+	state->nr_txns = nr_txns;
+	dlist_init(&state->old_change);
+
+	/* allocate heap */
+	state->heap = binaryheap_allocate(state->nr_txns,
+									  ReorderBufferStreamIterCompare,
+									  state);
+
+	/*
+	 * Now insert items into the binary heap, in an unordered fashion.  (We
+	 * will run a heap assembly step at the end; this is more efficient.)
+	 */
+
+	off = 0;
+
+	/* add toplevel transaction if it contains changes */
+	if (txn->nentries > 0)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_head_element(ReorderBufferChange, node,
+										&txn->changes);
+
+		state->entries[off].lsn = cur_change->lsn;
+		state->entries[off].change = cur_change;
+		state->entries[off].txn = txn;
+
+		binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+	}
+
+	/* add subtransactions if they contain changes */
+	dlist_foreach(cur_txn_i, &txn->subtxns)
+	{
+		ReorderBufferTXN *cur_txn;
+
+		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+		if (cur_txn->nentries > 0)
+		{
+			ReorderBufferChange *cur_change;
+
+			cur_change = dlist_head_element(ReorderBufferChange, node,
+											&cur_txn->changes);
+
+			state->entries[off].lsn = cur_change->lsn;
+			state->entries[off].change = cur_change;
+			state->entries[off].txn = cur_txn;
+
+			binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+		}
+	}
+
+	Assert(off == nr_txns);
+
+	/* assemble a valid binary heap */
+	binaryheap_build(state->heap);
+
+	return state;
+}
+
+/*
+ * Return the next change when iterating over a transaction and its
+ * subtransactions.
+ *
+ * Returns NULL when no further changes exist.
+ */
+static ReorderBufferChange *
+ReorderBufferStreamIterTXNNext(ReorderBuffer *rb, ReorderBufferStreamIterTXNState * state)
+{
+	ReorderBufferChange *change;
+	ReorderBufferStreamIterTXNEntry *entry;
+	int32		off;
+
+	/* nothing there anymore */
+	if (state->heap->bh_size == 0)
+		return NULL;
+
+	off = DatumGetInt32(binaryheap_first(state->heap));
+	entry = &state->entries[off];
+
+	/* free memory we might have "leaked" in the previous *Next call */
+	if (!dlist_is_empty(&state->old_change))
+	{
+		change = dlist_container(ReorderBufferChange, node,
+								 dlist_pop_head_node(&state->old_change));
+		ReorderBufferReturnChange(rb, change);
+		Assert(dlist_is_empty(&state->old_change));
+	}
+
+	change = entry->change;
+
+	/*
+	 * update heap with information about which transaction has the next
+	 * relevant change in LSN order
+	 */
+
+	/* there are in-memory changes */
+	if (dlist_has_next(&entry->txn->changes, &entry->change->node))
+	{
+		dlist_node *next = dlist_next_node(&entry->txn->changes, &change->node);
+		ReorderBufferChange *next_change =
+		dlist_container(ReorderBufferChange, node, next);
+
+		/* txn stays the same */
+		state->entries[off].lsn = next_change->lsn;
+		state->entries[off].change = next_change;
+
+		binaryheap_replace_first(state->heap, Int32GetDatum(off));
+		return change;
+	}
+
+	/* ok, no changes there anymore, remove */
+	binaryheap_remove_first(state->heap);
+
+	return change;
+}
+
+/*
+ * Deallocate the iterator
+ */
+static void
+ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
+								 ReorderBufferStreamIterTXNState * state)
+{
+	/* free memory we might have "leaked" in the last *Next call */
+	if (!dlist_is_empty(&state->old_change))
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node,
+								 dlist_pop_head_node(&state->old_change));
+		ReorderBufferReturnChange(rb, change);
+		Assert(dlist_is_empty(&state->old_change));
+	}
+
+	binaryheap_free(state->heap);
+	pfree(state);
+}
+
+/*
  * Cleanup the contents of a transaction, usually after the transaction
  * committed or aborted.
  */
@@ -1327,33 +1616,104 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
  */
 static void
-ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	dlist_iter	iter;
-	HASHCTL		hash_ctl;
+	dlist_mutable_iter iter;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
-	hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
-	hash_ctl.hcxt = rb->context;
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
 
-	/*
-	 * create the hash with the exact number of to-be-stored tuplecids from
-	 * the start
-	 */
-	txn->tuplecid_hash =
-		hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
-					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
 
-	dlist_foreach(iter, &txn->tuplecids)
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
+ * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples whose CIDs we have not decoded yet. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
+ */
+static void
+ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_iter	iter;
+	HASHCTL		hash_ctl;
+
+	memset(&hash_ctl, 0, sizeof(hash_ctl));
+
+	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
+	hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
+	hash_ctl.hcxt = rb->context;
+
+	/*
+	 * create the hash with the exact number of to-be-stored tuplecids from
+	 * the start
+	 */
+	txn->tuplecid_hash =
+		hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
+					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	dlist_foreach(iter, &txn->tuplecids)
 	{
 		ReorderBufferTupleCidKey key;
 		ReorderBufferTupleCidEnt *ent;
@@ -1403,6 +1763,16 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 }
 
+static void
+ReorderBufferDestroyTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+}
+
 /*
  * Copy a provided snapshot so we can modify it privately. This is needed so
  * that catalog modifying transactions can look into intermediate catalog
@@ -1476,6 +1846,19 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 		SnapBuildSnapDecRefcount(snap);
 }
 
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
+
+	ReorderBufferStreamTXN(rb, txn);
+
+	rb->stream_commit(rb, txn, txn->final_lsn);
+
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
 /*
  * Perform the replay of a transaction and its non-aborted subtransactions.
  *
@@ -1515,6 +1898,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
 	 * If this transaction has no snapshot, it didn't make any changes to the
 	 * database, so there's nothing to decode.  Note that
 	 * ReorderBufferCommitChild will have transferred any snapshots from
@@ -1549,6 +1948,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
+		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
@@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			if (prev_lsn != InvalidXLogRecPtr)
+				Assert(prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1928,6 +2338,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2012,6 +2429,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2147,8 +2571,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2156,6 +2589,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2167,19 +2601,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2208,6 +2651,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2283,6 +2727,9 @@ ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	for (i = 0; i < txn->ninvalidations; i++)
 		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+
+	/* Invalidate current schema as well */
+	txn->is_schema_sent = false;
 }
 
 /*
@@ -2297,6 +2744,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * We read catalog changes from WAL that have not been sent downstream
+	 * yet, so invalidate the current schema so that the output plugin can
+	 * resend it.
+	 */
+	txn->is_schema_sent = false;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+	{
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+		txn->toptxn->is_schema_sent = false;
+	}
 }
 
 /*
@@ -2401,6 +2865,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (with streaming we don't update the
+ * memory accounting for subtransactions, so their size is always 0). But here
+ * we can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2420,15 +2916,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2721,6 +3248,498 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes left to
+ * stream (it may have been streamed just before the commit, which would
+ * then attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+	bool		using_subtxn;
+	Size		streamed = 0;
+	ReorderBufferStreamIterTXNState *volatile iterstate = NULL;
+
+	/*
+	 * If this is a subxact, we need to stream the top-level transaction
+	 * instead.
+	 */
+	if (txn->toptxn)
+	{
+		ReorderBufferStreamTXN(rb, txn->toptxn);
+		return;
+	}
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+
+			if (subtxn->base_snapshot != NULL &&
+				(txn->base_snapshot == NULL ||
+				 txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
+			{
+				txn->base_snapshot = subtxn->base_snapshot;
+				txn->base_snapshot_lsn = subtxn->base_snapshot_lsn;
+				subtxn->base_snapshot = NULL;
+				subtxn->base_snapshot_lsn = InvalidXLogRecPtr;
+			}
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have a snapshot from the previous streaming run.
+		 * We assume new subxacts can't move the LSN backwards, and so can't
+		 * beat the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+		 * information about subtransactions, which could arrive after streaming start.
+		 */
+		if (!txn->is_schema_sent)
+			snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+												 txn, command_id);
+	}
+
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
+	ReorderBufferBuildTupleCidHash(rb, txn);
+
+	/* setup the initial snapshot */
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
+
+	/*
+	 * Decoding needs access to syscaches et al., which in turn use
+	 * heavyweight locks and such. Thus we need to have enough state around to
+	 * keep track of those.  The easiest way is to simply use a transaction
+	 * internally.  That also allows us to easily enforce that nothing writes
+	 * to the database by checking for xid assignments.
+	 *
+	 * When we're called via the SQL SRF there's already a transaction
+	 * started, so start an explicit subtransaction there.
+	 */
+	using_subtxn = IsTransactionOrTransactionBlock();
+
+	PG_TRY();
+	{
+		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+		ReorderBufferChange *change;
+		ReorderBufferChange *specinsert = NULL;
+
+		if (using_subtxn)
+			BeginInternalSubTransaction("stream");
+		else
+			StartTransactionCommand();
+
+		/* start streaming this chunk of transaction */
+		rb->stream_start(rb, txn);
+
+		iterstate = ReorderBufferStreamIterTXNInit(rb, txn);
+		while ((change = ReorderBufferStreamIterTXNNext(rb, iterstate)) != NULL)
+		{
+			Relation	relation = NULL;
+			Oid			reloid;
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			if (prev_lsn != InvalidXLogRecPtr)
+				Assert(prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* we're going to stream this change */
+			streamed++;
+
+			switch (change->action)
+			{
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+
+					/*
+					 * Confirmation for speculative insertion arrived. Simply
+					 * use as a normal record. It'll be cleaned up at the end
+					 * of INSERT processing.
+					 */
+					Assert(specinsert->data.tp.oldtuple == NULL);
+					change = specinsert;
+					change->action = REORDER_BUFFER_CHANGE_INSERT;
+
+					/* intentionally fall through */
+				case REORDER_BUFFER_CHANGE_INSERT:
+				case REORDER_BUFFER_CHANGE_UPDATE:
+				case REORDER_BUFFER_CHANGE_DELETE:
+					Assert(snapshot_now);
+
+					reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
+												change->data.tp.relnode.relNode);
+
+					/*
+					 * Catalog tuple without data, emitted while catalog was
+					 * in the process of being rewritten.
+					 */
+					if (reloid == InvalidOid &&
+						change->data.tp.newtuple == NULL &&
+						change->data.tp.oldtuple == NULL)
+						goto change_done;
+					else if (reloid == InvalidOid)
+						elog(ERROR, "could not map filenode \"%s\" to relation OID",
+							 relpathperm(change->data.tp.relnode,
+										 MAIN_FORKNUM));
+
+					relation = RelationIdGetRelation(reloid);
+
+					if (relation == NULL)
+						elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
+							 reloid,
+							 relpathperm(change->data.tp.relnode,
+										 MAIN_FORKNUM));
+
+					if (!RelationIsLogicallyLogged(relation))
+						goto change_done;
+
+					/*
+					 * For now ignore sequence changes entirely. Most of the
+					 * time they don't log changes using records we
+					 * understand, so it doesn't make sense to handle the few
+					 * cases we do.
+					 */
+					if (relation->rd_rel->relkind == RELKIND_SEQUENCE)
+						goto change_done;
+
+					/* user-triggered change */
+					if (!IsToastRelation(relation))
+					{
+						ReorderBufferToastReplace(rb, txn, relation, change);
+						rb->stream_change(rb, txn, relation, change);
+
+						/*
+						 * Only clear reassembled toast chunks if we're sure
+						 * they're not required anymore. The creator of the
+						 * tuple tells us.
+						 */
+						if (change->data.tp.clear_toast_afterwards)
+							ReorderBufferToastReset(rb, txn);
+					}
+					/* we're not interested in toast deletions */
+					else if (change->action == REORDER_BUFFER_CHANGE_INSERT)
+					{
+						/*
+						 * Need to reassemble the full toasted Datum in
+						 * memory, to ensure the chunks don't get reused till
+						 * we're done, so remove it from the list of this
+						 * transaction's changes. Otherwise it will get
+						 * freed/reused while restoring spooled data from
+						 * disk.
+						 */
+						dlist_delete(&change->node);
+						ReorderBufferToastAppendChunk(rb, txn, relation,
+													  change);
+					}
+
+			change_done:
+
+					/*
+					 * Either speculative insertion was confirmed, or it was
+					 * unsuccessful and the record isn't needed anymore.
+					 */
+					if (specinsert != NULL)
+					{
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+
+					if (relation != NULL)
+					{
+						RelationClose(relation);
+						relation = NULL;
+					}
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+
+					/*
+					 * Speculative insertions are dealt with by delaying the
+					 * processing of the insert until the confirmation record
+					 * arrives. For that we simply unlink the record from the
+					 * chain, so it does not get freed/reused while restoring
+					 * spooled data from disk.
+					 *
+					 * This is safe in the face of concurrent catalog changes
+					 * because the relevant relation can't be changed between
+					 * speculative insertion and confirmation due to
+					 * CheckTableNotInUse() and locking.
+					 */
+
+					/* clear out a pending (and thus failed) speculation */
+					if (specinsert != NULL)
+					{
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+
+					/* and memorize the pending insertion */
+					dlist_delete(&change->node);
+					specinsert = change;
+					break;
+
+				case REORDER_BUFFER_CHANGE_TRUNCATE:
+					{
+						int			i;
+						int			nrelids = change->data.truncate.nrelids;
+						int			nrelations = 0;
+						Relation   *relations;
+
+						relations = palloc0(nrelids * sizeof(Relation));
+						for (i = 0; i < nrelids; i++)
+						{
+							Oid			relid = change->data.truncate.relids[i];
+							Relation	relation;
+
+							relation = RelationIdGetRelation(relid);
+
+							if (relation == NULL)
+								elog(ERROR, "could not open relation with OID %u", relid);
+
+							if (!RelationIsLogicallyLogged(relation))
+								continue;
+
+							relations[nrelations++] = relation;
+						}
+
+						rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+						for (i = 0; i < nrelations; i++)
+							RelationClose(relations[i]);
+
+						break;
+					}
+
+				case REORDER_BUFFER_CHANGE_MESSAGE:
+
+					rb->stream_message(rb, txn, change->lsn, true,
+									   change->data.msg.prefix,
+									   change->data.msg.message_size,
+									   change->data.msg.message);
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+					/* get rid of the old */
+					TeardownHistoricSnapshot(false);
+
+					if (snapshot_now->copied)
+					{
+						ReorderBufferFreeSnap(rb, snapshot_now);
+						snapshot_now =
+							ReorderBufferCopySnap(rb, change->data.snapshot,
+												  txn, command_id);
+					}
+
+					/*
+					 * Restored from disk, need to be careful not to double
+					 * free. We could introduce refcounting for that, but for
+					 * now this seems infrequent enough not to care.
+					 */
+					else if (change->data.snapshot->copied)
+					{
+						snapshot_now =
+							ReorderBufferCopySnap(rb, change->data.snapshot,
+												  txn, command_id);
+					}
+					else
+					{
+						snapshot_now = change->data.snapshot;
+					}
+
+					/*
+					 * TOCHECK: Snapshot changed, then invalidate current schema to reflect
+					 * possible catalog changes.
+					 */
+					txn->is_schema_sent = false;
+
+					/* and continue with the new one */
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+					Assert(change->data.command_id != InvalidCommandId);
+
+					if (command_id < change->data.command_id)
+					{
+						command_id = change->data.command_id;
+
+						if (!snapshot_now->copied)
+						{
+							/* we don't use the global one anymore */
+							snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+																 txn, command_id);
+						}
+
+						snapshot_now->curcid = command_id;
+
+						TeardownHistoricSnapshot(false);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
+					}
+
+					break;
+
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					txn->is_schema_sent = false;
+					break;
+
+				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+					elog(ERROR, "tuplecid value in changequeue");
+					break;
+			}
+		}
+
+		/*
+		 * There's a speculative insertion remaining; just clean it up. It
+		 * can't have been successful, otherwise we'd have gotten a
+		 * confirmation record.
+		 */
+		if (specinsert)
+		{
+			ReorderBufferReturnChange(rb, specinsert);
+			specinsert = NULL;
+		}
+
+		/* clean up the iterator */
+		ReorderBufferStreamIterTXNFinish(rb, iterstate);
+		iterstate = NULL;
+
+		/* call stream_stop callback */
+		rb->stream_stop(rb, txn);
+
+		/* this is just a sanity check against bad output plugin behaviour */
+		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
+			elog(ERROR, "output plugin used XID %u",
+				 GetCurrentTransactionId());
+
+		/* remember the command ID and snapshot for the streaming run */
+		txn->command_id = command_id;
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+
+		/* cleanup */
+		TeardownHistoricSnapshot(false);
+
+		/*
+		 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+		 * any memory. We could also keep the hash table and update it with
+		 * new ctid values, but this seems simpler and good enough for now.
+		 */
+		ReorderBufferDestroyTupleCidHash(rb, txn);
+
+		/*
+		 * Aborting the current (sub-)transaction as a whole has the right
+		 * semantics. We want all locks acquired in here to be released, not
+		 * reassigned to the parent, and we do not want any database access
+		 * to have persistent effects.
+		 */
+		AbortCurrentTransaction();
+
+		/* make sure there's no cache pollution */
+		ReorderBufferExecuteInvalidations(rb, txn);
+
+		if (using_subtxn)
+			RollbackAndReleaseCurrentSubTransaction();
+	}
+	PG_CATCH();
+	{
+		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+		if (iterstate)
+			ReorderBufferStreamIterTXNFinish(rb, iterstate);
+
+		TeardownHistoricSnapshot(true);
+
+		/*
+		 * Force cache invalidation to happen outside of a valid transaction
+		 * to prevent catalog access as we just caught an error.
+		 */
+		AbortCurrentTransaction();
+
+		/* make sure there's no cache pollution */
+		ReorderBufferExecuteInvalidations(rb, txn);
+
+		if (using_subtxn)
+			RollbackAndReleaseCurrentSubTransaction();
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/*
+	 * Discard the changes that we just streamed, and mark the transactions
+	 * as streamed (if they contained changes).
+	 */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 19c7bac..7d08e2f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* does the txn have catalog changes */
 #define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -187,6 +188,20 @@ typedef struct ReorderBufferChange
  */
 #define rbtxn_is_serialized(txn)       (txn->txn_flags & RBTXN_IS_SERIALIZED)
 
+/*
+ * Has this transaction been streamed to downstream? Similarly to spilling
+ * to disk, it's not trivial to deduce this from nentries and nentries_mem,
+ * for various reasons. For example, all changes may be in subtransactions
+ * in which case we'd have nentries==0 for the toplevel one, and it'd say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.
+ *
+ * Note: We never stream and serialize a transaction at the same time (we
+ * only spill to disk when streaming is not supported by the plugin),
+ * so only one of those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn)         (txn->txn_flags & RBTXN_IS_STREAMED)
+
 typedef struct ReorderBufferTXN
 {
 	int     txn_flags;
@@ -222,6 +237,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Do we need to send schema for this transaction in output plugin?
+	 */
+	bool		is_schema_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -252,6 +277,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
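
To make the eviction policy in ReorderBufferCheckMemoryLimit() easier to
follow: with a streaming-capable output plugin we pick the largest toplevel
transaction (ReorderBufferChangeMemoryUpdate rolls subtransaction sizes up
into the toplevel), otherwise the largest (sub)transaction overall. A
standalone sketch of that selection, using stand-in structs rather than the
real ReorderBufferTXN:

#include <stdio.h>
#include <stddef.h>

/* Stand-in for ReorderBufferTXN; just the fields the policy needs. */
typedef struct TxnStub
{
	unsigned int xid;
	size_t		size;			/* accounted size, subxacts rolled up */
} TxnStub;

/* Mirrors ReorderBufferLargestTopTXN(): scan toplevel txns, keep the biggest. */
static TxnStub *
largest_top_txn(TxnStub *txns, int ntxns)
{
	TxnStub    *largest = NULL;

	for (int i = 0; i < ntxns; i++)
	{
		if (largest == NULL || txns[i].size > largest->size)
			largest = &txns[i];
	}
	return largest;
}

int
main(void)
{
	TxnStub		toplevel[] = {{1001, 10240}, {1002, 921600}, {1003, 64}};
	TxnStub    *victim = largest_top_txn(toplevel, 3);

	/* with streaming, this is the txn handed to ReorderBufferStreamTXN() */
	printf("evict xid %u (%zu bytes) by streaming\n", victim->xid, victim->size);
	return 0;
}

Either way the victim ends up with size 0, which is what the asserts at the
end of ReorderBufferCheckMemoryLimit() verify, so repeated invocations keep
making progress toward the memory limit.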

Attachment: v4-0008-fixup-add-is_schema_sent-back.patch (application/octet-stream)
From dcb320769da1aa3c4bd7c80c3dcee22a0fb3ed33 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 27 Dec 2019 23:02:38 +0100
Subject: [PATCH v4 08/19] fixup: add is_schema_sent back

---
 src/backend/replication/logical/reorderbuffer.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 232d9f4..78b5c00 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2229,6 +2229,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * about the message itself?
 					 */
 					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					txn->is_schema_sent = false;
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-- 
1.8.3.1

Attachment: v4-0009-fixup-get-rid-of-is_schema_sent-entirely.patch (application/octet-stream)
From 1f6b7447ac3d80fc47a603d106b2d26bfc46c710 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 27 Dec 2019 23:46:15 +0100
Subject: [PATCH v4 09/19] fixup: get rid of is_schema_sent entirely

We'll do this in the pgoutput.c code directly, not in reorderbuffer.
---
 src/backend/replication/logical/reorderbuffer.c | 26 ++-----------------------
 src/include/replication/reorderbuffer.h         |  5 -----
 2 files changed, 2 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 78b5c00..dda651e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2229,7 +2229,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * about the message itself?
 					 */
 					LocalExecuteInvalidationMessage(&change->data.inval.msg);
-					txn->is_schema_sent = false;
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
@@ -2728,9 +2727,6 @@ ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	for (i = 0; i < txn->ninvalidations; i++)
 		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
-
-	/* Invalidate current schema as well */
-	txn->is_schema_sent = false;
 }
 
 /*
@@ -2747,21 +2743,11 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 
 	/*
-	 * We read catalog changes from WAL that have not been sent downstream
-	 * yet, so invalidate the current schema so that the output plugin can
-	 * resend it.
-	 */
-	txn->is_schema_sent = false;
-
-	/*
 	 * TOCHECK: Mark toplevel transaction as having catalog changes too
 	 * if one of its children has.
 	 */
 	if (txn->toptxn != NULL)
-	{
 		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
-		txn->toptxn->is_schema_sent = false;
-	}
 }
 
 /*
@@ -3344,9 +3330,8 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
 		 * information about subtransactions, which could arrive after streaming start.
 		 */
-		if (!txn->is_schema_sent)
-			snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
-												 txn, command_id);
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
 	}
 
 	/*
@@ -3601,12 +3586,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 						snapshot_now = change->data.snapshot;
 					}
 
-					/*
-					 * TOCHECK: Snapshot changed, then invalidate current schema to reflect
-					 * possible catalog changes.
-					 */
-					txn->is_schema_sent = false;
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
 										  txn->xid);
@@ -3645,7 +3624,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 					 * about the message itself?
 					 */
 					LocalExecuteInvalidationMessage(&change->data.inval.msg);
-					txn->is_schema_sent = false;
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 7d08e2f..e2b8db0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -237,11 +237,6 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
-	 * Do we need to send schema for this transaction in output plugin?
-	 */
-	bool		is_schema_sent;
-
-	/*
 	 * Toplevel transaction for this subxact (NULL for top-level).
 	 */
 	struct ReorderBufferTXN *toptxn;
-- 
1.8.3.1
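
Moving the schema tracking into pgoutput fits naturally, since pgoutput
already keeps per-relation output state in its relation sync cache. A
simplified sketch of the shape involved (the real RelationSyncEntry in
src/backend/replication/pgoutput/pgoutput.c has more fields, and the reset
function below is only illustrative):

/* Simplified per-relation cache entry in pgoutput. */
typedef struct RelationSyncEntry
{
	Oid			relid;			/* hash key: relation OID */
	bool		schema_sent;	/* schema already sent downstream? */
	bool		replicate_valid;	/* cached publication info still valid? */
} RelationSyncEntry;

/* On invalidation, force the schema to be re-sent with the next change. */
static void
rel_sync_entry_reset(RelationSyncEntry *entry)
{
	entry->schema_sent = false;
}

This keeps the reorderbuffer oblivious to what the plugin has or has not
sent, which is really the plugin's business anyway.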

Attachment: v4-0010-Support-logical_decoding_work_mem-set-from-create.patch (application/octet-stream)
From 218d6f72deea3db0109db556086d73d0092cf9e7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:51:04 +0530
Subject: [PATCH v4 10/19] Support logical_decoding_work_mem set from create
 subscription command

---
 doc/src/sgml/config.sgml                           | 21 +++++++++++
 doc/src/sgml/ref/create_subscription.sgml          | 12 ++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/commands/subscriptioncmds.c            | 44 +++++++++++++++++++---
 .../libpqwalreceiver/libpqwalreceiver.c            |  3 ++
 src/backend/replication/logical/worker.c           |  1 +
 src/backend/replication/pgoutput/pgoutput.c        | 30 ++++++++++++++-
 src/include/catalog/pg_subscription.h              |  3 ++
 src/include/replication/walreceiver.h              |  1 +
 9 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c902..8b1923c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..91790b0 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 68d88ff..2a27648 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5408edc..fbb4473 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +690,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +721,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +778,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +815,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 545d2fc..0ab6855 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 63ba0ae..c80acd3 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1729,6 +1729,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 3483c1b..cf6e03b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -18,6 +18,7 @@
 #include "replication/logicalproto.h"
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
+#include "utils/guc.h"
 #include "utils/int8.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
@@ -87,11 +88,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +139,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -171,7 +196,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3cb13d8..10ea113 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 41714ea..1db706a 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -162,6 +162,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
1.8.3.1
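
As a side note, the option string this patch teaches libpqwalreceiver to
build is plain text appended to START_REPLICATION. A minimal sketch of the
same assembly, using snprintf in place of the server's StringInfo; the
publication name "mypub" and the buffer size are made up for illustration:

#include <stdio.h>

int
main(void)
{
	char		cmd[256];
	int			off = 0;
	unsigned	proto_version = 1;
	int			work_mem = 65536;	/* in kB */

	off += snprintf(cmd + off, sizeof(cmd) - off,
					"proto_version '%u'", proto_version);

	/* the -1 "not set" case is only introduced by the next patch */
	if (work_mem != -1)
		off += snprintf(cmd + off, sizeof(cmd) - off,
						", work_mem '%d'", work_mem);

	off += snprintf(cmd + off, sizeof(cmd) - off,
					", publication_names '\"mypub\"'");

	/* prints: proto_version '1', work_mem '65536', publication_names '"mypub"' */
	puts(cmd);
	return 0;
}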

Attachment: v4-0011-Add-support-for-streaming-to-built-in-replication.patch
From 7e4f8b8f47289299f5b856f0e77925d4b8cbb166 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 27 Dec 2019 23:05:20 +0100
Subject: [PATCH v4 11/19] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, to identify in-progress
transactions, and allow adding additional bits of information (e.g.
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming of in-progress
transactions during replication slot creation, even if the plugin
supports it. We don't need to replicate the changes accumulated
during this phase, and moreover we don't have a replication
connection open, so there is nowhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    5 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   60 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    8 +-
 src/backend/replication/logical/launcher.c         |    2 +
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  157 ++-
 src/backend/replication/logical/worker.c           | 1031 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  263 ++++-
 src/backend/replication/slotfuncs.c                |    7 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2027 insertions(+), 42 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 6dfb2e4..e1fb907 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 91790b0..d9abf5e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -218,6 +218,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 2a27648..15a6f5a 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->workmem = subform->subworkmem;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index fbb4473..b2b93d6 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
 						   bool *refresh, int *logical_wm,
-						   bool *logical_wm_given)
+						   bool *logical_wm_given, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -92,6 +93,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*refresh = true;
 	if (logical_wm)
 		*logical_wm_given = false;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -186,6 +189,26 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 
 			*logical_wm_given = true;
 			*logical_wm = defGetInt32(defel);
+
+			/*
+			 * Check that the value is not smaller than 64kB (which is
+			 * the minimum value for logical_decoding_work_mem).
+			 */
+			if (*logical_wm < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("%d is outside the valid range for parameter \"work_mem\" (64 .. 2147483647)",
+								*logical_wm)));
+		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
 		}
 		else
 			ereport(ERROR,
@@ -332,6 +355,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	int			logical_wm;
 	bool		logical_wm_given;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -348,7 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL, &logical_wm, &logical_wm_given);
+							   NULL, &logical_wm, &logical_wm_given,
+							   &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -430,7 +456,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 		values[Anum_pg_subscription_subworkmem - 1] =
 			Int32GetDatum(logical_wm);
 	else
-		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(-1);
+
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
 
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
@@ -692,11 +726,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				int			logical_wm;
 				bool		logical_wm_given;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
 										   NULL, &synchronous_commit, NULL,
-										   &logical_wm, &logical_wm_given);
+										   &logical_wm, &logical_wm_given,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -728,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subworkmem - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -740,7 +784,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
 										   NULL, NULL, NULL, NULL, NULL,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -778,7 +822,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh, NULL, NULL);
+										   NULL, &refresh, NULL, NULL,
+										   NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -815,7 +860,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7410b2f..a479ce9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4102,6 +4102,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 0ab6855..9970170 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,8 +406,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
-		appendStringInfo(&cmd, ", work_mem '%d'",
-						 options->proto.logical.work_mem);
+		if (options->proto.logical.work_mem != -1)
+			appendStringInfo(&cmd, ", work_mem '%d'",
+							 options->proto.logical.work_mem);
+
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
 
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index c57b578..0a013ed 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>
 
 #include "postgres.h"
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 57edf54..30a6ee4 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1149,7 +1149,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index e7df47d..5a379fb 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,7 +139,8 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
@@ -147,6 +148,10 @@ logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -182,8 +187,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -191,6 +196,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -252,13 +261,18 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
+	pq_sendbyte(out, 'D');		/* action DELETE */
+
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
-	pq_sendbyte(out, 'D');		/* action DELETE */
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
 
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
@@ -300,6 +314,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -309,6 +324,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -351,12 +370,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -401,7 +424,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -409,6 +432,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -689,3 +716,119 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgint(in, 4) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're in the middle of streaming, so must be valid) */
+	pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+	TransactionId xid;
+
+	xid = pq_getmsgint(in, 4);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (the transaction was streamed, so it must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID (the transaction was streamed, so it must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
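
The new messages share one framing convention: a single action byte, then
big-endian 32-bit integers. A self-contained round-trip sketch for STREAM
START, with manual byte packing standing in for pq_sendint32/pq_getmsgint:

#include <stdint.h>
#include <stdio.h>

/* append a big-endian uint32, as the wire protocol does */
static void
put_u32(unsigned char *buf, uint32_t v)
{
	buf[0] = v >> 24;
	buf[1] = v >> 16;
	buf[2] = v >> 8;
	buf[3] = v;
}

static uint32_t
get_u32(const unsigned char *buf)
{
	return ((uint32_t) buf[0] << 24) | ((uint32_t) buf[1] << 16) |
		   ((uint32_t) buf[2] << 8) | buf[3];
}

int
main(void)
{
	unsigned char msg[9];
	uint32_t	xid = 724;

	/* build STREAM START: action byte, xid, first-segment flag */
	msg[0] = 'S';
	put_u32(msg + 1, xid);
	put_u32(msg + 5, 1);		/* 1 = first streamed segment for this xid */

	/* decode it again, as logicalrep_read_stream_start would */
	printf("action=%c xid=%u first=%u\n",
		   msg[0], get_u32(msg + 1), get_u32(msg + 5));
	return 0;
}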
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c80acd3..cf053e9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * also requires handling aborts of both the toplevel transaction and of
+ * individual subtransactions. This is achieved by tracking offsets of
+ * subtransactions, which are then used to truncate the file with the
+ * serialized changes.
+ *
+ * The files are placed in the temporary-file directory of the default
+ * tablespace, and the filenames include both the XID of the toplevel
+ * transaction and the OID of the subscription. This
+ * is necessary so that different workers processing a remote transaction
+ * with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -60,6 +82,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -67,6 +90,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -106,12 +130,59 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -163,6 +234,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -529,6 +636,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the subxact info serialized
+	 * by the previous stream_stop.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for one of the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* FIXME optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/* We should not receive aborts for unknown subtransactions. */
+		Assert(found);
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
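
The abort handling boils down to "remember where each subxact started in the
spool file, and truncate back to that offset on abort". A self-contained
sketch of that idea; the filename and the fputs records standing in for
serialized changes are made up:

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct
{
	unsigned	xid;
	off_t		offset;			/* where the subxact's changes start */
} SubXactInfo;

int
main(void)
{
	const char *path = "stream-demo.changes";
	FILE	   *f = fopen(path, "w");
	SubXactInfo subxacts[8];
	int			nsubxacts = 0;

	if (!f)
		return 1;

	/* a toplevel change, then a subxact whose start offset we remember */
	fputs("change-1 (toplevel)\n", f);
	subxacts[nsubxacts].xid = 1000;
	subxacts[nsubxacts].offset = ftello(f);
	nsubxacts++;
	fputs("change-2 (subxact 1000)\n", f);
	fclose(f);

	/* subxact 1000 aborts: scan from the tail, truncate at its offset */
	for (int i = nsubxacts; i > 0; i--)
	{
		if (subxacts[i - 1].xid == 1000)
		{
			truncate(path, subxacts[i - 1].offset);
			nsubxacts = i - 1;	/* discard it and everything after it */
			break;
		}
	}

	return 0;
}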
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the handle apply_dispatch methods are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
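
The replay loop reads length-prefixed records until EOF. A standalone sketch
of the same on-disk format, with tmpfile() in place of the per-transaction
spool file and printf standing in for apply_dispatch:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
	FILE	   *f = tmpfile();
	const char *changes[] = {"INSERT ...", "UPDATE ..."};
	char		buf[64];
	int			len;
	int			nchanges = 0;

	/* spool phase: each record is a length followed by the payload */
	for (int i = 0; i < 2; i++)
	{
		len = (int) strlen(changes[i]) + 1;
		fwrite(&len, sizeof(len), 1, f);
		fwrite(changes[i], len, 1, f);
	}

	/* replay phase: read records until end of file and "apply" them */
	rewind(f);
	while (fread(&len, sizeof(len), 1, f) == 1)
	{
		if (fread(buf, len, 1, f) != 1)
		{
			fprintf(stderr, "could not read spool file\n");
			exit(1);
		}
		printf("applying: %s\n", buf);
		nchanges++;
	}
	printf("replayed %d changes\n", nchanges);
	fclose(f);
	return 0;
}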
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -541,6 +960,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -556,6 +978,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -591,6 +1016,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -694,6 +1122,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -814,6 +1245,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -913,6 +1347,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1004,6 +1441,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1101,6 +1554,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+}
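
Outside the server, the same exit-time cleanup pattern can be sketched with
atexit; the filename pattern is made up, and the point is just walking the
XID array backwards and removing the files best-effort:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static unsigned xids[8] = {1000, 1001};
static int	nxids = 2;

/* analogous to worker_onexit: remove spool files for every streamed xid */
static void
cleanup_files(void)
{
	char		path[128];

	for (int i = nxids - 1; i >= 0; i--)
	{
		snprintf(path, sizeof(path), "logical-demo-%u.changes", xids[i]);
		unlink(path);			/* best-effort removal on exit */
	}
}

int
main(void)
{
	atexit(cleanup_files);
	/* ... a real worker would stream and spool changes here ... */
	return 0;
}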
+
+/*
  * Apply main loop.
  */
 static void
@@ -1116,6 +1585,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1564,6 +2036,564 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we do free the memory allocated for the subxact info. There might
+	 * be one exceptional transaction with many subxacts, and we don't want
+	 * to keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
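
The checksum scheme is simple: the CRC covers the subxact count followed by
the array, and the reader recomputes it over the same bytes and compares. A
self-contained sketch with a bitwise CRC-32C (the same Castagnoli polynomial,
though not PostgreSQL's INIT/COMP/FIN macros):

#include <stdint.h>
#include <stdio.h>

/* bitwise CRC-32C; chained calls equal the CRC of the concatenation */
static uint32_t
crc32c(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	crc = ~crc;
	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78 & (0U - (crc & 1)));
	}
	return ~crc;
}

typedef struct
{
	unsigned	xid;
	long		offset;
} SubXactInfo;

int
main(void)
{
	SubXactInfo subxacts[2] = {{1000, 0}, {1001, 128}};
	uint32_t	nsubxacts = 2;
	uint32_t	checksum;
	uint32_t	checksum_new;

	/* writer: checksum covers the count and then the array */
	checksum = crc32c(0, &nsubxacts, sizeof(nsubxacts));
	checksum = crc32c(checksum, subxacts, sizeof(subxacts));

	/* reader: recompute over the same bytes and compare */
	checksum_new = crc32c(0, &nsubxacts, sizeof(nsubxacts));
	checksum_new = crc32c(checksum_new, subxacts, sizeof(subxacts));

	puts(checksum == checksum_new ? "checksum OK" : "checksum failure");
	return 0;
}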
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen
+	 * in the last call, so we can simply ignore it (its offset was recorded
+	 * when its first change arrived).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for one of the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
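
The append path combines a last-seen cache, a tail-first duplicate scan, and
growth by doubling. A minimal standalone sketch of just that logic, with
plain unsigned XIDs and no offsets:

#include <stdio.h>
#include <stdlib.h>

static unsigned *subxact_xids = NULL;
static int	nsubxacts = 0;
static int	nsubxacts_max = 0;
static unsigned subxact_last = 0;	/* 0 stands in for InvalidTransactionId */

static void
subxact_add(unsigned xid)
{
	/* fast path: same subxact as the previous change */
	if (xid == subxact_last)
		return;
	subxact_last = xid;

	/* scan from the tail; a recent subxact is the likely match */
	for (int i = nsubxacts; i > 0; i--)
	{
		if (subxact_xids[i - 1] == xid)
			return;
	}

	/* a new subxact: grow the array by doubling when full */
	if (nsubxacts == nsubxacts_max)
	{
		nsubxacts_max = (nsubxacts_max == 0) ? 128 : nsubxacts_max * 2;
		subxact_xids = realloc(subxact_xids,
							   nsubxacts_max * sizeof(unsigned));
	}
	subxact_xids[nsubxacts++] = xid;
}

int
main(void)
{
	subxact_add(1000);
	subxact_add(1000);			/* ignored thanks to subxact_last */
	subxact_add(1001);
	printf("%d distinct subxacts\n", nsubxacts);	/* prints 2 */
	return 0;
}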
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Remove the XID from the array - find the XID in the array and
+	 * remove it by moving the last element into its place. The array is
+	 * bound to be fairly small (the maximum number of in-progress xacts,
+	 * i.e. max_connections + max_prepared_transactions), so simply loop
+	 * through the array and find the index of the XID.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length field itself), action code (identifying the message type) and
+ * message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
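+/*
+ * For reference, a reader of this format might look roughly like the
+ * following (a minimal sketch, not part of this patch; error handling
+ * elided). The length covers the action byte but not the length field
+ * itself, and the payload does not include the subxact XID:
+ *
+ *     int		len;
+ *     char	action;
+ *
+ *     read(stream_fd, &len, sizeof(len));
+ *     read(stream_fd, &action, sizeof(action));
+ *
+ *     len -= sizeof(char);
+ *     resetStringInfo(s);
+ *     enlargeStringInfo(s, len);
+ *     read(stream_fd, s->data, len);
+ *     s->data[len] = '\0';
+ *     s->len = len;
+ */
+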
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -1730,6 +2760,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.work_mem = MySubscription->workmem;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index cf6e03b..8490ea4 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -45,16 +45,42 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order the transactions are sent in. So streamed transactions are handled
+ * separately, using the schema_sent flag in ReorderBufferTXN.
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -64,6 +90,7 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
@@ -84,16 +111,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, int *logical_decoding_work_mem)
+						List **publication_names, int *logical_decoding_work_mem,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		work_mem_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -162,6 +199,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*logical_decoding_work_mem = (int)parsed;
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
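
With this in place, a client can request streaming in the option list of
START_REPLICATION, along the lines of (slot and publication names made up):

    START_REPLICATION SLOT "sub1" LOGICAL 0/0
        (proto_version '2', publication_names '"mypub"', streaming 'on')

The apply worker requests the same thing implicitly, by setting the
streaming flag in WalRcvStreamOptions (see the worker changes earlier in
the patch).
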
@@ -174,6 +228,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -197,7 +252,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&logical_decoding_work_mem);
+								&logical_decoding_work_mem,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -217,6 +273,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficiently new protocol version, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -284,9 +361,42 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent the
+	 * toplevel XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may only be applied later (if at all), and
+	 * in an order we don't know at this point; the regular transactions
+	 * won't see their effects until then.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change, and
+		 * such a change may occur when streaming has already started, so we
+		 * have to track new catalog changes somehow.
+		 */
+		schema_sent = txn->is_schema_sent;
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -312,19 +422,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			txn->is_schema_sent = true;
+		else
+			relentry->schema_sent = true;
 	}
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -333,6 +450,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -361,14 +482,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -378,7 +499,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -387,7 +508,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -413,6 +534,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -433,13 +558,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -513,6 +639,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
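+
+/*
+ * So for a single large transaction, the downstream sees a sequence
+ * roughly like this (schematic only, message framing omitted):
+ *
+ *     stream_start (xid, first_segment = true)
+ *         ... schema messages and insert/update/delete changes ...
+ *     stream_stop
+ *     stream_start (xid)
+ *         ... more changes ...
+ *     stream_stop
+ *     stream_commit (xid)    or    stream_abort (xid, subxid)
+ *
+ * with each start/stop pair corresponding to one chunk of changes
+ * evicted from the reorder buffer.
+ */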
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -623,6 +834,34 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index ba08ad4..8eb3160 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,13 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1f23665..2e0743a 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -969,6 +969,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 10ea113..8793676 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -50,6 +50,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	int32		subworkmem;		/* Memory to use to decode changes. */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -76,6 +78,7 @@ typedef struct Subscription
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	int			workmem;		/* Memory to decode changes. */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f2e873d..c522703 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -945,7 +945,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 3fc430a..bf02cbc 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out, TransactionId xid);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1db706a..3d19b5d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -163,6 +163,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			int			work_mem;	/* Memory limit to use for decoding */
+			bool		streaming;	/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
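+
+# (A quick sanity check that this really exceeds the limit: each row
+# carries a 32-character md5 value, so the ~5000 inserts alone amount to
+# a few hundred kB of decoded changes, comfortably above the 64kB
+# threshold, even before counting the updates and deletes.)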
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes were not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back changes were not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: v4-0012-fixup-add-proper-schema-tracking.patch (application/octet-stream)
From 13f056ac3452588e6f8e3e832f222f3f6416857f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 27 Dec 2019 23:56:04 +0100
Subject: [PATCH v4 12/19] fixup: add proper schema tracking

---
 src/backend/replication/pgoutput/pgoutput.c | 45 +++++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 8490ea4..0148f4c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -82,6 +82,8 @@ typedef struct RelationSyncEntry
 	Oid			relid;			/* relation oid */
 	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
+	List	   *streamed_txns;	/* streamed toplevel transactions with
+								 * this schema */
 	bool		replicate_valid;
 	PublicationActions pubactions;
 } RelationSyncEntry;
@@ -96,6 +98,11 @@ static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -366,6 +373,7 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 {
 	bool	schema_sent;
 	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
 
 	/*
 	 * Remember XID of the (sub)transaction for the change. We don't care if
@@ -378,6 +386,11 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 	if (in_streaming)
 		xid = change->txn->xid;
 
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
 	/*
 	 * Do we need to send the schema? We do track streamed transactions
 	 * separately, because those may not be applied later (and the regular
@@ -391,7 +404,7 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		 * occur when streaming already started, so we have to track new catalog
 		 * changes somehow.
 		 */
-		schema_sent = txn->is_schema_sent;
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
 	}
 	else
 		schema_sent = relentry->schema_sent;
@@ -432,7 +445,7 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->xid = change->txn->xid;
 
 		if (in_streaming)
-			txn->is_schema_sent = true;
+			set_schema_sent_in_streamed_txn(relentry, topxid);
 		else
 			relentry->schema_sent = true;
 	}
@@ -760,6 +773,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  */
 static RelationSyncEntry *
-- 
1.8.3.1

Attachment: v4-0013-Track-statistics-for-streaming.patch (application/octet-stream)
From 88f482bd96c5ffbdc30e1dc76c79ddabd4ecd554 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 2 Dec 2019 09:58:50 +0530
Subject: [PATCH v4 13/19] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dcb5811..180ea88 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1996,6 +1996,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber
+      after the memory used by logical decoding exceeds
+      <literal>logical_decoding_work_mem</literal>.  Streaming only works
+      with toplevel transactions (subtransactions can't be streamed
+      independently), so the counter does not get incremented for
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the subscriber.
+      Transactions may get streamed repeatedly, and this counter gets incremented
+      on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
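
Once this is in, the new counters are easy to watch from SQL, e.g.
(hypothetical session):

    SELECT application_name, stream_txns, stream_count,
           pg_size_pretty(stream_bytes) AS stream_bytes
      FROM pg_stat_replication;

stream_count and stream_bytes should keep growing while a large
transaction is being decoded, even before it commits.
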
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f7800f0..5897611 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -779,7 +779,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index dda651e..1949dc4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -358,6 +358,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3709,6 +3713,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	PG_END_TRY();
 
 	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count a transaction that has already been streamed before. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
+	/*
 	 * Discard the changes that we just streamed, and mark the transactions
 	 * as streamed (if they contained changes).
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2e0743a..9f93f11 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1293,7 +1293,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1314,7 +1314,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or streamed to
+	 * subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2357,6 +2358,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3185,7 +3189,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3242,6 +3246,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3265,6 +3272,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3351,6 +3361,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3598,12 +3613,19 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillTxns = rb->spillTxns;
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
+
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ac8f64b..3b897a5 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5166,9 +5166,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e2b8db0..e132c3c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -506,15 +506,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index a6b3205..7efc332 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64           streamTxns;
+	int64           streamCount;
+	int64           streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80a0782..5ab21a8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1955,9 +1955,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1

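A quick note on observability: once this patch is applied, the new counters
show up in pg_stat_replication right next to the existing spill_* columns,
and can be eyeballed with something like the query below (illustrative only;
the pg_size_pretty calls are cosmetic and not part of the patch):

    SELECT application_name,
           spill_txns, spill_count, pg_size_pretty(spill_bytes) AS spilled,
           stream_txns, stream_count, pg_size_pretty(stream_bytes) AS streamed
      FROM pg_stat_replication;

As with the spill counters, the values are only copied into shared memory
from WalSndUpdateProgress(), so they can lag slightly behind what the reorder
buffer has actually done.
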
v4-0014-Enable-streaming-for-all-subscription-TAP-tests.patch
From a3d35c1aeb5b8dc4d2f810958966f5bc689bfd06 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v4 14/19] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 77a1560..8cd1993 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -65,7 +65,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 81547f6..8dfeafc 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

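The change is mechanical: every test now pushes its workload through a
subscription of the following shape (connection string elided), so the
streaming code paths get exercised by the ordinary workloads too, not just
by the dedicated stream tests:

    CREATE SUBSCRIPTION tap_sub
        CONNECTION '...'
        PUBLICATION tap_pub
        WITH (streaming = on);
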
v4-0015-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
From 441194d7823933ca81bf8cfb86d4780e91e166eb Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH v4 15/19] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1949dc4..a3e2c06 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1544,6 +1544,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
+		/* make sure subtxn has final_lsn */
+		if (subtxn->final_lsn == InvalidXLogRecPtr)
+			subtxn->final_lsn = txn->final_lsn;
+
 		/*
 		 * Subtransactions are always associated to the toplevel TXN, even if
 		 * they originally were happening inside another subtxn, so we won't
-- 
1.8.3.1

v4-0016-Add-TAP-test-for-streaming-vs.-DDL.patch
From 403e8a2caee4f6ec7be98ac3c0ac8b60fb5af952 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v4 16/19] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of a large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

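With logical_decoding_work_mem set to 64kB as in this test, the transaction
comfortably exceeds the limit, so streaming has to kick in for the test to
prove anything. A cheap way to double-check that outside the TAP framework
(using the counters from the stream statistics patch earlier in the series)
would be:

    SELECT stream_txns, stream_count, stream_bytes
      FROM pg_stat_replication
     WHERE application_name = 'tap_sub';

If stream_count stays at zero, the changes were sent only at commit time and
the streaming path was never actually covered.
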
v4-0017-Extend-handling-of-concurrent-aborts-for-streamin.patch
From 4030d805e9358a374f522c2c11c0ac15e36e19a7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 22 Nov 2019 12:43:38 +0530
Subject: [PATCH v4 17/19] Extend handling of concurrent aborts for streaming
 transaction

---
 src/backend/replication/logical/reorderbuffer.c | 36 ++++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  5 ++++
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index a3e2c06..789b425 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2348,9 +2348,9 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 
 	/*
 	 * When the (sub)transaction was streamed, notify the remote node
-	 * about the abort.
+	 * about the abort only if we have sent any data for this transaction.
 	 */
-	if (rbtxn_is_streamed(txn))
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
 		rb->stream_abort(rb, txn, lsn);
 
 	/* cosmetic... */
@@ -3266,6 +3266,7 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	volatile CommandId command_id;
 	bool		using_subtxn;
 	Size		streamed = 0;
+	MemoryContext ccxt = CurrentMemoryContext;
 	ReorderBufferStreamIterTXNState *volatile iterstate = NULL;
 
 	/*
@@ -3395,6 +3396,13 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			/* we're going to stream this change */
 			streamed++;
 
+			/*
+			 * Set CheckXidAlive to the (sub)transaction this change belongs
+			 * to, so that we can detect a concurrent abort while we are
+			 * decoding.
+			 */
+			CheckXidAlive = change->txn->xid;
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -3456,6 +3464,10 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 						ReorderBufferToastReplace(rb, txn, relation, change);
 						rb->stream_change(rb, txn, relation, change);
 
+						/* Remember that we have sent some data for this txn. */
+						if (!change->txn->any_data_sent)
+							change->txn->any_data_sent = true;
+
 						/*
 						 * Only clear reassembled toast chunks if we're sure
 						 * they're not required anymore. The creator of the
@@ -3694,6 +3706,9 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferStreamIterTXNFinish(rb, iterstate);
@@ -3712,7 +3727,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		PG_RE_THROW();
+		/* re-throw only if it's not an abort */
+		if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+		else
+		{
+			/* remember the command ID and snapshot for the streaming run */
+			txn->command_id = command_id;
+			txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+													  txn, command_id);
+			rb->stream_stop(rb, txn);
+
+			FlushErrorState();
+		}
 	}
 	PG_END_TRY();
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e132c3c..6186465 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -237,6 +237,11 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Have we sent any changes for this transaction to the output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
 	 * Toplevel transaction for this subxact (NULL for top-level).
 	 */
 	struct ReorderBufferTXN *toptxn;
-- 
1.8.3.1

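The case this patch handles is easy to trigger by hand: make a transaction
large enough to exceed logical_decoding_work_mem so that it gets partially
streamed, then roll it back while the walsender may still be decoding it.
Roughly (table name made up, limit as in the TAP tests, subscription created
WITH (streaming = on)):

    -- on the publisher, with logical_decoding_work_mem = '64kB'
    BEGIN;
    INSERT INTO test_tab SELECT i, md5(i::text)
      FROM generate_series(1, 100000) s(i);
    ROLLBACK;  -- may land while ReorderBufferStreamTXN is mid-stream

With CheckXidAlive set to the xid being decoded, a concurrent abort surfaces
as an ERRCODE_TRANSACTION_ROLLBACK error (raised by the catalog access
changes elsewhere in this series, if I read it correctly); the PG_CATCH
block above swallows exactly that error, stops the stream cleanly, and the
subscriber later receives a stream_abort (and only if some data was actually
sent, per the new any_data_sent flag).
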
v4-0018-Review-comment-fix-and-refactoring.patch
From 76cea7d3a261cde0492547db75e4c95ea787ad32 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Sun, 29 Dec 2019 15:41:26 +0530
Subject: [PATCH v4 18/19] Review comment fix and refactoring

---
 src/backend/replication/logical/reorderbuffer.c | 995 ++++++------------------
 1 file changed, 237 insertions(+), 758 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 789b425..8b3f112 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -149,28 +149,6 @@ typedef struct ReorderBufferIterTXNState
 	ReorderBufferIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
 } ReorderBufferIterTXNState;
 
-/*
- * k-way in-order change iteration support structures
- *
- * This is a simplified version for streaming, which does not require
- * serialization to files and only reads changes that are currently in
- * memory.
- */
-typedef struct ReorderBufferStreamIterTXNEntry
-{
-	XLogRecPtr	lsn;
-	ReorderBufferChange *change;
-	ReorderBufferTXN *txn;
-}			ReorderBufferStreamIterTXNEntry;
-
-typedef struct ReorderBufferStreamIterTXNState
-{
-	binaryheap *heap;
-	Size		nr_txns;
-	dlist_head	old_change;
-	ReorderBufferStreamIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
-}			ReorderBufferStreamIterTXNState;
-
 /* toast datastructures */
 typedef struct ReorderBufferToastEnt
 {
@@ -235,20 +213,6 @@ static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
 static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
 
-
-/* iterator for streaming (only get data from memory) */
-static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit(
-																		ReorderBuffer *rb,
-																		ReorderBufferTXN *txn);
-
-static ReorderBufferChange *ReorderBufferStreamIterTXNNext(
-							   ReorderBuffer *rb,
-							   ReorderBufferStreamIterTXNState * state);
-
-static void ReorderBufferStreamIterTXNFinish(
-								 ReorderBuffer *rb,
-								 ReorderBufferStreamIterTXNState * state);
-
 /*
  * ---------------------------------------
  * Disk serialization support functions
@@ -1324,210 +1288,6 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 }
 
 /*
- * Binary heap comparison function (streaming iterator).
- */
-static int
-ReorderBufferStreamIterCompare(Datum a, Datum b, void *arg)
-{
-	ReorderBufferStreamIterTXNState *state = (ReorderBufferStreamIterTXNState *) arg;
-	XLogRecPtr	pos_a = state->entries[DatumGetInt32(a)].lsn;
-	XLogRecPtr	pos_b = state->entries[DatumGetInt32(b)].lsn;
-
-	if (pos_a < pos_b)
-		return 1;
-	else if (pos_a == pos_b)
-		return 0;
-	return -1;
-}
-
-/*
- * Allocate & initialize an iterator which iterates in lsn order over a
- * transaction and all its subtransactions. This version is meant for
- * streaming of incomplete transactions.
- */
-static ReorderBufferStreamIterTXNState *
-ReorderBufferStreamIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
-{
-	Size		nr_txns = 0;
-	ReorderBufferStreamIterTXNState *state;
-	dlist_iter	cur_txn_i;
-	int32		off;
-
-	/* Check ordering of changes in the toplevel transaction. */
-	AssertChangeLsnOrder(rb, txn);
-
-	/*
-	 * Calculate the size of our heap: one element for every transaction that
-	 * contains changes.  (Besides the transactions already in the reorder
-	 * buffer, we count the one we were directly passed.)
-	 */
-	if (txn->nentries > 0)
-		nr_txns++;
-
-	dlist_foreach(cur_txn_i, &txn->subtxns)
-	{
-		ReorderBufferTXN *cur_txn;
-
-		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
-
-		/* Check ordering of changes in this subtransaction. */
-		AssertChangeLsnOrder(rb, cur_txn);
-
-		if (cur_txn->nentries > 0)
-			nr_txns++;
-	}
-
-	/*
-	 * TODO: Consider adding fastpath for the rather common nr_txns=1 case, no
-	 * need to allocate/build a heap then.
-	 */
-
-	/* allocate iteration state */
-	state = (ReorderBufferStreamIterTXNState *)
-		MemoryContextAllocZero(rb->context,
-							   sizeof(ReorderBufferStreamIterTXNState) +
-							   sizeof(ReorderBufferStreamIterTXNEntry) * nr_txns);
-
-	state->nr_txns = nr_txns;
-	dlist_init(&state->old_change);
-
-	/* allocate heap */
-	state->heap = binaryheap_allocate(state->nr_txns,
-									  ReorderBufferStreamIterCompare,
-									  state);
-
-	/*
-	 * Now insert items into the binary heap, in an unordered fashion.  (We
-	 * will run a heap assembly step at the end; this is more efficient.)
-	 */
-
-	off = 0;
-
-	/* add toplevel transaction if it contains changes */
-	if (txn->nentries > 0)
-	{
-		ReorderBufferChange *cur_change;
-
-		cur_change = dlist_head_element(ReorderBufferChange, node,
-										&txn->changes);
-
-		state->entries[off].lsn = cur_change->lsn;
-		state->entries[off].change = cur_change;
-		state->entries[off].txn = txn;
-
-		binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
-	}
-
-	/* add subtransactions if they contain changes */
-	dlist_foreach(cur_txn_i, &txn->subtxns)
-	{
-		ReorderBufferTXN *cur_txn;
-
-		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
-
-		if (cur_txn->nentries > 0)
-		{
-			ReorderBufferChange *cur_change;
-
-			cur_change = dlist_head_element(ReorderBufferChange, node,
-											&cur_txn->changes);
-
-			state->entries[off].lsn = cur_change->lsn;
-			state->entries[off].change = cur_change;
-			state->entries[off].txn = cur_txn;
-
-			binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
-		}
-	}
-
-	Assert(off == nr_txns);
-
-	/* assemble a valid binary heap */
-	binaryheap_build(state->heap);
-
-	return state;
-}
-
-/*
- * Return the next change when iterating over a transaction and its
- * subtransactions.
- *
- * Returns NULL when no further changes exist.
- */
-static ReorderBufferChange *
-ReorderBufferStreamIterTXNNext(ReorderBuffer *rb, ReorderBufferStreamIterTXNState * state)
-{
-	ReorderBufferChange *change;
-	ReorderBufferStreamIterTXNEntry *entry;
-	int32		off;
-
-	/* nothing there anymore */
-	if (state->heap->bh_size == 0)
-		return NULL;
-
-	off = DatumGetInt32(binaryheap_first(state->heap));
-	entry = &state->entries[off];
-
-	/* free memory we might have "leaked" in the previous *Next call */
-	if (!dlist_is_empty(&state->old_change))
-	{
-		change = dlist_container(ReorderBufferChange, node,
-								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
-		Assert(dlist_is_empty(&state->old_change));
-	}
-
-	change = entry->change;
-
-	/*
-	 * update heap with information about which transaction has the next
-	 * relevant change in LSN order
-	 */
-
-	/* there are in-memory changes */
-	if (dlist_has_next(&entry->txn->changes, &entry->change->node))
-	{
-		dlist_node *next = dlist_next_node(&entry->txn->changes, &change->node);
-		ReorderBufferChange *next_change =
-		dlist_container(ReorderBufferChange, node, next);
-
-		/* txn stays the same */
-		state->entries[off].lsn = next_change->lsn;
-		state->entries[off].change = next_change;
-
-		binaryheap_replace_first(state->heap, Int32GetDatum(off));
-		return change;
-	}
-
-	/* ok, no changes there anymore, remove */
-	binaryheap_remove_first(state->heap);
-
-	return change;
-}
-
-/*
- * Deallocate the iterator
- */
-static void
-ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
-								 ReorderBufferStreamIterTXNState * state)
-{
-	/* free memory we might have "leaked" in the last *Next call */
-	if (!dlist_is_empty(&state->old_change))
-	{
-		ReorderBufferChange *change;
-
-		change = dlist_container(ReorderBufferChange, node,
-								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
-		Assert(dlist_is_empty(&state->old_change));
-	}
-
-	binaryheap_free(state->heap);
-	pfree(state);
-}
-
-/*
  * Cleanup the contents of a transaction, usually after the transaction
  * committed or aborted.
  */
@@ -1854,6 +1614,11 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 		SnapBuildSnapDecRefcount(snap);
 }
 
+/*
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then send the stream_commit message.
+ */
 static void
 ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
@@ -1868,86 +1633,38 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
  *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * Send the data of a transaction (and its subtransactions) to the output
+ * plugin. If streaming is true, the data is sent using the streaming API.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
 	bool		using_subtxn;
+	Size		streamed = 0;
+	MemoryContext ccxt = CurrentMemoryContext;
 	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
+	ReorderBufferBuildTupleCidHash(rb, txn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	/* setup the initial snapshot */
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
 	/*
-	 * If the transaction was (partially) streamed, we need to commit it in a
-	 * 'streamed' way. That is, we first stream the remaining part of the
-	 * transaction, and then invoke stream_commit message.
-	 *
-	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
-	 * transaction, so we don't pass that directly.
-	 *
-	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
-	 */
-	if (rbtxn_is_streamed(txn))
-	{
-		ReorderBufferStreamCommit(rb, txn);
-		return;
-	}
-
-	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
-	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
-	}
-
-	snapshot_now = txn->base_snapshot;
-
-	/* build data to be able to lookup the CommandIds of catalog tuples */
-	ReorderBufferBuildTupleCidHash(rb, txn);
-
-	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
-
-	/*
-	 * Decoding needs access to syscaches et al., which in turn use
-	 * heavyweight locks and such. Thus we need to have enough state around to
-	 * keep track of those.  The easiest way is to simply use a transaction
-	 * internally.  That also allows us to easily enforce that nothing writes
-	 * to the database by checking for xid assignments.
+	 * Decoding needs access to syscaches et al., which in turn use
+	 * heavyweight locks and such. Thus we need to have enough state around to
+	 * keep track of those.  The easiest way is to simply use a transaction
+	 * internally.  That also allows us to easily enforce that nothing writes
+	 * to the database by checking for xid assignments.
 	 *
 	 * When we're called via the SQL SRF there's already a transaction
 	 * started, so start an explicit subtransaction there.
@@ -1961,11 +1678,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		iterstate = ReorderBufferIterTXNInit(rb, txn);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1978,11 +1699,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			 * subtransactions. The changes may have the same LSN due to
 			 * MULTI_INSERT xlog records.
 			 */
-			if (prev_lsn != InvalidXLogRecPtr)
-				Assert(prev_lsn <= change->lsn);
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
 
 			prev_lsn = change->lsn;
 
+			if (streaming)
+			{
+				/*
+				 * Set CheckXidAlive to the (sub)transaction this change
+				 * belongs to, so that we can detect a concurrent abort while
+				 * we are decoding.
+				 */
+				CheckXidAlive = change->txn->xid;
+
+				/* Increment the stream count. */
+				streamed++;
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1992,8 +1725,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -2059,7 +1790,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -2080,8 +1819,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -2099,7 +1836,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -2157,7 +1894,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -2166,10 +1911,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -2200,9 +1951,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -2222,7 +1973,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
 					}
 
 					break;
@@ -2260,16 +2012,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last LSN of the stream as the final_lsn before calling
+			 * stream_stop.
+			 */
+			txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
-		/* cleanup */
-		TeardownHistoricSnapshot(false);
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+			txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+													  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
+		/*
+		 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+		 * any memory. We could also keep the hash table and update it with
+		 * new ctid values, but this seems simpler and good enough for now.
+		 */
+		ReorderBufferDestroyTupleCidHash(rb, txn);
 
 		/*
 		 * Aborting the current (sub-)transaction as a whole has the right
@@ -2285,14 +2067,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -2311,18 +2101,117 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* re-throw only if it's not an abort */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+				/*
+				 * Set the last LSN of the stream as the final_lsn before
+				 * calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+
+				FlushErrorState();
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
 /*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then send the stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them downstream.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -3264,10 +3153,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
 	volatile Snapshot snapshot_now;
 	volatile CommandId command_id;
-	bool		using_subtxn;
-	Size		streamed = 0;
-	MemoryContext ccxt = CurrentMemoryContext;
-	ReorderBufferStreamIterTXNState *volatile iterstate = NULL;
 
 	/*
 	 * If this is a subxact, we need to stream the top-level transaction
@@ -3305,16 +3190,7 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			ReorderBufferTXN *subtxn;
 
 			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
-
-			if (subtxn->base_snapshot != NULL &&
-				(txn->base_snapshot == NULL ||
-				 txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
-			{
-				txn->base_snapshot = subtxn->base_snapshot;
-				txn->base_snapshot_lsn = subtxn->base_snapshot_lsn;
-				subtxn->base_snapshot = NULL;
-				subtxn->base_snapshot_lsn = InvalidXLogRecPtr;
-			}
+			ReorderBufferTransferSnapToParent(txn, subtxn);
 		}
 
 		command_id = FirstCommandId;
@@ -3344,407 +3220,10 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * build data to be able to lookup the CommandIds of catalog tuples
-	 */
-	ReorderBufferBuildTupleCidHash(rb, txn);
-
-	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
-
-	/*
-	 * Decoding needs access to syscaches et al., which in turn use
-	 * heavyweight locks and such. Thus we need to have enough state around to
-	 * keep track of those.  The easiest way is to simply use a transaction
-	 * internally.  That also allows us to easily enforce that nothing writes
-	 * to the database by checking for xid assignments.
-	 *
-	 * When we're called via the SQL SRF there's already a transaction
-	 * started, so start an explicit subtransaction there.
+	 * Call the main routine to decode the changes and send them downstream.
 	 */
-	using_subtxn = IsTransactionOrTransactionBlock();
-
-	PG_TRY();
-	{
-		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
-		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
-
-		if (using_subtxn)
-			BeginInternalSubTransaction("stream");
-		else
-			StartTransactionCommand();
-
-		/* start streaming this chunk of transaction */
-		rb->stream_start(rb, txn);
-
-		iterstate = ReorderBufferStreamIterTXNInit(rb, txn);
-		while ((change = ReorderBufferStreamIterTXNNext(rb, iterstate)) != NULL)
-		{
-			Relation	relation = NULL;
-			Oid			reloid;
-
-			/*
-			 * Enforce correct ordering of changes, merged from multiple
-			 * subtransactions. The changes may have the same LSN due to
-			 * MULTI_INSERT xlog records.
-			 */
-			if (prev_lsn != InvalidXLogRecPtr)
-				Assert(prev_lsn <= change->lsn);
-
-			prev_lsn = change->lsn;
-
-			/* we're going to stream this change */
-			streamed++;
-
-			/*
-			 * Set the CheckXidAlive to the current (sub)xid for which this
-			 * change belongs to so that we can detect the abort while we are
-			 * decoding.
-			 */
-			CheckXidAlive = change->txn->xid;
-
-			switch (change->action)
-			{
-				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
-
-					/*
-					 * Confirmation for speculative insertion arrived. Simply
-					 * use as a normal record. It'll be cleaned up at the end
-					 * of INSERT processing.
-					 */
-					Assert(specinsert->data.tp.oldtuple == NULL);
-					change = specinsert;
-					change->action = REORDER_BUFFER_CHANGE_INSERT;
-
-					/* intentionally fall through */
-				case REORDER_BUFFER_CHANGE_INSERT:
-				case REORDER_BUFFER_CHANGE_UPDATE:
-				case REORDER_BUFFER_CHANGE_DELETE:
-					Assert(snapshot_now);
-
-					reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
-												change->data.tp.relnode.relNode);
-
-					/*
-					 * Catalog tuple without data, emitted while catalog was
-					 * in the process of being rewritten.
-					 */
-					if (reloid == InvalidOid &&
-						change->data.tp.newtuple == NULL &&
-						change->data.tp.oldtuple == NULL)
-						goto change_done;
-					else if (reloid == InvalidOid)
-						elog(ERROR, "could not map filenode \"%s\" to relation OID",
-							 relpathperm(change->data.tp.relnode,
-										 MAIN_FORKNUM));
-
-					relation = RelationIdGetRelation(reloid);
-
-					if (relation == NULL)
-						elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
-							 reloid,
-							 relpathperm(change->data.tp.relnode,
-										 MAIN_FORKNUM));
-
-					if (!RelationIsLogicallyLogged(relation))
-						goto change_done;
-
-					/*
-					 * For now ignore sequence changes entirely. Most of the
-					 * time they don't log changes using records we
-					 * understand, so it doesn't make sense to handle the few
-					 * cases we do.
-					 */
-					if (relation->rd_rel->relkind == RELKIND_SEQUENCE)
-						goto change_done;
-
-					/* user-triggered change */
-					if (!IsToastRelation(relation))
-					{
-						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->stream_change(rb, txn, relation, change);
-
-						/* Remember that we have sent some data for this txn.*/
-						if (!change->txn->any_data_sent)
-							change->txn->any_data_sent = true;
-
-						/*
-						 * Only clear reassembled toast chunks if we're sure
-						 * they're not required anymore. The creator of the
-						 * tuple tells us.
-						 */
-						if (change->data.tp.clear_toast_afterwards)
-							ReorderBufferToastReset(rb, txn);
-					}
-					/* we're not interested in toast deletions */
-					else if (change->action == REORDER_BUFFER_CHANGE_INSERT)
-					{
-						/*
-						 * Need to reassemble the full toasted Datum in
-						 * memory, to ensure the chunks don't get reused till
-						 * we're done remove it from the list of this
-						 * transaction's changes. Otherwise it will get
-						 * freed/reused while restoring spooled data from
-						 * disk.
-						 */
-						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
-					}
-
-			change_done:
-
-					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
-					 */
-					if (specinsert != NULL)
-					{
-						ReorderBufferReturnChange(rb, specinsert);
-						specinsert = NULL;
-					}
-
-					if (relation != NULL)
-					{
-						RelationClose(relation);
-						relation = NULL;
-					}
-					break;
-
-				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
-
-					/*
-					 * Speculative insertions are dealt with by delaying the
-					 * processing of the insert until the confirmation record
-					 * arrives. For that we simply unlink the record from the
-					 * chain, so it does not get freed/reused while restoring
-					 * spooled data from disk.
-					 *
-					 * This is safe in the face of concurrent catalog changes
-					 * because the relevant relation can't be changed between
-					 * speculative insertion and confirmation due to
-					 * CheckTableNotInUse() and locking.
-					 */
-
-					/* clear out a pending (and thus failed) speculation */
-					if (specinsert != NULL)
-					{
-						ReorderBufferReturnChange(rb, specinsert);
-						specinsert = NULL;
-					}
-
-					/* and memorize the pending insertion */
-					dlist_delete(&change->node);
-					specinsert = change;
-					break;
-
-				case REORDER_BUFFER_CHANGE_TRUNCATE:
-					{
-						int			i;
-						int			nrelids = change->data.truncate.nrelids;
-						int			nrelations = 0;
-						Relation   *relations;
-
-						relations = palloc0(nrelids * sizeof(Relation));
-						for (i = 0; i < nrelids; i++)
-						{
-							Oid			relid = change->data.truncate.relids[i];
-							Relation	relation;
-
-							relation = RelationIdGetRelation(relid);
-
-							if (relation == NULL)
-								elog(ERROR, "could not open relation with OID %u", relid);
-
-							if (!RelationIsLogicallyLogged(relation))
-								continue;
-
-							relations[nrelations++] = relation;
-						}
-
-						rb->stream_truncate(rb, txn, nrelations, relations, change);
-
-						for (i = 0; i < nrelations; i++)
-							RelationClose(relations[i]);
-
-						break;
-					}
-
-				case REORDER_BUFFER_CHANGE_MESSAGE:
-
-					rb->stream_message(rb, txn, change->lsn, true,
-									   change->data.msg.prefix,
-									   change->data.msg.message_size,
-									   change->data.msg.message);
-					break;
-
-				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
-					/* get rid of the old */
-					TeardownHistoricSnapshot(false);
-
-					if (snapshot_now->copied)
-					{
-						ReorderBufferFreeSnap(rb, snapshot_now);
-						snapshot_now =
-							ReorderBufferCopySnap(rb, change->data.snapshot,
-												  txn, command_id);
-					}
-
-					/*
-					 * Restored from disk, need to be careful not to double
-					 * free. We could introduce refcounting for that, but for
-					 * now this seems infrequent enough not to care.
-					 */
-					else if (change->data.snapshot->copied)
-					{
-						snapshot_now =
-							ReorderBufferCopySnap(rb, change->data.snapshot,
-												  txn, command_id);
-					}
-					else
-					{
-						snapshot_now = change->data.snapshot;
-					}
-
-					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
-										  txn->xid);
-					break;
-
-				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
-					Assert(change->data.command_id != InvalidCommandId);
-
-					if (command_id < change->data.command_id)
-					{
-						command_id = change->data.command_id;
-
-						if (!snapshot_now->copied)
-						{
-							/* we don't use the global one anymore */
-							snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
-																 txn, command_id);
-						}
-
-						snapshot_now->curcid = command_id;
-
-						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
-											  txn->xid);
-					}
-
-					break;
-
-				case REORDER_BUFFER_CHANGE_INVALIDATION:
-
-					/*
-					 * Execute the invalidation message locally.
-					 *
-					 * XXX Do we need to care about relcacheInitFileInval and
-					 * the other fields added to ReorderBufferChange, or just
-					 * about the message itself?
-					 */
-					LocalExecuteInvalidationMessage(&change->data.inval.msg);
-					break;
-
-				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-					elog(ERROR, "tuplecid value in changequeue");
-					break;
-			}
-		}
-
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert);
-			specinsert = NULL;
-		}
-
-		/* clean up the iterator */
-		ReorderBufferStreamIterTXNFinish(rb, iterstate);
-		iterstate = NULL;
-
-		/* call stream_stop callback */
-		rb->stream_stop(rb, txn);
-
-		/* this is just a sanity check against bad output plugin behaviour */
-		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
-			elog(ERROR, "output plugin used XID %u",
-				 GetCurrentTransactionId());
-
-		/* remember the command ID and snapshot for the streaming run */
-		txn->command_id = command_id;
-		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
-												  txn, command_id);
-
-		/* cleanup */
-		TeardownHistoricSnapshot(false);
-
-		/*
-		 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
-		 * any memory. We could also keep the hash table and update it with
-		 * new ctid values, but this seems simpler and good enough for now.
-		 */
-		ReorderBufferDestroyTupleCidHash(rb, txn);
-
-		/*
-		 * Aborting the current (sub-)transaction as a whole has the right
-		 * semantics. We want all locks acquired in here to be released, not
-		 * reassigned to the parent and we do not want any database access
-		 * have persistent effects.
-		 */
-		AbortCurrentTransaction();
-
-		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
-
-		if (using_subtxn)
-			RollbackAndReleaseCurrentSubTransaction();
-	}
-	PG_CATCH();
-	{
-		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
-		ErrorData  *errdata = CopyErrorData();
-
-		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
-		if (iterstate)
-			ReorderBufferStreamIterTXNFinish(rb, iterstate);
-
-		TeardownHistoricSnapshot(true);
-
-		/*
-		 * Force cache invalidation to happen outside of a valid transaction
-		 * to prevent catalog access as we just caught an error.
-		 */
-		AbortCurrentTransaction();
-
-		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
-
-		if (using_subtxn)
-			RollbackAndReleaseCurrentSubTransaction();
-
-		/* re-throw only if it's not an abort */
-		if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
-		{
-			MemoryContextSwitchTo(ecxt);
-			PG_RE_THROW();
-		}
-		else
-		{
-			/* remember the command ID and snapshot for the streaming run */
-			txn->command_id = command_id;
-			txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
-													  txn, command_id);
-			rb->stream_stop(rb, txn);
-
-			FlushErrorState();
-		}
-	}
-	PG_END_TRY();
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
 
 	/*
 	 * Update the stream statistics.
-- 
1.8.3.1

v4-0019-Bugfix-in-schema-tracking.patch
From 036f61804046977873257075143af41b7abe2875 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 30 Dec 2019 13:14:25 +0530
Subject: [PATCH v4 19/19] Bugfix in schema-tracking

---
 src/backend/replication/pgoutput/pgoutput.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 0148f4c..ffd9f94 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -937,7 +937,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
-- 
1.8.3.1

#170Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#168)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have observed some more issues

1. Currently, in ReorderBufferCommit, it is always expected that
whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, and in
SPEC_CONFIRM we send the tuple we got in SPEC_INSERT. But now those
two messages can be in different streams, so we need to find a way to
handle this. Maybe once we get SPEC_INSERT we can remember the
tuple, and then if we get the SPEC_CONFIRM in the next stream we can
send that tuple?

Your suggestion makes sense to me. So, we can try it.
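For illustration, a minimal sketch of that idea, keeping the pending
speculative insert in the ReorderBufferTXN instead of a local variable
so it survives across streams (the specinsert field is an assumption,
not something in the posted patches):

/* In the change-processing loop; txn->specinsert is hypothetical. */
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
    /* Unlink the record and stash it on the (sub)txn, so it is still
     * around when the confirmation arrives, possibly in a later
     * stream. */
    dlist_delete(&change->node);
    change->txn->specinsert = change;
    break;

case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
    /* The matching SPEC_INSERT may have come in an earlier stream. */
    Assert(change->txn->specinsert != NULL);
    specinsert = change->txn->specinsert;
    change->txn->specinsert = NULL;
    specinsert->action = REORDER_BUFFER_CHANGE_INSERT;
    /* ... then send it like a normal INSERT ... */
    break;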

2. At commit time, in DecodeCommit, we check whether we need to skip
the changes of the transaction by calling SnapBuildXactNeedsSkip. But
since we now support streaming, it's possible that before we decode
the commit WAL we have already sent the changes to the output plugin,
even though we could have skipped those changes. So my question is:
instead of checking at commit time, can't we check before adding the
changes to the ReorderBuffer itself?

I think if we can do that, then the same would apply to the current
code irrespective of this patch. It is possible that we can't take
that decision while decoding because we haven't assembled a consistent
snapshot yet, but we might be able to do it when we try to stream the
changes. We need to take care of all the conditions during streaming
(when the logical_decoding_work_mem limit is reached) as we do in
DecodeCommit. This needs a bit more study.
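For reference, the conditions DecodeCommit applies boil down to
something like the sketch below, which is what would have to be
re-evaluated at stream time (illustrative only; SkipTransaction is a
made-up name, the callees are the existing ones from snapbuild.h and
logical.c):

static bool
SkipTransaction(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    /* not yet consistent, or before the LSN the client asked for? */
    if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, txn->first_lsn))
        return true;

    /* replayed from another origin and filtered by the plugin? */
    if (txn->origin_id != InvalidRepOriginId &&
        filter_by_origin_cb_wrapper(ctx, txn->origin_id))
        return true;

    return false;
}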

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#171Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#166)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Dec 26, 2019 at 12:36 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

On Tue, 24 Dec 2019 at 17:21, Amit Kapila <amit.kapila16@gmail.com> wrote:

Thank you for the explanation. The plan makes sense. But I think in the
current design it's a problem that the logical replication worker doesn't
receive changes (and doesn't check interrupts) while applying
committed changes, even if we don't have a worker dedicated to
applying. I think the worker should continue to receive changes and
save them to temporary files even while applying changes.

Won't that defeat the purpose of this feature, which is to reduce the apply
lag? Basically, it can so happen that while applying a commit, it
constantly gets changes of other transactions, which will delay the
apply of the current transaction.

You're right. But it seems to me that it optimizes the apply lag only
for the transaction that made many changes. On the other hand, while a
transaction with many changes is being applied, the apply of
subsequent changes is delayed.

Hmm, how would it be worse than the current situation, where once a
commit is encountered on the publisher, we won't start with other
transactions until the replay of the same is finished on the subscriber?

I think the best way, as
discussed, is to launch new workers for streamed transactions, but we
can do that as an additional feature. Anyway, as proposed, users can
choose the streaming mode for subscriptions, so there is an option to
turn this on selectively.

Yes. But a user who wants to use this feature would want to replicate
many changes, and I guess the side effect is quite big. I think that at
least we need to make logical replication tolerate such a situation.

What exactly do you mean by "at least we need to make logical
replication tolerate such a situation"? Do you have something specific
in mind?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#172Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#170)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have observed some more issues

1. Currently, in ReorderBufferCommit, it is always expected that
whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, and in
SPEC_CONFIRM we send the tuple we got in SPEC_INSERT. But now those
two messages can be in different streams, so we need to find a way to
handle this. Maybe once we get SPEC_INSERT we can remember the
tuple, and then if we get the SPEC_CONFIRM in the next stream we can
send that tuple?

Your suggestion makes sense to me. So, we can try it.

Sure.

2. At commit time, in DecodeCommit, we check whether we need to skip
the changes of the transaction by calling SnapBuildXactNeedsSkip. But
since we now support streaming, it's possible that before we decode
the commit WAL we have already sent the changes to the output plugin,
even though we could have skipped those changes. So my question is:
instead of checking at commit time, can't we check before adding the
changes to the ReorderBuffer itself?

I think if we can do that, then the same would apply to the current
code irrespective of this patch. It is possible that we can't take
that decision while decoding because we haven't assembled a consistent
snapshot yet, but we might be able to do it when we try to stream the
changes. We need to take care of all the conditions during streaming
(when the logical_decoding_work_mem limit is reached) as we do in
DecodeCommit. This needs a bit more study.

I agree.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#173Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#163)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Dec 24, 2019 at 10:58 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 12, 2019 at 3:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the way invalidations work for logical replication is that
normally, we always start a new transaction before decoding each
commit which allows us to accept the invalidations (via
AtStart_Cache). However, if there are catalog changes within the
transaction being decoded, we need to reflect those before trying to
decode the WAL of operations which happened after that catalog change.
As we are not logging the WAL for each invalidation, we need to
execute all the invalidation messages for this transaction at each
catalog change. We are able to do that now as we decode the entire WAL
for a transaction only once we get the commit's WAL which contains all
the invalidation messages. So, we queue them up and execute them for
each catalog change which we identify by WAL record
XLOG_HEAP2_NEW_CID.

Thanks for the explanation. That makes sense. But, it's still true,
AFAICS, that instead of doing this stuff with logging invalidations
you could just InvalidateSystemCaches() in the cases where you are
currently applying all of the transaction's invalidations. That
approach might be worse than changing the way invalidations are
logged, but the two approaches deserve to be compared. One approach
has more CPU overhead and the other has more WAL overhead, so it's a
little hard to compare them, but it seems worth mulling over.

I have given this some thought, and it seems to me that this will
increase not only CPU usage but also network usage. The increase in
CPU usage will affect all WALSenders that decode a transaction that
has performed DDL. The increase in network usage comes from the fact
that we would need to send the schema of relations again, even ones
that didn't require invalidation, because the invalidation blew away
our local map that remembers which relation schemas have been sent.
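For reference, the two approaches being weighed amount to roughly the
following at each catalog change (the loop is what
ReorderBufferExecuteInvalidations does with the queued-up messages):

int i;

/* Approach in the patch: replay only this transaction's messages. */
for (i = 0; i < txn->ninvalidations; i++)
    LocalExecuteInvalidationMessage(&txn->invalidations[i]);

/* Alternative suggested above: discard all caches wholesale. */
InvalidateSystemCaches();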

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#174Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#168)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it
needs to be in the RelationSyncEntry. In fact, I already have code for
that in my private repository - I thought the patches I sent here do
include this, but apparently I forgot to include this bit :-(

Attached is a rebased patch series, fixing this. It's essentially v2
with a couple of patches (0003, 0008, 0009 and 0012) replacing the
is_schema_sent with correct handling.

0003 - removes an is_schema_sent reference added prematurely (it's added
by a later patch, causing compile failure)

0008 - adds the is_schema_sent back (essentially reverting 0003)

0009 - removes is_schema_sent entirely

0012 - adds the correct handling of schema flags in pgoutput

Thanks for splitting the changes. They are quite clear.

I don't know what other changes you've made since v2, so this way it
should be possible to just take 0003, 0008, 0009 and 0012 and slip them
in with minimal hassle.

FWIW thanks to everyone (and Amit and Dilip in particular) working on
this patch series. There's been a lot of great reviews and improvements
since I abandoned this thread for a while. I expect to be able to spend
more time working on this in January.

+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+ MemoryContextSwitchTo(oldctx);
+}
I was looking into the schema-tracking solution and I have one
question: shouldn't we remove the topxid from the list if the
(sub)transaction is aborted? Because once it is aborted, we need to
resend the schema.

I think you are right because, at abort, the subscriber would remove
the changes (for a subtransaction) including the schema changes sent,
and then it won't be able to understand the subsequent changes sent by
the publisher. Won't we need to remove the xid from the list at commit
time as well? Otherwise, the list will keep on growing. One more
thing: we need to search the lists of all the relations in the local
map to find the xid being aborted/committed, right? If so, won't it be
costly doing that at each transaction abort/commit?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#175Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#174)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Jan 4, 2020 at 10:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+ MemoryContextSwitchTo(oldctx);
+}
I was looking into the schema-tracking solution and I have one
question: shouldn't we remove the topxid from the list if the
(sub)transaction is aborted? Because once it is aborted, we need to
resend the schema.

I think you are right because, at abort, the subscriber would remove
the changes (for a subtransaction) including the schema changes sent,
and then it won't be able to understand the subsequent changes sent by
the publisher. Won't we need to remove the xid from the list at commit
time as well? Otherwise, the list will keep on growing.

Yes, we need to remove the xid from the list at the time of commit as well.
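As an illustration, the cleanup could look something like the sketch
below, walking the existing RelationSyncCache hash in pgoutput.c (the
function name is made up, and when/where to call it at commit/abort is
exactly the open question):

static void
cleanup_streamed_txn(TransactionId xid)
{
    HASH_SEQ_STATUS hash_seq;
    RelationSyncEntry *entry;

    if (RelationSyncCache == NULL)
        return;

    /* Forget the finished xid in every cached relation entry. */
    hash_seq_init(&hash_seq, RelationSyncCache);
    while ((entry = hash_seq_search(&hash_seq)) != NULL)
        entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
}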

One more
thing: we need to search the lists of all the relations in the local
map to find the xid being aborted/committed, right? If so, won't it be
costly doing that at each transaction abort/commit?

Yeah, if multiple concurrent transactions operate on common relations
then the lists can grow longer. I am not sure how many concurrent large
transactions are possible; maybe the list won't be so huge that
searching becomes very costly. Otherwise, we can maintain a sorted
array of the xids and do a binary search, or we can maintain a hash?
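A sketch of the sorted-array variant, for comparison (streamed_xids and
nstreamed_xids are assumed fields, not in the posted patches):

static int
xid_cmp(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;

    if (xa == xb)
        return 0;
    return (xa < xb) ? -1 : 1;
}

/* O(log n) membership test instead of walking a list. */
static bool
streamed_txn_contains(RelationSyncEntry *entry, TransactionId xid)
{
    return bsearch(&xid, entry->streamed_xids, entry->nstreamed_xids,
                   sizeof(TransactionId), xid_cmp) != NULL;
}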

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#176Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#169)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yesterday, Tomas posted the latest version of the patch set, which
contains the fix for the schema-send part. Meanwhile, I was working on a
few review comments/bugfixes and refactoring. I have tried to merge those
changes with the latest patch set, except the refactoring related to the
"0006-Implement-streaming-mode-in-ReorderBuffer" patch, because Tomas
has also made some changes in the same patch.

I don't see any changes by Tomas in that particular patch, am I
missing something?

I have created a
separate patch for the same so that we can review the changes and then
we can merge them into the main patch.

It is better to merge it with the main patch for
"Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
difficult to review.

0002-Issue-individual-invalidations-with-wal_level-log
----------------------------------------------------------------------------
1.
xact_desc_invalidations(StringInfo buf,
{
..
+ else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+ appendStringInfo(buf, " snapshot %u", msg->sn.relId);

You have removed logging for the above cache but forgot to remove its
reference from one of the places. Also, I think you need to add a
comment somewhere in inval.c to say why you are writing WAL for
some types of invalidations and not for others?

Done

I don't see any new comments as asked by me. I think we should also
consider WAL logging at each command end instead of doing it piecemeal
as discussed in another email [1], which will have fewer code changes
and may be better for performance. You might want to evaluate the
performance of both approaches.

0003-Extend-the-output-plugin-API-with-stream-methods
--------------------------------------------------------------------------------

4.
stream_start_cb_wrapper()
{
..
+ /* state.report_location = apply_lsn; */
..
+ /* FIXME ctx->write_location = apply_lsn; */
..
}

See, if we can fix these and similar in the callback for the stop. I
think we don't have final_lsn till we commit/abort. Can we compute
before calling these API's?

Done

You have just used final_lsn, but I don't see where you have ensured
that it is set before the API stream_stop_cb_wrapper. I think we need
something similar to what Vignesh has done in one of his bug-fix
patches [2].

0005-Gracefully-handle-concurrent-aborts-of-uncommitte
----------------------------------------------------------------------------------
1.
@@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_CATCH();
{
/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+
if (iterstate)
ReorderBufferIterTXNFinish(rb, iterstate);

Spurious line change.

Done

+ /*
+ * We don't expect direct calls to heap_getnext with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(scan->rs_base.rs_rd) ||
+   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+ elog(ERROR, "improper heap_getnext call");

Earlier, I thought we don't need to check if it is a regular table in
this check, but it is required because output plugins can try to do
that and if they do so during decoding (with historic snapshots), the
same should not be allowed.

How about changing the error message to "unexpected heap_getnext call
during logical decoding" or something like that?

2. The commit message of this patch refers to Prepared transactions.
I think that needs to be changed.

0006-Implement-streaming-mode-in-ReorderBuffer
-------------------------------------------------------------------------

Few comments on v4-0018-Review-comment-fix-and-refactoring:
1.
+ if (streaming)
+ {
+ /*
+ * Set the last LSN of the stream as the final_lsn before calling
+ * stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

Shouldn't we try to set final_lsn as is done by Vignesh's patch [2]?

2.
+ if (streaming)
+ {
+ /*
+ * Set the CheckXidAlive to the current (sub)xid for which this
+ * change belongs to so that we can detect the abort while we are
+ * decoding.
+ */
+ CheckXidAlive = change->txn->xid;
+
+ /* Increment the stream count. */
+ streamed++;
+ }

Is the variable 'streamed' used anywhere?

3.
+ /*
+ * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+ * any memory. We could also keep the hash table and update it with
+ * new ctid values, but this seems simpler and good enough for now.
+ */
+ ReorderBufferDestroyTupleCidHash(rb, txn);

Won't this be required only when we are streaming changes?

As per my understanding, apart from the above comments, the known
pending work for this patchset is as follows:
a. The two open items agreed to by you in the email [3].
b. Complete the handling of schema_sent as discussed above [4].
c. A few comments by Vignesh and the responses on the same by me [5][6].
d. WAL overhead and performance testing for the additional WAL logging by
this patchset.
e. Some way to see the tuple for streamed transactions via the decoding API,
as speculated by you [7].

Have I missed anything?

[1]: /messages/by-id/CAA4eK1LOa+2KqNX=m=1qMBDW+o50AuwjAOX6ZqL-rWGiH1F9MQ@mail.gmail.com
[2]: /messages/by-id/CALDaNm3MDxFnsZsnSqVhPBLS3=qzNH6+YzB=xYuX2vbtsUeFgw@mail.gmail.com
[3]: /messages/by-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb=FMPpr9_hEB7hozQ-Q@mail.gmail.com
[4]: /messages/by-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV+ZcGb3BH6U3x2uxew@mail.gmail.com
[5]: /messages/by-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA@mail.gmail.com
[6]: /messages/by-id/CAA4eK1+ZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ@mail.gmail.com
[7]: /messages/by-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#177Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#176)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yesterday, Tomas posted the latest version of the patch set, which
contains the fix for the schema-send part. Meanwhile, I was working on a
few review comments/bugfixes and refactoring. I have tried to merge those
changes with the latest patch set, except the refactoring related to the
"0006-Implement-streaming-mode-in-ReorderBuffer" patch, because Tomas
has also made some changes in the same patch.

I don't see any changes by Tomas in that particular patch, am I
missing something?

He has created some sub-patches from the main patch for handling the
schema-sent issue. So if I make changes in that patch, all the other
patches will conflict.

I have created a
separate patch for the same so that we can review the changes and then
we can merge them into the main patch.

It is better to merge it with the main patch for
"Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
difficult to review.

Actually, we can merge 0008, 0009, 0012, and 0018 into the main patch
(0007). Basically, if we merge all of them then we don't need to deal
with the conflicts. I think Tomas has kept them separate so that we
can review the solution for the schema-sent issue. And I kept 0018 as a
separate patch to avoid conflicts and rebasing in 0008, 0009 and 0012.
In the next patch set, I will merge all of them into 0007.

0002-Issue-individual-invalidations-with-wal_level-log
----------------------------------------------------------------------------
1.
xact_desc_invalidations(StringInfo buf,
{
..
+ else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+ appendStringInfo(buf, " snapshot %u", msg->sn.relId);

You have removed logging for the above cache but forgot to remove its
reference from one of the places. Also, I think you need to add a
comment somewhere in inval.c to say why you are writing for WAL for
some types of invalidations and not for others?

Done

I don't see any new comments as asked by me.

Oh, I just fixed one part of the comment and overlooked the rest. Will fix.
I think we should also

consider WAL logging at each command end instead of doing it piecemeal
as discussed in another email [1], which will have fewer code changes
and may be better for performance. You might want to evaluate the
performance of both approaches.

Ok

0003-Extend-the-output-plugin-API-with-stream-methods
--------------------------------------------------------------------------------

4.
stream_start_cb_wrapper()
{
..
+ /* state.report_location = apply_lsn; */
..
+ /* FIXME ctx->write_location = apply_lsn; */
..
}

See, if we can fix these and similar in the callback for the stop. I
think we don't have final_lsn till we commit/abort. Can we compute
before calling these API's?

Done

You have just used final_lsn, but I don't see where you have ensured
that it is set before the API stream_stop_cb_wrapper. I think we need
something similar to what Vignesh has done in one of his bug-fix
patches [2]. See my comment below in this regard.

You can refer to the hunk below in 0018.

+ /*
+ * Done with current changes, call stream_stop callback for streaming
+ * transaction, commit callback otherwise.
+ */
+ if (streaming)
+ {
+ /*
+ * Set the last LSN of the stream as the final_lsn before calling
+ * stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }
0005-Gracefully-handle-concurrent-aborts-of-uncommitte
----------------------------------------------------------------------------------
1.
@@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_CATCH();
{
/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+
if (iterstate)
ReorderBufferIterTXNFinish(rb, iterstate);

Spurious line change.

Done

+ /*
+ * We don't expect direct calls to heap_getnext with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(scan->rs_base.rs_rd) ||
+   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+ elog(ERROR, "improper heap_getnext call");

Earlier, I thought we don't need to check if it is a regular table in
this check, but it is required because output plugins can try to do
that

I did not understand that; can you give an example?
and if they do so during decoding (with historic snapshots), the
same should not be allowed.

How about changing the error message to "unexpected heap_getnext call
during logical decoding" or something like that?

Ok

2. The commit message of this patch refers to Prepared transactions.
I think that needs to be changed.

0006-Implement-streaming-mode-in-ReorderBuffer
-------------------------------------------------------------------------

Few comments on v4-0018-Review-comment-fix-and-refactoring:
1.
+ if (streaming)
+ {
+ /*
+ * Set the last LSN of the stream as the final_lsn before calling
+ * stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

Shouldn't we try to set final_lsn as is done by Vignesh's patch [2]?

Isn't it the same? There we are doing it while serializing and here we
are doing it while streaming; basically, it is the last LSN we streamed.
Am I missing something?

2.
+ if (streaming)
+ {
+ /*
+ * Set the CheckXidAlive to the current (sub)xid for which this
+ * change belongs to so that we can detect the abort while we are
+ * decoding.
+ */
+ CheckXidAlive = change->txn->xid;
+
+ /* Increment the stream count. */
+ streamed++;
+ }

Is the variable 'streamed' used anywhere?

3.
+ /*
+ * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+ * any memory. We could also keep the hash table and update it with
+ * new ctid values, but this seems simpler and good enough for now.
+ */
+ ReorderBufferDestroyTupleCidHash(rb, txn);

Won't this be required only when we are streaming changes?

I will work on these review comments and reply to them separately along
with the patch.

As per my understanding, apart from the above comments, the known
pending work for this patchset is as follows:
a. The two open items agreed to by you in the email [3].
b. Complete the handling of schema_sent as discussed above [4].
c. A few comments by Vignesh and the responses on the same by me [5][6].
d. WAL overhead and performance testing for the additional WAL logging by
this patchset.
e. Some way to see the tuple for streamed transactions via the decoding API,
as speculated by you [7].

Have I missed anything?

I think this is the list I remember, apart from a few points by
Robert which are still under discussion [8].

[1] - /messages/by-id/CAA4eK1LOa+2KqNX=m=1qMBDW+o50AuwjAOX6ZqL-rWGiH1F9MQ@mail.gmail.com
[2] - /messages/by-id/CALDaNm3MDxFnsZsnSqVhPBLS3=qzNH6+YzB=xYuX2vbtsUeFgw@mail.gmail.com
[3] - /messages/by-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb=FMPpr9_hEB7hozQ-Q@mail.gmail.com
[4] - /messages/by-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV+ZcGb3BH6U3x2uxew@mail.gmail.com
[5] - /messages/by-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA@mail.gmail.com
[6] - /messages/by-id/CAA4eK1+ZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ@mail.gmail.com
[7] - /messages/by-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w@mail.gmail.com

[8] - /messages/by-id/CA+TgmoYH6N_YDvKH9AaAJo5ZTHn142K=B75VO9yKvjjjHcoZhA@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#178Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#177)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

It is better to merge it with the main patch for
"Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
difficult to review.

Actually, we can merge 0008, 0009, 0012, 0018 to the main patch
(0007). Basically, if we merge all of them then we don't need to deal
with the conflict. I think Tomas has kept them separate so that we
can review the solution for the schema sent. And, I kept 0018 as a
separate patch to avoid conflict and rebasing in 0008, 0009 and 0012.
In the next patch set, I will merge all of them to 0007.

Okay, I think we can merge those patches.

+ /*
+ * We don't expect direct calls to heap_getnext with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(scan->rs_base.rs_rd) ||
+   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+ elog(ERROR, "improper heap_getnext call");

Earlier, I thought we don't need to check if it is a regular table in
this check, but it is required because output plugins can try to do
that

I did not understand that; can you give an example?

I think it can lead to the same problem of concurrent aborts as for
catalog scans.

2. The commit message of this patch refers to Prepared transactions.
I think that needs to be changed.

0006-Implement-streaming-mode-in-ReorderBuffer
-------------------------------------------------------------------------

Few comments on v4-0018-Review-comment-fix-and-refactoring:
1.
+ if (streaming)
+ {
+ /*
+ * Set the last LSN of the stream as the final_lsn before calling
+ * stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

Shouldn't we try to set final_lsn as is done by Vignesh's patch [2]?

Isn't it the same? There we are doing it while serializing and here we
are doing it while streaming; basically, it is the last LSN we streamed.
Am I missing something?

No, I think you are right.

Few more comments:
--------------------------------
v4-0007-Implement-streaming-mode-in-ReorderBuffer
1.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+ * information about subtransactions, which could arrive after streaming start.
+ */
+ if (!txn->is_schema_sent)
+     snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+                                          txn, command_id);
..
}

Why are we using base snapshot here instead of the snapshot we saved
the first time streaming has happened? And as mentioned in comments,
won't we need to consider the snapshots for subtransactions that
arrived after the last time we have streamed the changes?

2.
+ /* remember the command ID and snapshot for the streaming run */
+ txn->command_id = command_id;
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+                                           txn, command_id);

I don't see where the txn->snapshot_now is getting freed. The
base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see
this getting freed.

3.
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * If this is a subxact, we need to stream the top-level transaction
+ * instead.
+ */
+ if (txn->toptxn)
+ {
+     ReorderBufferStreamTXN(rb, txn->toptxn);
+ return;
+ }

Is it ever possible that we reach here for a subtransaction? If not,
then it should be an Assert rather than an if condition.

4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
fields like origin_id, origin_lsn as we do in ReorderBufferCommit(),
especially to cover the case when it gets called due to memory
overflow (aka via ReorderBufferCheckMemoryLimit)?

v4-0017-Extend-handling-of-concurrent-aborts-for-streamin
1.
@@ -3712,7 +3727,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 if (using_subtxn)
     RollbackAndReleaseCurrentSubTransaction();

- PG_RE_THROW();
+ /* re-throw only if it's not an abort */
+ if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+     MemoryContextSwitchTo(ecxt);
+     PG_RE_THROW();
+ }
+ else
+ {
+     /* remember the command ID and snapshot for the streaming run */
+     txn->command_id = command_id;
+     txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+                                               txn, command_id);
+     rb->stream_stop(rb, txn);
+
+     FlushErrorState();
+ }

Can you update comments either in the above code block or some other
place to explain what is the concurrent abort problem and how we dealt
with it? Also, please explain how the above error handling is
sufficient to address all the various scenarios (sub-transaction got
aborted when we have already sent some changes, or when we have not
sent any changes yet).

v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
1.
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));

Why can't we use TransactionIdDidAbort here? If we can't use it, then
can you add comments stating the reason for the same?

2.
/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;

In comments, there is a mention of a prepared transaction. Do we
allow prepared transactions to be decoded as part of this patch?

3.
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))

This comment just says what the code below is doing; can you explain
the rationale behind this check? It would be better if it were clear
from the comments why we are doing this check after fetching the
tuple. I think this can refer to the comment I suggested to add for
changes in patch
v4-0017-Extend-handling-of-concurrent-aborts-for-streamin.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#179Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#178)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

It is better to merge it with the main patch for
"Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
difficult to review.

Actually, we can merge 0008, 0009, 0012, 0018 to the main patch
(0007). Basically, if we merge all of them then we don't need to deal
with the conflict. I think Tomas has kept them separate so that we
can review the solution for the schema sent. And, I kept 0018 as a
separate patch to avoid conflict and rebasing in 0008, 0009 and 0012.
In the next patch set, I will merge all of them to 0007.

Okay, I think we can merge those patches.

ok

+ /*
+ * We don't expect direct calls to heap_getnext with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(scan->rs_base.rs_rd) ||
+   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+ elog(ERROR, "improper heap_getnext call");

Earlier, I thought we don't need to check if it is a regular table in
this check, but it is required because output plugins can try to do
that

I did not understand that; can you give an example?

I think it can lead to the same problem of concurrent aborts as for
catalog scans.

Yeah, got it.

2. The commit message of this patch refers to Prepared transactions.
I think that needs to be changed.

0006-Implement-streaming-mode-in-ReorderBuffer
-------------------------------------------------------------------------

Few comments on v4-0018-Review-comment-fix-and-refactoring:
1.
+ if (streaming)
+ {
+ /*
+ * Set the last LSN of the stream as the final_lsn before calling
+ * stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

Shouldn't we try to set final_lsn as is done by Vignesh's patch [2]?

Isn't it the same? There we are doing it while serializing and here we
are doing it while streaming; basically, it is the last LSN we streamed.
Am I missing something?

No, I think you are right.

Few more comments:
--------------------------------
v4-0007-Implement-streaming-mode-in-ReorderBuffer
1.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+ * information about subtransactions, which could arrive after streaming start.
+ */
+ if (!txn->is_schema_sent)
+     snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+                                          txn, command_id);
..
}

Why are we using base snapshot here instead of the snapshot we saved
the first time streaming has happened? And as mentioned in comments,
won't we need to consider the snapshots for subtransactions that
arrived after the last time we have streamed the changes?

2.
+ /* remember the command ID and snapshot for the streaming run */
+ txn->command_id = command_id;
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+                                           txn, command_id);

I don't see where the txn->snapshot_now is getting freed. The
base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see
this getting freed.

Ok, I will check that and fix.
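One possible shape of the fix, sketched (assuming snapshot_now is only
set once the transaction has actually been streamed):

/* In ReorderBufferCleanupTXN, alongside the base_snapshot cleanup: */
if (rbtxn_is_streamed(txn) && txn->snapshot_now != NULL)
{
    ReorderBufferFreeSnap(rb, txn->snapshot_now);
    txn->snapshot_now = NULL;
}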

3.
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * If this is a subxact, we need to stream the top-level transaction
+ * instead.
+ */
+ if (txn->toptxn)
+ {
+     ReorderBufferStreamTXN(rb, txn->toptxn);
+ return;
+ }

Is it ever possible that we reach here for a subtransaction? If not,
then it should be an Assert rather than an if condition.

ReorderBufferCheckMemoryLimit can be called either for a
subtransaction or for the main transaction, depending on which
ReorderBufferTXN the current change is being added to.

I will analyze your other comments and fix them in the next version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#180Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#179)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3.
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * If this is a subxact, we need to stream the top-level transaction
+ * instead.
+ */
+ if (txn->toptxn)
+ {
+     ReorderBufferStreamTXN(rb, txn->toptxn);
+ return;
+ }

Is it ever possible that we reach here for a subtransaction? If not,
then it should be an Assert rather than an if condition.

ReorderBufferCheckMemoryLimit can be called either for a
subtransaction or for the main transaction, depending on which
ReorderBufferTXN the current change is being added to.

That function has code like below:

ReorderBufferCheckMemoryLimit()
{
..
if (ReorderBufferCanStream(rb))
{
/*
* Pick the largest toplevel transaction and evict it from memory by
* streaming the already decoded part.
*/
txn = ReorderBufferLargestTopTXN(rb);
/* we know there has to be one, because the size is not zero */
Assert(txn && !txn->toptxn);
..
ReorderBufferStreamTXN(rb, txn);
..
}

How can a ReorderBufferTXN for a subtransaction get passed here?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#181Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#180)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jan 6, 2020 at 4:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3.
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * If this is a subxact, we need to stream the top-level transaction
+ * instead.
+ */
+ if (txn->toptxn)
+ {
+     ReorderBufferStreamTXN(rb, txn->toptxn);
+ return;
+ }

Is it ever possible that we reach here for a subtransaction? If not,
then it should be an Assert rather than an if condition.

ReorderBufferCheckMemoryLimit can be called either for a
subtransaction or for the main transaction, depending on which
ReorderBufferTXN the current change is being added to.

That function has code like below:

ReorderBufferCheckMemoryLimit()
{
..
if (ReorderBufferCanStream(rb))
{
/*
* Pick the largest toplevel transaction and evict it from memory by
* streaming the already decoded part.
*/
txn = ReorderBufferLargestTopTXN(rb);
/* we know there has to be one, because the size is not zero */
Assert(txn && !txn->toptxn);
..
ReorderBufferStreamTXN(rb, txn);
..
}

How can a ReorderBufferTXN for a subtransaction get passed here?

Hmm, I missed it. You are right, will fix it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#182Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#181)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jan 6, 2020 at 4:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jan 6, 2020 at 4:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3.
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * If this is a subxact, we need to stream the top-level transaction
+ * instead.
+ */
+ if (txn->toptxn)
+ {
+     ReorderBufferStreamTXN(rb, txn->toptxn);
+ return;
+ }

Is it ever possible that we reach here for a subtransaction? If not,
then it should be an Assert rather than an if condition.

ReorderBufferCheckMemoryLimit can be called either for a
subtransaction or for the main transaction, depending on which
ReorderBufferTXN the current change is being added to.

That function has code like below:

ReorderBufferCheckMemoryLimit()
{
..
if (ReorderBufferCanStream(rb))
{
/*
* Pick the largest toplevel transaction and evict it from memory by
* streaming the already decoded part.
*/
txn = ReorderBufferLargestTopTXN(rb);
/* we know there has to be one, because the size is not zero */
Assert(txn && !txn->toptxn);
..
ReorderBufferStreamTXN(rb, txn);
..
}

How can a ReorderBufferTXN for a subtransaction get passed here?

Hmm, I missed it. You are right, will fix it.

I have observed one more design issue. The problem is that when we
get toasted chunks, we remember the changes in memory (in a hash table)
but don't stream them until we get the actual change on the main table.
Now, the problem is that we might get the changes of the toast table
and the main table in different streams. So basically, in a stream,
if we have only got the toasted tuples, then even after
ReorderBufferStreamTXN the memory usage will not be reduced.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#183Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#182)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have observed one more design issue.

Good observation.

The problem is that when we
get toasted chunks, we remember the changes in memory (in a hash table)
but don't stream them until we get the actual change on the main table.
Now, the problem is that we might get the changes of the toast table
and the main table in different streams. So basically, in a stream,
if we have only got the toasted tuples, then even after
ReorderBufferStreamTXN the memory usage will not be reduced.

I think we can't split such changes into different streams (unless we
design an entirely new solution to send partial changes of toast
data), so we need to send them together. We can keep a flag like
data_complete in ReorderBufferTXN and mark it complete only when we
are able to assemble the entire tuple. Now, whenever we try to
stream the changes once we reach the memory threshold, we can check
whether the data_complete flag is true; if so, then only send the
changes, otherwise we can pick the next largest transaction. I think
we can retry it a few times, and if we get incomplete data for
multiple transactions, then we can decide to spill the transaction, or
maybe we can directly spill the first largest transaction which has
incomplete data.
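A sketch of that selection logic (data_complete is the assumed flag,
total_size is the accounting added by this patch set, and the retry
policy is illustrative only):

/* Pick the largest top-level transaction that is safe to stream,
 * i.e. one whose toast chunks have all been assembled. */
static ReorderBufferTXN *
ReorderBufferLargestStreamableTXN(ReorderBuffer *rb)
{
    dlist_iter  iter;
    ReorderBufferTXN *largest = NULL;

    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn =
            dlist_container(ReorderBufferTXN, node, iter.cur);

        if (!txn->data_complete)    /* assumed flag */
            continue;

        if (largest == NULL || txn->total_size > largest->total_size)
            largest = txn;
    }

    /* NULL means nothing is streamable; fall back to spilling. */
    return largest;
}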

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#184Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#183)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have observed one more design issue.

Good observation.

The problem is that when we
get toasted chunks, we remember the changes in memory (in a hash table)
but don't stream them until we get the actual change on the main table.
Now, the problem is that we might get the changes of the toast table
and the main table in different streams. So basically, in a stream,
if we have only got the toasted tuples, then even after
ReorderBufferStreamTXN the memory usage will not be reduced.

I think we can't split such changes into different streams (unless we
design an entirely new solution to send partial changes of toast
data), so we need to send them together. We can keep a flag like
data_complete in ReorderBufferTXN and mark it complete only when we
are able to assemble the entire tuple. Now, whenever we try to
stream the changes once we reach the memory threshold, we can check
whether the data_complete flag is true; if so, then only send the
changes, otherwise we can pick the next largest transaction. I think
we can retry it a few times, and if we get incomplete data for
multiple transactions, then we can decide to spill the transaction, or
maybe we can directly spill the first largest transaction which has
incomplete data.

Yeah, we might do something along this line. Basically, we need to mark
the top-level transaction as data-incomplete if any of its
subtransactions has incomplete data (it will always be the latest
subtransaction of the top transaction). Also, for streaming we are
picking the largest top-level transaction, whereas for spilling we just
need the largest (sub)transaction. So we also need to decide, while
picking the largest top-level transaction for streaming, if we get a
few transactions with incomplete data, how we will go about the spill:
do we spill all the subtransactions under this top transaction, or do
we again find the largest (sub)transaction for spilling?
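
For illustration, the marking could follow the txn_flags style from
v5-0005 (RBTXN_DATA_INCOMPLETE and this helper are hypothetical, not
part of the posted patches):

    #define RBTXN_DATA_INCOMPLETE 0x0008
    #define rbtxn_data_incomplete(txn) ((txn)->txn_flags & RBTXN_DATA_INCOMPLETE)

    /* called when a toast chunk is queued before the main-table change */
    static void
    ReorderBufferMarkDataIncomplete(ReorderBufferTXN *txn)
    {
        txn->txn_flags |= RBTXN_DATA_INCOMPLETE;

        /* propagate to the top-level transaction, if this is a subxact */
        if (txn->toptxn)
            txn->toptxn->txn_flags |= RBTXN_DATA_INCOMPLETE;
    }

    /*
     * The flag would be cleared again (on both transactions) once the
     * main-table change completes the tuple -- not shown here.
     */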

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#185Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#184)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have observed one more design issue.

Good observation.

The problem is that when we
get toasted chunks, we remember the changes in memory (in a hash table)
but don't stream them until we get the actual change on the main table.
Now, the problem is that we might get the changes for the toast table
and the main table in different streams. So basically, in a stream,
if we have only got the toasted tuples, then even after
ReorderBufferStreamTXN the memory usage will not be reduced.

I think we can't split such changes across streams (unless we
design an entirely new solution to send partial changes of toast
data), so we need to send them together. We can keep a flag like
data_complete in ReorderBufferTXN and mark it complete only when we
are able to assemble the entire tuple. Now, whenever we try to
stream the changes once we reach the memory threshold, we can check
whether the data_complete flag is true; if so, then only send the
changes, otherwise we can pick the next largest transaction. I think
we can retry this a few times, and if we get incomplete data for
multiple transactions, then we can decide to spill the transaction,
or maybe we can directly spill the first largest transaction which
has incomplete data.

Yeah, we might do something along this line. Basically, we need to mark
the top-level transaction as data-incomplete if any of its
subtransactions has incomplete data (it will always be the latest
subtransaction of the top transaction). Also, for streaming we are
picking the largest top-level transaction, whereas for spilling we just
need the largest (sub)transaction. So we also need to decide, while
picking the largest top-level transaction for streaming, if we get a
few transactions with incomplete data, how we will go about the spill:
do we spill all the subtransactions under this top transaction, or do
we again find the largest (sub)transaction for spilling?

I think it is better to do the latter, as that will lead to spilling
only the required changes (the minimum needed to get the memory below
the threshold).
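
In other words (a sketch only; this assumes ReorderBufferLargestTXN()
considers subtransactions individually, as the spilling code does):

    /*
     * Spill the largest (sub)transactions one at a time, stopping as
     * soon as the buffer is back under the limit, instead of spilling
     * the whole top-level transaction at once.
     */
    while (rb->size >= logical_decoding_work_mem * 1024L)
        ReorderBufferSerializeTXN(rb, ReorderBufferLargestTXN(rb));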

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#186Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#185)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have observed one more design issue.

Good observation.

The problem is that when we
get toasted chunks, we remember the changes in memory (in a hash table)
but don't stream them until we get the actual change on the main table.
Now, the problem is that we might get the changes for the toast table
and the main table in different streams. So basically, in a stream,
if we have only got the toasted tuples, then even after
ReorderBufferStreamTXN the memory usage will not be reduced.

I think we can't split such changes across streams (unless we
design an entirely new solution to send partial changes of toast
data), so we need to send them together. We can keep a flag like
data_complete in ReorderBufferTXN and mark it complete only when we
are able to assemble the entire tuple. Now, whenever we try to
stream the changes once we reach the memory threshold, we can check
whether the data_complete flag is true; if so, then only send the
changes, otherwise we can pick the next largest transaction. I think
we can retry this a few times, and if we get incomplete data for
multiple transactions, then we can decide to spill the transaction,
or maybe we can directly spill the first largest transaction which
has incomplete data.

Yeah, we might do something along this line. Basically, we need to mark
the top-level transaction as data-incomplete if any of its
subtransactions has incomplete data (it will always be the latest
subtransaction of the top transaction). Also, for streaming we are
picking the largest top-level transaction, whereas for spilling we just
need the largest (sub)transaction. So we also need to decide, while
picking the largest top-level transaction for streaming, if we get a
few transactions with incomplete data, how we will go about the spill:
do we spill all the subtransactions under this top transaction, or do
we again find the largest (sub)transaction for spilling?

I think it is better to do the latter, as that will lead to spilling
only the required changes (the minimum needed to get the memory below
the threshold).

Makes sense to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#187Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#178)
14 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

It is better to merge it with the main patch for
"Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
difficult to review.

Actually, we can merge 0008, 0009, 0012 and 0018 into the main patch
(0007). Basically, if we merge all of them, then we don't need to deal
with the conflicts. I think Tomas has kept them separate so that we
can review the solution for sending the schema. And I kept 0018 as a
separate patch to avoid conflicts and rebasing in 0008, 0009 and 0012.
In the next patch set, I will merge all of them into 0007.

Okay, I think we can merge those patches.

Done.
0008, 0009, 0017 and 0018 are merged into 0007; 0012 is merged into 0010.

Few more comments:
--------------------------------
v4-0007-Implement-streaming-mode-in-ReorderBuffer
1.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+  * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+  * information about subtransactions, which could arrive after streaming start.
+  */
+ if (!txn->is_schema_sent)
+     snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+                                          txn, command_id);
..
}

Why are we using the base snapshot here instead of the snapshot we saved
the first time streaming happened? And as mentioned in the comments,
won't we need to consider the snapshots for subtransactions that
arrived after the last time we streamed the changes?

Fixed

2.
+ /* remember the command ID and snapshot for the streaming run */
+ txn->command_id = command_id;
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+                                           txn, command_id);

I don't see where the txn->snapshot_now is getting freed. The
base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see
this getting freed.

I have freed this in ReorderBufferCleanupTXN.
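
For reference, the cleanup amounts to something like this (a sketch;
ReorderBufferFreeSnap() is the existing helper in reorderbuffer.c):

    /* in ReorderBufferCleanupTXN(), alongside the base_snapshot cleanup */
    if (txn->snapshot_now != NULL)
    {
        ReorderBufferFreeSnap(rb, txn->snapshot_now);
        txn->snapshot_now = NULL;
    }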

3.
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * If this is a subxact, we need to stream the top-level transaction
+ * instead.
+ */
+ if (txn->toptxn)
+ {
+     ReorderBufferStreamTXN(rb, txn->toptxn);
+ return;
+ }

Is it ever possible that we reach here for a subtransaction? If not,
then it should be an Assert rather than an if condition.

Fixed
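
For illustration, the fix amounts to replacing the recursion with an
assertion along these lines (a sketch, using the patch's field names):

    /* callers must pass a top-level transaction here */
    Assert(txn->toptxn == NULL);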

4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
fields like origin_id and origin_lsn, as we do in ReorderBufferCommit(),
especially to cover the case when it gets called due to memory
overflow (i.e. via ReorderBufferCheckMemoryLimit)?

We get origin_lsn at commit time, so I am not sure how we can do
that. I have also noticed that currently we are not using origin_lsn
on the subscriber side. I think this needs more investigation: if we
want this, do we need to log it earlier?

v4-0017-Extend-handling-of-concurrent-aborts-for-streamin
1.
@@ -3712,7 +3727,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
if (using_subtxn)

RollbackAndReleaseCurrentSubTransaction();

- PG_RE_THROW();
+ /* re-throw only if it's not an abort */
+ if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+     MemoryContextSwitchTo(ecxt);
+     PG_RE_THROW();
+ }
+ else
+ {
+     /* remember the command ID and snapshot for the streaming run */
+     txn->command_id = command_id;
+     txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+                                               txn, command_id);
+     rb->stream_stop(rb, txn);
+
+     FlushErrorState();
+ }

Can you update the comments, either in the above code block or some
other place, to explain what the concurrent abort problem is and how we
dealt with it? Also, please explain how the above error handling is
sufficient to address all the various scenarios (a sub-transaction got
aborted when we have already sent some changes, or when we have not
sent any changes yet).

Done

v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
1.
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));

Why can't we use TransactionIdDidAbort here? If we can't use it, then
can you add a comment stating the reason?

Done

2.
/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;

In the comments, there is a mention of a prepared transaction. Do we
allow prepared transactions to be decoded as part of this patch?

Fixed

3.
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))

This comment just says what the code below is doing; can you explain
the rationale behind this check? It would be better if it were clear
from the comments why we are doing this check after fetching the
tuple. I think this can refer to the comment I suggested adding for
the changes in patch
v4-0017-Extend-handling-of-concurrent-aborts-for-streamin.

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v5-0005-Cleaning-up-of-flags-in-ReorderBufferTXN-structur.patch
From 3c783cce62294b92c51d9f4dba316443f25e7cee Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 18:08:37 +0200
Subject: [PATCH v5 05/14] Cleaning up of flags in ReorderBufferTXN structure

---
 src/backend/replication/logical/reorderbuffer.c | 36 ++++++++++++-------------
 src/include/replication/reorderbuffer.h         | 33 ++++++++++++++---------
 2 files changed, 38 insertions(+), 31 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 652a76e..9eda992 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -741,7 +741,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_first_lsn < cur_txn->first_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_first_lsn = cur_txn->first_lsn;
 	}
@@ -761,7 +761,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
 	}
@@ -784,7 +784,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
 
 	txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
 
-	Assert(!txn->is_known_as_subxact);
+	Assert(!rbtxn_is_known_subxact(txn));
 	Assert(txn->first_lsn != InvalidXLogRecPtr);
 	return txn;
 }
@@ -844,7 +844,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 
 	if (!new_sub)
 	{
-		if (subtxn->is_known_as_subxact)
+		if (rbtxn_is_known_subxact(subtxn))
 		{
 			/* already associated, nothing to do */
 			return;
@@ -860,7 +860,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 		}
 	}
 
-	subtxn->is_known_as_subxact = true;
+	subtxn->txn_flags |= RBTXN_IS_SUBXACT;
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
@@ -1081,7 +1081,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	{
 		ReorderBufferChange *cur_change;
 
-		if (txn->serialized)
+		if (rbtxn_is_serialized(txn))
 		{
 			/* serialize remaining changes */
 			ReorderBufferSerializeTXN(rb, txn);
@@ -1110,7 +1110,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		{
 			ReorderBufferChange *cur_change;
 
-			if (cur_txn->serialized)
+			if (rbtxn_is_serialized(cur_txn))
 			{
 				/* serialize remaining changes */
 				ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1274,7 +1274,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * they originally were happening inside another subtxn, so we won't
 		 * ever recurse more than one level deep here.
 		 */
-		Assert(subtxn->is_known_as_subxact);
+		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
 		ReorderBufferCleanupTXN(rb, subtxn);
@@ -1322,7 +1322,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/*
 	 * Remove TXN from its containing list.
 	 *
-	 * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
 	 * from the LSN-ordered list of toplevel TXNs.
@@ -1337,7 +1337,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	Assert(found);
 
 	/* remove entries spilled to disk */
-	if (txn->serialized)
+	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
 	/* deallocate */
@@ -1354,7 +1354,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
 		return;
 
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1987,7 +1987,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 			 * final_lsn to that of their last change; this causes
 			 * ReorderBufferRestoreCleanup to do the right thing.
 			 */
-			if (txn->serialized && txn->final_lsn == 0)
+			if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
 			{
 				ReorderBufferChange *last =
 				dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2135,7 +2135,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
 	 * operate on its top-level transaction instead.
 	 */
 	txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 	Assert(txn->base_snapshot == NULL);
@@ -2314,7 +2314,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->has_catalog_changes = true;
+	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2331,7 +2331,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
 	if (txn == NULL)
 		return false;
 
-	return txn->has_catalog_changes;
+	return rbtxn_has_catalog_changes(txn);
 }
 
 /*
@@ -2351,7 +2351,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
 		return false;
 
 	/* a known subtxn? operate on top-level txn instead */
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 
@@ -2539,12 +2539,12 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	rb->spillBytes += size;
 
 	/* Don't consider already serialized transaction. */
-	rb->spillTxns += txn->serialized ? 0 : 1;
+	rb->spillTxns += rbtxn_is_serialized(txn) ? 0 : 1;
 
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
-	txn->serialized = true;
+	txn->txn_flags |= RBTXN_IS_SERIALIZED;
 
 	if (fd != -1)
 		CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9e84687..04dd0cb 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -169,18 +169,34 @@ typedef struct ReorderBufferChange
 	dlist_node	node;
 } ReorderBufferChange;
 
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT          0x0002
+#define RBTXN_IS_SERIALIZED       0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn)    (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk?  It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn)       (txn->txn_flags & RBTXN_IS_SERIALIZED)
+
 typedef struct ReorderBufferTXN
 {
+	int     txn_flags;
+
 	/*
 	 * The transactions transaction id, can be a toplevel or sub xid.
 	 */
 	TransactionId xid;
 
-	/* did the TX have catalog changes */
-	bool		has_catalog_changes;
-
 	/* Do we know this is a subxact?  Xid of top-level txn if so */
-	bool		is_known_as_subxact;
 	TransactionId toplevel_xid;
 
 	/*
@@ -249,15 +265,6 @@ typedef struct ReorderBufferTXN
 	uint64		nentries_mem;
 
 	/*
-	 * Has this transaction been spilled to disk?  It's not always possible to
-	 * deduce that fact by comparing nentries with nentries_mem, because e.g.
-	 * subtransactions of a large transaction might get serialized together
-	 * with the parent - if they're restored to memory they'd have
-	 * nentries_mem == nentries.
-	 */
-	bool		serialized;
-
-	/*
 	 * List of ReorderBufferChange structs, including new Snapshots and new
 	 * CommandIds
 	 */
-- 
1.8.3.1

v5-0001-Immediately-WAL-log-assignments.patch
From 9924675dc35109c8823ee5a748e293d6917de1b7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 09:53:08 +0530
Subject: [PATCH v5 01/14] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead). However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is
required to avoid overflow in the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 45 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 +++++++++++++--------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 017f03b..51557e2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -5096,6 +5097,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6002,3 +6004,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2fa0a7f..b11b0c2 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3aa6812..8c281e8 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1165,6 +1165,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1203,6 +1204,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5e1dc8a..a99fcaf 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 record->toplevel_xid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033f..e23892a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f7cc8c4..4805cae 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -147,6 +147,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -280,6 +282,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

v5-0004-Extend-the-output-plugin-API-with-stream-methods.patch
From 2c2dbdf2ea938c8115a88c83255a77ecac371bec Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v5 04/14] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes for large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index cd105d9..a3efbfd 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -542,3 +571,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db9686..ace21ec 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname> setting.
+    At that point the largest toplevel transaction (measured by amount of memory
+    currently used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bdf4389..9c95fc1 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require change/commit/abort callbacks. The
+	 * message callback is optional, similarly to regular output plugins. We
+	 * however enable streaming when at least one of the methods is enabled,
+	 * so that we can easily identify missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -863,6 +911,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = txn->first_lsn; /* start of the streamed block */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up-to-date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = txn->final_lsn; /* end of the streamed block */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up-to-date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f..f24e246 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..0d0a94a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d16bebf..9e84687 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -345,6 +345,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -384,6 +430,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

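For context, here is a minimal sketch of how an output plugin would opt into the streaming interface added above. It only assumes the OutputPluginCallbacks fields from this patch series; the my_* functions are hypothetical plugin callbacks, not anything shipped here:

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* existing (non-streaming) callbacks */
	cb->startup_cb = my_startup;
	cb->begin_cb = my_begin;
	cb->change_cb = my_change;
	cb->commit_cb = my_commit;
	cb->shutdown_cb = my_shutdown;

	/* streaming of large in-progress transactions */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_change_cb = my_stream_change;
	cb->stream_truncate_cb = my_stream_truncate;
	cb->stream_message_cb = my_stream_message;	/* optional */
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
}

Per the wrappers above, stream_message_cb may be left NULL, while stream_start_cb, stream_stop_cb, stream_abort_cb and stream_commit_cb must be registered once streaming is enabled.
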
Attachment: v5-0002-Issue-individual-invalidations-with-wal_level-log.patch (application/octet-stream)
From d35181a5128caaa300131f5a9674d15614cc0f57 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH v5 02/14] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of the commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in memory
and writes them out only once, at commit time, which may reduce the
performance impact by amortizing the overhead and deduplicating the
invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c          | 50 +++++++++++++++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 23 ++++++++
 src/backend/replication/logical/reorderbuffer.c | 56 +++++++++++++++---
 src/backend/utils/cache/inval.c                 | 75 +++++++++++++++++++++++++
 src/include/access/xact.h                       | 18 +++++-
 src/include/replication/reorderbuffer.h         | 14 +++++
 7 files changed, 235 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index e388cc7..6e46d19 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,11 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -397,6 +402,14 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs,
+								xlrec->dbId, xlrec->tsId,
+								xlrec->relcacheInitFileInval);
+	}
 }
 
 const char *
@@ -424,7 +437,44 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval)
+{
+	int			i;
+
+	if (relcacheInitFileInval)
+		appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
+						 dbId, tsId);
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 51557e2..dd3d36f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6001,6 +6001,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore these records for now; what matters during redo are
+		 * the invalidations written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index a99fcaf..13a11ac 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,29 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/* XXX for now we're issuing invalidations one by one */
+				Assert(invals->nmsgs == 1);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->dbId, invals->tsId,
+											 invals->relcacheInitFileInval,
+											 invals->msgs[0]);
+
+				ReorderBufferXidSetCatalogChanges(reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f352805..723300b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -473,6 +473,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -1822,17 +1823,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					txn->is_schema_sent = false;
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2227,6 +2234,38 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn,
+							 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+							 SharedInvalidationMessage msg)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without a valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.dbId = dbId;
+	change->data.inval.tsId = tsId;
+	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
+	change->data.inval.msg = msg;
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2674,6 +2713,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -2770,6 +2810,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -3055,6 +3096,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..e0d04b9 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, we additionally write individual invalidations
+ *	into WAL, to support decoding of in-progress transactions.  Until now it
+ *	was enough to log invalidations only at commit time, because transactions
+ *	were only decoded once they had committed.  We only need to log catalog
+ *	cache and relcache invalidations; there cannot be any active MVCC scan in
+ *	logical decoding, so snapshot invalidations need not be logged.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,9 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -489,6 +499,18 @@ RegisterCatcacheInvalidation(int cacheId,
 {
 	AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   cacheId, hashValue, dbId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cc.id = (int8) cacheId;
+		msg.cc.dbId = dbId;
+		msg.cc.hashValue = hashValue;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -501,6 +523,18 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 {
 	AddCatalogInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								  dbId, catId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cat.id = SHAREDINVALCATALOG_ID;
+		msg.cat.dbId = dbId;
+		msg.cat.catId = catId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -511,6 +545,8 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 static void
 RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 {
+	bool		RelcacheInitFileInval = false;
+
 	AddRelcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
 
@@ -529,7 +565,22 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 	 * as well.  Also zap when we are invalidating whole relcache.
 	 */
 	if (relId == InvalidOid || RelationIdIsInInitFile(relId))
+	{
 		transInvalInfo->RelcacheInitFileInval = true;
+		RelcacheInitFileInval = true;
+	}
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.rc.id = SHAREDINVALRELCACHE_ID;
+		msg.rc.dbId = dbId;
+		msg.rc.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, RelcacheInitFileInval);
+	}
 }
 
 /*
@@ -1501,3 +1552,27 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval)
+{
+	xl_xact_invalidations xlrec;
+
+	/* prepare record */
+	memset(&xlrec, 0, sizeof(xlrec));
+	xlrec.dbId = MyDatabaseId;
+	xlrec.tsId = MyDatabaseTableSpace;
+	xlrec.relcacheInitFileInval = relcacheInitFileInval;
+	xlrec.nmsgs = nmsgs;
+
+	/* perform insertion */
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+	XLogRegisterData((char *) msgs,
+					 nmsgs * sizeof(SharedInvalidationMessage));
+	XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b38..6f2a583 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,22 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ *
+ * XXX Currently nmsgs=1 but that might change in the future.
+ */
+typedef struct xl_xact_invalidations
+{
+	Oid			dbId;			/* MyDatabaseId */
+	Oid			tsId;			/* MyDatabaseTableSpace */
+	bool		relcacheInitFileInval;	/* invalidate relcache init file */
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e42d4c5..d16bebf 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,16 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			Oid			dbId;	/* MyDatabaseId */
+			Oid			tsId;	/* MyDatabaseTableSpace */
+			bool		relcacheInitFileInval;	/* invalidate relcache init
+												 * file */
+			SharedInvalidationMessage msg;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -448,6 +459,9 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void		ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+										 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+										 SharedInvalidationMessage msg);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1

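Regarding the open question at the end of the 0002 commit message, a command-level cache could look roughly like the sketch below. To be clear, this is not part of the patch: the Queue/Flush helpers and pending* variables are invented for illustration, and LogLogicalInvalidations() is the function the patch adds to inval.c.

#define MAX_PENDING_LOGICAL_INVALS 32

static SharedInvalidationMessage pendingInvalMsgs[MAX_PENDING_LOGICAL_INVALS];
static int	nPendingInvalMsgs = 0;
static bool pendingRelcacheInitFileInval = false;

static void FlushPendingLogicalInvalidations(void);

/* Instead of WAL-logging each message immediately, queue it up. */
static void
QueueLogicalInvalidation(const SharedInvalidationMessage *msg,
						 bool relcacheInitFileInval)
{
	if (nPendingInvalMsgs == MAX_PENDING_LOGICAL_INVALS)
		FlushPendingLogicalInvalidations();

	pendingInvalMsgs[nPendingInvalMsgs++] = *msg;
	pendingRelcacheInitFileInval |= relcacheInitFileInval;
}

/* Emit all queued messages as a single XLOG_XACT_INVALIDATIONS record. */
static void
FlushPendingLogicalInvalidations(void)
{
	if (nPendingInvalMsgs == 0)
		return;

	LogLogicalInvalidations(nPendingInvalMsgs, pendingInvalMsgs,
							pendingRelcacheInitFileInval);

	nPendingInvalMsgs = 0;
	pendingRelcacheInitFileInval = false;
}

The flush would have to be wired into command boundaries (e.g. from CommandEndInvalidationMessages), which is exactly the part that needs more thought, and the decode.c hunk above would then have to loop over nmsgs instead of asserting nmsgs == 1.
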
Attachment: v5-0003-fixup-is_schema_sent-set-too-early.patch (application/octet-stream)
From bd1088604c7917441322bcdac6af6c45c4cf765c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 27 Dec 2019 22:50:55 +0100
Subject: [PATCH v5 03/14] fixup: is_schema_sent set too early

---
 src/backend/replication/logical/reorderbuffer.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 723300b..652a76e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1837,7 +1837,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * about the message itself?
 					 */
 					LocalExecuteInvalidationMessage(&change->data.inval.msg);
-					txn->is_schema_sent = false;
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-- 
1.8.3.1

Attachment: v5-0006-Gracefully-handle-concurrent-aborts-of-uncommitte.patch (application/octet-stream)
From 74663de44159dfe79a6d6dca3f1d8b7d0453a886 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:32:34 +0530
Subject: [PATCH v5 06/14] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of this sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 41 +++++++++++++++++++++
 src/backend/access/index/genam.c                | 49 +++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  8 ++--
 src/backend/utils/time/snapmgr.c                | 25 ++++++++++++-
 src/include/utils/snapmgr.h                     |  4 +-
 6 files changed, 124 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index ace21ec..319349a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7b8490d..2d4ef48 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1303,6 +1303,15 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with a valid
+	 * CheckXidAlive for regular tables; we check for that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1422,6 +1431,14 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with a valid
+	 * CheckXidAlive for regular tables; we check for that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1535,6 +1552,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with a
+	 * valid CheckXidAlive for regular tables; we check for that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_hot_search_buffer call during logical decoding");
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1683,6 +1708,14 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with a valid
+	 * CheckXidAlive for regular tables; we check for that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_get_latest_tid call during logical decoding");
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5481,6 +5514,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with a
+	 * valid CheckXidAlive for regular tables; we check for that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_finish_speculative call during logical decoding");
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05..5644b8d 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,22 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted, and if so we
+	 * error out.  Instead of checking the abort status directly, we check
+	 * that the transaction is neither in progress nor committed, because
+	 * after a system crash the status of transactions that were running at
+	 * that time might never have been marked.  So we need to consider them
+	 * as aborted.  See the detailed comments in snapmgr.c where the
+	 * variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +530,22 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted, and if so we
+	 * error out.  Instead of checking the abort status directly, we check
+	 * that the transaction is neither in progress nor committed, because
+	 * after a system crash the status of transactions that were running at
+	 * that time might never have been marked.  So we need to consider them
+	 * as aborted.  See the detailed comments in snapmgr.c where the
+	 * variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +672,22 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted, and if so we
+	 * error out.  Instead of checking the abort status directly, we check
+	 * that the transaction is neither in progress nor committed, because
+	 * after a system crash the status of transactions that were running at
+	 * that time might never have been marked.  So we need to consider them
+	 * as aborted.  See the detailed comments in snapmgr.c where the
+	 * variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9eda992..c9d28a3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -692,7 +692,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1551,7 +1551,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1802,7 +1802,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +1822,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c5..93a0c04 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This is used to re-check the XID's status during catalog access.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether the xid aborted; that will happen during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13c..12f737b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1

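To make the intended error handling concrete, a caller on the decoding side might wrap catalog-touching work roughly like this. The function name is hypothetical and this is only a sketch of the pattern the commit message describes, imagined inside reorderbuffer.c (ReorderBufferStreamTXN is added by the next patch):

static void
DecodeInProgressTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	MemoryContext ccxt = CurrentMemoryContext;

	PG_TRY();
	{
		/*
		 * Catalog access below may now fail with
		 * ERRCODE_TRANSACTION_ROLLBACK if the transaction being
		 * decoded aborts concurrently.
		 */
		ReorderBufferStreamTXN(rb, txn);
	}
	PG_CATCH();
	{
		ErrorData  *errdata;

		/* CopyErrorData() must not run in ErrorContext */
		MemoryContextSwitchTo(ccxt);
		errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/* concurrent abort: discard the error and return quietly */
			FlushErrorState();
			FreeErrorData(errdata);
		}
		else
			PG_RE_THROW();
	}
	PG_END_TRY();
}

Cleanup of the partially decoded transaction then happens when the abort record itself is eventually decoded.
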
Attachment: v5-0007-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From 60d03b120ef3ca7a62be67d78294322fb9dd5341 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 6 Jan 2020 13:15:24 +0530
Subject: [PATCH v5 07/14] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
in ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
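
As a hypothetical illustration of that last point, a plugin's stream_abort callback could use the new toptxn pointer to distinguish subtransaction aborts from toplevel ones (the send_* helpers are invented):

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	if (txn->toptxn != NULL)
	{
		/* subxact abort: discard only this subxact's streamed changes */
		send_subxact_abort(ctx, txn->toptxn->xid, txn->xid);
	}
	else
	{
		/* toplevel abort: discard the whole streamed transaction */
		send_txn_abort(ctx, txn->xid);
	}
}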
---
 src/backend/access/heap/heapam_visibility.c     |  38 +-
 src/backend/replication/logical/reorderbuffer.c | 691 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  32 ++
 3 files changed, 668 insertions(+), 93 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..160b167 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c9d28a3..c381d8d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -769,6 +782,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -864,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -987,7 +1035,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1023,6 +1071,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1037,6 +1088,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1320,6 +1374,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1345,8 +1408,93 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1354,9 +1502,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1495,63 +1640,48 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true, the data will be sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * build data to be able to lookup the CommandIds of catalog tuples
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
-	}
-
-	snapshot_now = txn->base_snapshot;
-
-	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1567,15 +1697,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1583,6 +1718,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			if (streaming)
+			{
+				/*
+				 * While streaming an in-progress transaction, the
+				 * (sub)transaction might get aborted concurrently.  In that
+				 * case, if the (sub)transaction made catalog changes, we
+				 * might decode tuples using the wrong catalog version.  To
+				 * detect a concurrent abort, we set CheckXidAlive to the xid
+				 * of the (sub)transaction this change belongs to.  During
+				 * catalog scans we then check the status of that xid, and if
+				 * it has aborted we report a specific error that we can
+				 * ignore.  We might have already streamed some changes for
+				 * the aborted (sub)transaction, but that is fine: when we
+				 * decode the abort we will stream an abort message that
+				 * truncates the changes on the subscriber.
+				 */
+				CheckXidAlive = change->txn->xid;
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1592,8 +1756,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1659,7 +1821,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1680,8 +1850,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1699,7 +1867,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1757,7 +1925,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1766,10 +1942,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1800,9 +1982,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +2004,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
 					}
 
 					break;
@@ -1860,14 +2043,40 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes; call the stream_stop callback for a
+		 * streamed transaction, or the commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last LSN of the stream as the final_lsn before calling
+			 * stream_stop.
+			 */
+			txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+			txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+													  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1885,14 +2094,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction, discard the
+		 * changes that we just streamed and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,18 +2128,117 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+				/*
+				 * Set the last LSN of the stream as the final_lsn before
+				 * calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+
+				FlushErrorState();
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
 /*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then send the stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -1946,6 +2262,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2030,6 +2353,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2165,8 +2495,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, while the transaction counter
+ * allows us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we update the toplevel transaction counters
+ * instead - we can't stream subtransactions individually anyway, and we
+ * only pick toplevel transactions for eviction, so only their counters
+ * matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2174,6 +2513,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2185,19 +2525,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2226,6 +2575,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2315,6 +2665,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2419,6 +2776,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because with streaming we don't
+ * update the memory accounting for subtransactions, so their size is always
+ * 0). But here we can simply iterate over the limited number of toplevel
+ * transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2438,15 +2827,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2739,6 +3159,101 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which then attempts to
+ * stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that has not been called yet as the
+	 * transaction is still in progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because new
+		 * subtransactions may have been added after the last streaming run,
+		 * so we need to add their xids to the snapshot's subxip array.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 04dd0cb..c4a2643 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* does the txn have catalog changes */
 #define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -187,6 +188,20 @@ typedef struct ReorderBufferChange
  */
 #define rbtxn_is_serialized(txn)       (txn->txn_flags & RBTXN_IS_SERIALIZED)
 
+/*
+ * Has this transaction been streamed to downstream? Similarly to spilling
+ * to disk, it's not trivial to deduce this from nentries and nentries_mem,
+ * for various reasons. For example, all changes may be in subtransactions
+ * in which case we'd have nentries==0 for the toplevel one, and it'd say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.
+ *
+ * Note: We never stream and serialize a transaction at the same time (we
+ * only spill to disk when streaming is not supported by the plugin),
+ * so only one of those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn)         (txn->txn_flags & RBTXN_IS_STREAMED)
+
 typedef struct ReorderBufferTXN
 {
 	int     txn_flags;
@@ -222,6 +237,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Have we sent any changes for this transaction to the output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -252,6 +277,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1

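To make the new control flow easier to follow, this is roughly the
callback sequence an output plugin supporting streaming sees for one
large transaction that exceeds the memory limit and then commits (a
sketch derived from the code above, not an exact trace):

    stream_start(txn)         first chunk (decoded changes so far)
    stream_change(txn, ...)   one call per change (likewise for
                              stream_message / stream_truncate)
    stream_stop(txn)          chunk done; snapshot/command_id remembered
      ... repeated on each further overflow ...
    stream_start(txn)         final chunk, streamed at commit time
    stream_change(txn, ...)
    stream_stop(txn)
    stream_commit(txn, lsn)   sent by ReorderBufferStreamCommit

On abort, ReorderBufferAbort() instead invokes stream_abort() (provided
any data was sent), and the downstream discards everything it spooled
for that XID.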
Attachment: v5-0009-Support-logical_decoding_work_mem-set-from-create.patch (application/octet-stream)
From 70b890b5e730d7c08661c3bfd6eecf149d766f2d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:51:04 +0530
Subject: [PATCH v5 09/14] Support logical_decoding_work_mem set from create
 subscription command

---
 doc/src/sgml/config.sgml                           | 21 +++++++++++
 doc/src/sgml/ref/create_subscription.sgml          | 12 ++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/commands/subscriptioncmds.c            | 44 +++++++++++++++++++---
 .../libpqwalreceiver/libpqwalreceiver.c            |  3 ++
 src/backend/replication/logical/worker.c           |  1 +
 src/backend/replication/pgoutput/pgoutput.c        | 30 ++++++++++++++-
 src/include/catalog/pg_subscription.h              |  3 ++
 src/include/replication/walreceiver.h              |  1 +
 9 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c902..8b1923c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..91790b0 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index f77a83b..5cd1daa 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 95962b4..c45c2ce 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +690,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +721,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +778,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +815,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 42e3e04..16f9d00 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 7a5471f..48b960c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1745,6 +1745,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7525082..536722b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -18,6 +18,7 @@
 #include "replication/logicalproto.h"
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
+#include "utils/guc.h"
 #include "utils/int8.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
@@ -87,11 +88,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +139,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -171,7 +196,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..3394379 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a276237..66e89f0 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -162,6 +162,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
1.8.3.1

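As a quick usage sketch for this patch (the subscription, connection and
publication names below are made up; the value is in kilobytes, matching
the minimum of 64 enforced in parse_output_parameters above):

    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION mypub
        WITH (work_mem = 131072);   -- decode using up to 128MB upstream

    ALTER SUBSCRIPTION mysub SET (work_mem = 262144);

If work_mem is not specified, the publisher falls back to its own
logical_decoding_work_mem setting, as described in the docs hunk above.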
Attachment: v5-0008-Fix-speculative-insert-bug.patch (application/octet-stream)
From 490495a4cd8ac57753f9977122bbd9887f4c09c1 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 10 Jan 2020 09:01:35 +0530
Subject: [PATCH v5 08/14] Fix speculative insert bug.

---
 src/backend/replication/logical/reorderbuffer.c | 23 +++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  6 ++++++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c381d8d..eb6fda5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1701,6 +1701,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
+		/*
+		 * Restore any previously saved speculatively-inserted tuple if we
+		 * are running in streaming mode.
+		 */
+		if (streaming && txn->specinsert != NULL)
+		{
+			specinsert = txn->specinsert;
+			txn->specinsert = NULL;
+		}
+
 		if (using_subtxn)
 			BeginInternalSubTransaction("stream");
 		else
@@ -2029,13 +2039,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		}
 
 		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
+		 * In non-streaming mode, if there's a speculative insertion remaining,
+		 * just clean it up; it can't have been successful, otherwise we'd
+		 * have gotten a confirmation record.  In streaming mode, remember the
+		 * tuple so that if we get the confirmation in the next stream we can
+		 * stream it then.
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			if (streaming)
+				txn->specinsert = specinsert;
+			else
+				ReorderBufferReturnChange(rb, specinsert);
 			specinsert = NULL;
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index c4a2643..f41e216 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -335,6 +335,12 @@ typedef struct ReorderBufferTXN
 	uint32		ninvalidations;
 	SharedInvalidationMessage *invalidations;
 
+	/*
+	 * Speculative insert saved from the last streamed run, in case the
+	 * speculative confirmation was not received in the same stream.
+	 */
+	ReorderBufferChange *specinsert;
+
 	/* ---
 	 * Position in one of three lists:
 	 * * list of subtransactions if we are *known* to be subxact
-- 
1.8.3.1

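For context, the bug this fixes involves INSERT ... ON CONFLICT, which
is decoded as a speculative insert followed by a separate confirm
record; with streaming, a chunk boundary may fall between the two. A
hypothetical reproducer sketch (table and values made up, assuming the
bulk insert is large enough to exceed logical_decoding_work_mem):

    CREATE TABLE t (a int PRIMARY KEY, b text);

    BEGIN;
    -- enough changes to trigger streaming mid-transaction
    INSERT INTO t SELECT i, 'x' FROM generate_series(1, 1000000) s(i);
    -- decoded as a speculative insert; its confirmation may only be
    -- decoded in the next streamed chunk
    INSERT INTO t VALUES (1, 'y') ON CONFLICT (a) DO NOTHING;
    COMMIT;

Without the fix, the tuple held in specinsert was discarded at the end
of the streamed run, so the confirmation arriving in the next chunk had
no saved tuple to stream.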
Attachment: v5-0010-Add-support-for-streaming-to-built-in-replication.patch (application/octet-stream)
From 8af095c64b08a228fdfdae2e6fdee5d3b686b9f2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 27 Dec 2019 23:05:20 +0100
Subject: [PATCH v5 10/14] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so there is nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    5 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   60 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    8 +-
 src/backend/replication/logical/launcher.c         |    2 +
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  157 ++-
 src/backend/replication/logical/worker.c           | 1031 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  310 +++++-
 src/backend/replication/slotfuncs.c                |    7 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2074 insertions(+), 42 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 6dfb2e4..e1fb907 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 91790b0..d9abf5e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -218,6 +218,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 5cd1daa..1dc486c 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->workmem = subform->subworkmem;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index c45c2ce..1ece10d 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
 						   bool *refresh, int *logical_wm,
-						   bool *logical_wm_given)
+						   bool *logical_wm_given, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -92,6 +93,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*refresh = true;
 	if (logical_wm)
 		*logical_wm_given = false;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -186,6 +189,26 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 
 			*logical_wm_given = true;
 			*logical_wm = defGetInt32(defel);
+
+			/*
+			 * Check that the value is not smaller than 64kB (which is
+			 * the minimum value for logical_decoding_work_mem).
+			 */
+			if (*logical_wm < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("%d is outside the valid range for parameter \"work_mem\" (64 .. 2147483647)",
+								*logical_wm)));
+		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
 		}
 		else
 			ereport(ERROR,
@@ -332,6 +355,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	int			logical_wm;
 	bool		logical_wm_given;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -348,7 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL, &logical_wm, &logical_wm_given);
+							   NULL, &logical_wm, &logical_wm_given,
+							   &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -430,7 +456,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 		values[Anum_pg_subscription_subworkmem - 1] =
 			Int32GetDatum(logical_wm);
 	else
-		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(-1);
+
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
 
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
@@ -692,11 +726,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				int			logical_wm;
 				bool		logical_wm_given;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
 										   NULL, &synchronous_commit, NULL,
-										   &logical_wm, &logical_wm_given);
+										   &logical_wm, &logical_wm_given,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -728,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subworkmem - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -740,7 +784,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
 										   NULL, NULL, NULL, NULL, NULL,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -778,7 +822,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh, NULL, NULL);
+										   NULL, &refresh, NULL, NULL,
+										   NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -815,7 +860,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 51c486b..03ef76c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4102,6 +4102,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 16f9d00..61701d0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,8 +406,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
-		appendStringInfo(&cmd, ", work_mem '%d'",
-						 options->proto.logical.work_mem);
+		if (options->proto.logical.work_mem != -1)
+			appendStringInfo(&cmd, ", work_mem '%d'",
+							 options->proto.logical.work_mem);
+
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
 
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e..e80d00c 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>
 
 #include "postgres.h"
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 9c95fc1..61064f3 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1149,7 +1149,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index dcf7c08..918a841 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,7 +139,8 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
@@ -147,6 +148,10 @@ logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -182,8 +187,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -191,6 +196,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -252,13 +261,18 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
+	pq_sendbyte(out, 'D');		/* action DELETE */
+
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
-	pq_sendbyte(out, 'D');		/* action DELETE */
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
 
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
@@ -300,6 +314,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -309,6 +324,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -351,12 +370,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -401,7 +424,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -409,6 +432,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -689,3 +716,119 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
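+/*
+ * Write STREAM START to the output stream.
+ */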
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgint(in, 4) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're inside a streamed block, so it must be valid) */
+	pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+	TransactionId xid;
+
+	xid = pq_getmsgint(in, 4);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (this is a streamed transaction, so it must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction IDs (for a streamed abort, both must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
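
To make the new framing concrete, here's a minimal sketch (not part of the
patch) of what a sender does for one streamed chunk, using the functions
above; ctx, txn, relation and newtuple are assumed to come from the usual
output plugin callbacks:

    /* open a streamed chunk for this toplevel transaction */
    OutputPluginPrepareWrite(ctx, true);
    logicalrep_write_stream_start(ctx->out, txn->xid,
                                  !rbtxn_is_streamed(txn));
    OutputPluginWrite(ctx, true);

    /* ordinary change messages follow, each now carrying the XID */
    OutputPluginPrepareWrite(ctx, true);
    logicalrep_write_insert(ctx->out, txn->xid, relation, newtuple);
    OutputPluginWrite(ctx, true);

    /* close the chunk; more chunks and a final commit/abort come later */
    OutputPluginPrepareWrite(ctx, true);
    logicalrep_write_stream_stop(ctx->out, txn->xid);
    OutputPluginWrite(ctx, true);

This is essentially what the new pgoutput stream callbacks (see below) do.
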
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 48b960c..5c20c0e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * has to deal with aborts of both the toplevel transaction and individual
+ * subtransactions. This is achieved by tracking the offset of each
+ * subtransaction's first change, which is then used to truncate the file
+ * with serialized changes.
+ *
+ * The files are placed in the temporary-file directory of the default
+ * tablespace (see subxact_filename/changes_filename), and the filenames
+ * include both the XID of the toplevel transaction and the OID of the
+ * subscription. This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
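+ *
+ * For example (illustrative values), a toplevel transaction with XID 1234
+ * arriving for a subscription with OID 16394 is spooled into files named
+ * "logical-16394-1234.changes" and "logical-16394-1234.subxacts".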
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -60,6 +82,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -67,6 +90,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -106,12 +130,59 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -163,6 +234,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a chunk of a streamed transaction), we
+ * simply redirect the message to a file for the proper toplevel
+ * transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -529,6 +636,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * if this is not the first segment, open existing file
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive an
+		 * abort for a toplevel transaction whose changes we haven't
+		 * received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're
+		 * likely aborting one of the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
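+		 *
+		 * For example (illustrative values): with tracked subxacts
+		 * (xid 1001, offset 0), (xid 1002, offset 8192) and (xid 1003,
+		 * offset 20480), an abort of xid 1002 truncates the changes file
+		 * to offset 8192 and discards both xid 1002 and xid 1003 from
+		 * the array.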
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found PG_USED_FOR_ASSERTS_ONLY = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* FIXME optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/* We should not receive aborts for unknown subtransactions. */
+		Assert(found);
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the handlers invoked from apply_dispatch know we're
+	 * applying a remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -541,6 +960,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -556,6 +978,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -591,6 +1016,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -695,6 +1123,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -830,6 +1261,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -929,6 +1363,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1020,6 +1457,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1117,6 +1570,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids - 1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1132,6 +1601,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1580,6 +2052,564 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main
+ * file. The file is always overwritten as a whole, and we also include a
+ * CRC32C checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
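+ *
+ * The on-disk layout is: a uint32 CRC32C checksum, a uint32 count of
+ * subxacts, and then the array of fixed-size SubXactInfo entries (the
+ * subxact XID and its offset in the changes file).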
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we free the memory allocated for the subxact info. There might
+	 * be one exceptional transaction with many subxacts, and we don't want
+	 * to keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as in the previous
+	 * call, so just ignore it (it's already tracked; this change simply
+	 * comes later within the same subxact).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as non-error.
+ *
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry in the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect a few
+	 * of them in progress (max_connections + max_prepared_xacts) so linear
+	 * search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with a length (not
+ * including the length field itself), an action code (identifying the
+ * message type) and the message contents (without the subxact
+ * TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -1746,6 +2776,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.work_mem = MySubscription->workmem;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 536722b..ebe0423 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -45,17 +45,45 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order in which the transactions are sent. So streamed transactions
+ * are tracked separately, using the streamed_txns list in each cache
+ * entry.
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
+	List	   *streamed_txns;	/* streamed toplevel transactions with
+								 * this schema */
 	bool		replicate_valid;
 	PublicationActions pubactions;
 } RelationSyncEntry;
@@ -64,11 +92,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -84,16 +118,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, int *logical_decoding_work_mem)
+						List **publication_names, int *logical_decoding_work_mem,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		work_mem_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -162,6 +206,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*logical_decoding_work_mem = (int)parsed;
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -174,6 +235,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -197,7 +259,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&logical_decoding_work_mem);
+								&logical_decoding_work_mem,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -217,6 +280,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -284,9 +368,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't
+	 * care whether it's the top-level transaction or a subtransaction (we
+	 * have already sent the XID of the top-level transaction when starting
+	 * the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may not be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send schema after each catalog change and it may
+		 * occur when streaming already started, so we have to track new catalog
+		 * changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -312,19 +435,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			set_schema_sent_in_streamed_txn(relentry, topxid);
+		else
+			relentry->schema_sent = true;
 	}
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -333,6 +463,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -361,14 +495,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -378,7 +512,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -387,7 +521,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -413,6 +547,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -433,13 +571,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -513,6 +652,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -549,6 +773,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  */
 static RelationSyncEntry *
@@ -623,6 +875,36 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -657,7 +939,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
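
Putting the pgoutput pieces together, a streamed transaction (illustrative
XID 1234) reaches the subscriber as a sequence of chunks like:

    'S' 1234 1           -- stream start, first segment
    'R'/'Y' 1234 ...     -- schema messages, now carrying the XID
    'I'/'U'/'D'/'T' ...  -- changes, prefixed with the (sub)xact XID
    'E' 1234             -- stream stop
    'S' 1234 0           -- later segments of the same transaction
    ...
    'E' 1234
    'c' 1234 ...         -- stream commit (or 'A' xid subxid for an abort)
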
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index bb69683..3085c0f 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,13 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9c06374..63fc2c7 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -969,6 +969,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3394379..18f416f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -50,6 +50,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	int32		subworkmem;		/* Memory to use to decode changes. */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -76,6 +78,7 @@ typedef struct Subscription
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	int			workmem;		/* Memory to decode changes. */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e5a5d02..bbc9112 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -945,7 +945,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 2cc2dc4..ade4188 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out, TransactionId xid);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 66e89f0..1e4269c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -163,6 +163,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			int			work_mem;	/* Memory limit to use for decoding */
+			bool		streaming;	/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v5-0011-Track-statistics-for-streaming.patch
From 6f2811a82d859b692b6b1a8e59083c6ab5ac953b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Jan 2020 09:45:27 +0530
Subject: [PATCH v5 11/14] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dcb5811..180ea88 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1996,6 +1996,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber after
+      the memory used by logical decoding exceeds
+      <literal>logical_decoding_work_mem</literal>. Streaming only works for
+      toplevel transactions (subtransactions cannot be streamed independently),
+      so the counter does not get incremented for subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the subscriber.
+      Transactions may get streamed repeatedly, and this counter gets incremented
+      on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 773edf8..cb9e6ee 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -785,7 +785,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index eb6fda5..9dd379e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3264,6 +3268,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Count the transaction only the first time it is streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 63fc2c7..d7f22ae 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1293,7 +1293,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1314,7 +1314,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that were spilled to disk or
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2357,6 +2358,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3196,7 +3200,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3253,6 +3257,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3276,6 +3283,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3362,6 +3372,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3610,11 +3625,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 427faa3..9ef4fbf 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5173,9 +5173,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index f41e216..d671a2f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -517,15 +517,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 366828f..3888b0c 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62eaf90..2dcb063 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1960,9 +1960,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1

v5-0013-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
From 657b9d3dd5df36dc04aa8138b071bb8eb3165024 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH v5 13/14] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9dd379e..5e8a931 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1327,6 +1327,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
+		/* make sure subtxn has final_lsn */
+		if (subtxn->final_lsn == InvalidXLogRecPtr)
+			subtxn->final_lsn = txn->final_lsn;
+
 		/*
 		 * Subtransactions are always associated to the toplevel TXN, even if
 		 * they originally were happening inside another subtxn, so we won't
-- 
1.8.3.1

v5-0014-Add-TAP-test-for-streaming-vs.-DDL.patch
From fd866d4e41a1fe3d61606029b529998eb5403c14 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v5 14/14] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v5-0012-Enable-streaming-for-all-subscription-TAP-tests.patch
From c01738e8cc2480efbc13af94b050ff83607eb589 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v5 12/14] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 77a1560..8cd1993 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -65,7 +65,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e..ad3ed13 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

#188Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#176)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

0002-Issue-individual-invalidations-with-wal_level-log
----------------------------------------------------------------------------
1.
xact_desc_invalidations(StringInfo buf,
{
..
+ else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+ appendStringInfo(buf, " snapshot %u", msg->sn.relId);

You have removed logging for the above cache but forgot to remove its
reference from one of the places. Also, I think you need to add a
comment somewhere in inval.c to say why you are writing WAL for
some types of invalidations and not for others?

Done

I don't see any new comments as asked by me.

Done

I think we should also

consider WAL logging at each command end instead of doing it piecemeal
as discussed in another email [1], which will require fewer code
changes and may perform better. You might want to evaluate the
performance of both approaches.

Still pending, will work on this.

0005-Gracefully-handle-concurrent-aborts-of-uncommitte
----------------------------------------------------------------------------------
1.
@@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_CATCH();
{
/* TODO: Encapsulate cleanup
from the PG_TRY and PG_CATCH blocks */
+
if (iterstate)
ReorderBufferIterTXNFinish(rb, iterstate);

Spurious line change.

Done

+ /*
+ * We don't expect direct calls to heap_getnext with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(scan->rs_base.rs_rd) ||
+   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+ elog(ERROR, "improper heap_getnext call");

Earlier, I thought we don't need to check if it is a regular table in
this check, but it is required because output plugins can try to do
that and if they do so during decoding (with historic snapshots), the
same should be not allowed.

How about changing the error message to "unexpected heap_getnext call
during logical decoding" or something like that?

Done

2. The commit message of this patch refers to Prepared transactions.
I think that needs to be changed.

0006-Implement-streaming-mode-in-ReorderBuffer
-------------------------------------------------------------------------

Few comments on v4-0018-Review-comment-fix-and-refactoring:
1.
+ if (streaming)
+ {
+ /*
+ * Set the last last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

Shouldn't we try to set final_lsn as is done by Vignesh's patch [2]?

We have already agreed on the current implementation.

2.
+ if (streaming)
+ {
+ /*
+ * Set the CheckXidAlive to the current (sub)xid for which this
+ * change belongs to so that we can detect the abort while we are
+ * decoding.
+ */
+ CheckXidAlive = change->txn->xid;
+
+ /* Increment the stream count. */
+ streamed++;
+ }

Is the variable 'streamed' used anywhere?

Removed

3.
+ /*
+ * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+ * any memory. We could also keep the hash table and update it with
+ * new ctid values, but this seems simpler and good enough for now.
+ */
+ ReorderBufferDestroyTupleCidHash(rb, txn);

Won't this be required only when we are streaming changes?

Fixed

As per my understanding apart from the above comments, the known
pending work for this patchset is as follows:
a. The two open items agreed to you in the email [3].
b. Complete the handling of schema_sent as discussed above [4].
c. Few comments by Vignesh and the response on the same by me [5][6].
d. WAL overhead and performance testing for additional WAL logging by
this patchset.
e. Some way to see the tuple for streamed transactions by decoding API
as speculated by you [7].

Have I missed anything?

I have worked on most of these items; I will reply to them separately.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#189Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#170)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have observed some more issues

1. Currently, in ReorderBufferCommit, it is always expected that
whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT and in
SPEC_CONFIRM we send the tuple we got in SPECT_INSERT. But, now those
two messages can be in different streams. So we need to find a way to
handle this. Maybe once we get SPEC_INSERT then we can remember the
tuple, and then if we get the SPEC_CONFIRM in the next stream we can
send that tuple?

Your suggestion makes sense to me. So, we can try it.

I have implemented this and attached it as a separate patch in my
latest patch set [1].
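
The core of that fix, as it appears in the v6-0006 patch attached
downthread, is roughly:

    /* At the end of a streamed run, keep an unconfirmed speculative
     * insert on the txn instead of freeing it, so that a SPEC_CONFIRM
     * arriving in a later stream can still find the tuple. */
    if (specinsert)
    {
        if (streaming)
            txn->specinsert = specinsert;   /* stash for the next stream */
        else
            ReorderBufferReturnChange(rb, specinsert);
        specinsert = NULL;
    }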

2. At commit time, in DecodeCommit, we check whether we need to skip
the changes of the transaction by calling SnapBuildXactNeedsSkip. But
since we now support streaming, it's possible that before we decode
the commit WAL we have already sent the changes to the output plugin,
even though we could have skipped those changes. So my question is:
instead of checking at commit time, can't we check before adding the
changes to the ReorderBuffer itself?

I think if we could do that, then the same would be true for the
current code irrespective of this patch. It is possible that we can't
take that decision while decoding because we haven't assembled a
consistent snapshot yet. We might be able to do it while we try to
stream the changes. I think we need to take care of all the
conditions during streaming (when the logical_decoding_work_mem limit
is reached) as we do in DecodeCommit. This needs a bit more study.

I have analyzed this further and I think we cannot evaluate all the
conditions even while streaming. Once we reach SNAPBUILD_FULL_SNAPSHOT
we start adding changes to the reorder buffer, so that we can replay
them if the transaction commits after we reach SNAPBUILD_CONSISTENT.
However, if the commit arrives before we reach SNAPBUILD_CONSISTENT,
we need to ignore the transaction. So even with
SNAPBUILD_FULL_SNAPSHOT we may stream changes that later get dropped,
and that is not something we can decide while streaming.
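
For reference, the commit-time skip being discussed looks roughly like
this in DecodeCommit (a simplified sketch, not the exact source):

    /* Simplified sketch of the commit-time check in decode.c. */
    if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr))
    {
        int     i;

        /* forget the whole transaction, subxacts included */
        for (i = 0; i < parsed->nsubxacts; i++)
            ReorderBufferForget(ctx->reorder, parsed->subxacts[i],
                                buf->origptr);
        ReorderBufferForget(ctx->reorder, xid, buf->origptr);
        return;
    }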

[1]: /messages/by-id/CAFiTN-snMb=53oqkM8av8Lqfxojjm4OBwCNxmFssgLCceY_zgg@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#190Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#167)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

I pushed 0005 (the rbtxn flags thing) after some light editing.
It's been around for long enough ...

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#191Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#187)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Here's a rebase of this patch series. I didn't change anything except

1. disregard what was 0005, since I already pushed it.
2. roll 0003 into 0002.
3. rebase 0007 (now 0005) to account for the reorderbuffer changes.

(I did notice that 0005 adds a new boolean any_data_sent, which is
silly -- it should be another txn_flags bit.)

However, tests don't pass for me; notably, test_decoding crashes.
OTOH I noticed that the streamed transaction support in test_decoding
writes the XID to the output, which is going to make it useless for
regression testing. It probably should not emit the numerical values.
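
One way to fix that would be to honor test_decoding's existing
include-xids option in the stream callbacks as well; a hypothetical
sketch:

    /* Sketch: print the XID only when include-xids is set, so the
     * regression test output stays stable across runs. */
    if (data->include_xids)
        appendStringInfo(ctx->out,
                         "opening a streamed block for transaction TXN %u",
                         txn->xid);
    else
        appendStringInfoString(ctx->out,
                               "opening a streamed block for transaction");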

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#192Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#191)
12 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2020-Jan-10, Alvaro Herrera wrote:

Here's a rebase of this patch series. I didn't change anything except

... this time with attachments ...

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v6-0004-Gracefully-handle-concurrent-aborts-of-uncommitte.patch (text/x-diff; charset=us-ascii)
From 06973f7b57a9c186e53400e8815d8edf7a6bc047 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:32:34 +0530
Subject: [PATCH v6 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of that sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml             |  5 +-
 src/backend/access/heap/heapam.c              | 41 ++++++++++++++++
 src/backend/access/index/genam.c              | 49 +++++++++++++++++++
 .../replication/logical/reorderbuffer.c       |  8 +--
 src/backend/utils/time/snapmgr.c              | 25 +++++++++-
 src/include/utils/snapmgr.h                   |  4 +-
 6 files changed, 124 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index ace21ec8e5..319349a92d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7b8490d4e5..2d4ef48069 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1303,6 +1303,15 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1421,6 +1430,14 @@ heap_fetch(Relation relation,
 	OffsetNumber offnum;
 	bool		valid;
 
+	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+
 	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
@@ -1535,6 +1552,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_hot_search_buffer call during logical decoding");
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1682,6 +1707,14 @@ heap_get_latest_tid(TableScanDesc sscan,
 	 */
 	Assert(ItemPointerIsValid(tid));
 
+	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_get_latest_tid call during logical decoding");
+
 	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
@@ -5481,6 +5514,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_finish_speculative call during logical decoding");
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05416..5644b8d41a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,22 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  Instead of directly checking the abort status, we check
+	 * that the transaction is neither in progress nor committed, because
+	 * after a system crash the status of transactions that were running at
+	 * that time might not have been marked, so we need to consider them as
+	 * aborted.  Refer to the detailed comments at snapmgr.c where the
+	 * variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +530,22 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  Instead of directly checking the abort status, we check
+	 * that the transaction is neither in progress nor committed, because
+	 * after a system crash the status of transactions that were running at
+	 * that time might not have been marked, so we need to consider them as
+	 * aborted.  Refer to the detailed comments at snapmgr.c where the
+	 * variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +672,22 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  Instead of directly checking the abort status, we check
+	 * that the transaction is neither in progress nor committed, because
+	 * after a system crash the status of transactions that were running at
+	 * that time might not have been marked, so we need to consider them as
+	 * aborted.  Refer to the detailed comments at snapmgr.c where the
+	 * variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 531897cf05..2da0a23a7e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -692,7 +692,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1551,7 +1551,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1802,7 +1802,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +1822,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..93a0c048c5 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,13 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This lets us re-check the XID's status while accessing catalogs.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether the xid aborted; that will happen during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..12f737b21a 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
2.20.1
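
To make the documentation rule in this patch concrete: an output plugin
consulting a user catalog table should go through the systable_* APIs,
so that the CheckXidAlive check can fire. A rough, hypothetical sketch
(my_catalog_oid is an assumed OID, and error handling is omitted):

    /* Hypothetical plugin-side lookup via the systable_* scan APIs. */
    Relation    rel = table_open(my_catalog_oid, AccessShareLock);
    SysScanDesc scan;
    HeapTuple   tup;

    /* NULL snapshot: use the current (here: historic) catalog snapshot */
    scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
    while ((tup = systable_getnext(scan)) != NULL)
    {
        /* process tup; systable_getnext() may now raise
         * ERRCODE_TRANSACTION_ROLLBACK on a concurrent abort */
    }
    systable_endscan(scan);
    table_close(rel, AccessShareLock);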

v6-0005-Implement-streaming-mode-in-ReorderBuffer.patch (text/x-diff; charset=us-ascii)
From 84b94c7c49ca7aae737ffb1451eee9098c483578 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 10 Jan 2020 18:03:27 -0300
Subject: [PATCH v6 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 691 +++++++++++++++---
 src/include/replication/reorderbuffer.h       |  36 +
 3 files changed, 672 insertions(+), 93 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2da0a23a7e..50341a6d9e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -768,6 +781,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -864,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -987,7 +1035,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1023,6 +1071,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1037,6 +1088,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1319,6 +1373,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1344,9 +1407,94 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they were originally nested inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1354,9 +1502,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1495,63 +1640,48 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
+
+	ReorderBufferStreamTXN(rb, txn);
+
+	rb->stream_commit(rb, txn, txn->final_lsn);
+
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
 	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
 	ReorderBufferIterTXNState *volatile iterstate = NULL;
-
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * build data to be able to lookup the CommandIds of catalog tuples
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
-	}
-
-	snapshot_now = txn->base_snapshot;
-
-	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1567,15 +1697,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
+		XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1583,6 +1718,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			if (streaming)
+			{
+				/*
+				 * While streaming an in-progress transaction, the
+				 * (sub)transaction might get aborted concurrently.  In that
+				 * case, if the (sub)transaction has catalog updates, we
+				 * might decode a tuple using the wrong catalog version.  To
+				 * detect a concurrent abort we set CheckXidAlive to the xid
+				 * of the (sub)transaction this change belongs to.  During
+				 * catalog scans we check the status of that xid, and if it
+				 * has aborted we report a specific error that we can ignore.
+				 * We might have already streamed some of the changes for the
+				 * aborted (sub)transaction, but that is fine, because when
+				 * we decode the abort we will send a stream abort message to
+				 * truncate the changes on the subscriber.
+				 */
+				CheckXidAlive = change->txn->xid;
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1592,8 +1756,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1659,7 +1821,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1680,8 +1850,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1699,7 +1867,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1757,7 +1925,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1766,10 +1942,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1800,9 +1982,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +2004,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
 					}
 
 					break;
@@ -1860,14 +2043,40 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last lsn of the stream as the final lsn before calling
+			 * stream stop.
+			 */
+			txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if transaction is streaming
+		 * otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+			txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+													  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1885,14 +2094,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,17 +2128,116 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+				/*
+				 * Set the last lsn of the stream as the final lsn before
+				 * calling stream stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
 
-		PG_RE_THROW();
+				FlushErrorState();
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1946,6 +2262,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2030,6 +2353,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2165,8 +2495,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2174,6 +2513,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2185,19 +2525,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2226,6 +2575,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2315,6 +2665,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2418,6 +2775,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because with streaming we don't update
+ * the memory accounting for subtransactions, so their size is always 0). But
+ * we can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
@@ -2438,15 +2827,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2739,6 +3159,101 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which attempts to
+ * stream it again before the commit)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through subxacts again). In fact, we must not do that, as we
+		 * may be using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because new
+		 * sub-transactions might have been added after the last streaming
+		 * run, so we need to add their xids (sub-xip) to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 79ea33cd26..629eeca7f6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -192,6 +193,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -225,6 +244,16 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Have we sent any changes for this transaction in output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -255,6 +284,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.20.1
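
For orientation, ReorderBufferCanStream() above simply reads
ctx->streaming. A plugin opts in by providing the stream callbacks in
its init function; a hypothetical sketch (the my_* names are assumed,
and the exact rule for setting ctx->streaming lives in logical.c):

    void
    _PG_output_plugin_init(OutputPluginCallbacks *cb)
    {
        cb->begin_cb = my_begin;
        cb->change_cb = my_change;
        cb->commit_cb = my_commit;

        /* providing these enables streaming of in-progress xacts */
        cb->stream_start_cb = my_stream_start;
        cb->stream_stop_cb = my_stream_stop;
        cb->stream_abort_cb = my_stream_abort;
        cb->stream_commit_cb = my_stream_commit;
        cb->stream_change_cb = my_stream_change;
    }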

v6-0006-Fix-speculative-insert-bug.patch (text/x-diff; charset=us-ascii)
From ae7bfc6848143185d13adaa5532c13e9f3d730ca Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 10 Jan 2020 09:01:35 +0530
Subject: [PATCH v6 06/12] Fix speculative insert bug.

---
 .../replication/logical/reorderbuffer.c       | 23 +++++++++++++++----
 src/include/replication/reorderbuffer.h       |  6 +++++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 50341a6d9e..8e4744f73a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1701,6 +1701,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
+		/*
+		 * Restore any previously speculatively-inserted tuple if we are
+		 * running in streaming mode.
+		 */
+		if (streaming && txn->specinsert != NULL)
+		{
+			specinsert = txn->specinsert;
+			txn->specinsert = NULL;
+		}
+
 		if (using_subtxn)
 			BeginInternalSubTransaction("stream");
 		else
@@ -2029,13 +2039,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		}
 
 		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
+		 * In non-streaming mode, if there's a speculative insertion remaining,
+		 * just clean it up; it can't have been successful, otherwise we'd have
+		 * gotten a confirmation record.  In streaming mode, remember the tuple
+		 * so that if we get the confirmation in the next stream we can stream
+		 * it then.
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			if (streaming)
+				txn->specinsert = specinsert;
+			else
+				ReorderBufferReturnChange(rb, specinsert);
 			specinsert = NULL;
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 629eeca7f6..0510d3831f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -343,6 +343,12 @@ typedef struct ReorderBufferTXN
 	uint32		ninvalidations;
 	SharedInvalidationMessage *invalidations;
 
+	/*
+	 * Speculative insert saved from the last streamed run, in case the
+	 * speculative confirm was not received in the same stream.
+	 */
+	ReorderBufferChange *specinsert;
+
 	/* ---
 	 * Position in one of three lists:
 	 * * list of subtransactions if we are *known* to be subxact
-- 
2.20.1
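
To spell out the life cycle of the saved tuple: txn->specinsert carries
the speculative insert across streaming runs; if the confirmation arrives
in a later run, the change is emitted and freed, and if the transaction
aborts instead, the change is simply returned to the buffer. A rough
sketch of the confirm side (illustrative only; emit_change() is a made-up
stand-in for however the reorderbuffer hands the change to the output
plugin):

	case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
		if (specinsert != NULL)
		{
			/* the insert we held back is now confirmed, so send it */
			emit_change(rb, txn, specinsert);
			ReorderBufferReturnChange(rb, specinsert);
			specinsert = NULL;
		}
		break;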

v6-0007-Support-logical_decoding_work_mem-set-from-create.patchtext/x-diff; charset=us-asciiDownload
From c00cf7ddb5f49d355c52b704479373ac6598134b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:51:04 +0530
Subject: [PATCH v6 07/12] Support logical_decoding_work_mem set from create
 subscription command

---
 doc/src/sgml/config.sgml                      | 21 +++++++++
 doc/src/sgml/ref/create_subscription.sgml     | 12 +++++
 src/backend/catalog/pg_subscription.c         |  1 +
 src/backend/commands/subscriptioncmds.c       | 44 ++++++++++++++++---
 .../libpqwalreceiver/libpqwalreceiver.c       |  3 ++
 src/backend/replication/logical/worker.c      |  1 +
 src/backend/replication/pgoutput/pgoutput.c   | 30 ++++++++++++-
 src/include/catalog/pg_subscription.h         |  3 ++
 src/include/replication/walreceiver.h         |  1 +
 9 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f..8b1923c9de 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..91790b0c95 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index f77a83bd2e..5cd1daa238 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 95962b4a3e..c45c2ce212 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +690,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +721,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +778,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +815,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 42e3e04e68..16f9d008fd 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 7a5471f95c..48b960c4c9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1745,6 +1745,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 752508213a..536722b32f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -18,6 +18,7 @@
 #include "replication/logicalproto.h"
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
+#include "utils/guc.h"
 #include "utils/int8.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
@@ -87,11 +88,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +139,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -171,7 +196,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..3394379f86 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a276237477..66e89f087b 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -162,6 +162,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
2.20.1

v6-0008-Add-support-for-streaming-to-built-in-replication.patchtext/x-diff; charset=us-asciiDownload
From 934dcdd48e4720cb558e6e7a1049f80b7e218344 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 27 Dec 2019 23:05:20 +0100
Subject: [PATCH v6 08/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, to identify in-progress
transactions, and allow adding additional bits of information (e.g.
XID of subtransactions); a reader-side sketch follows below.

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere to
send the data anyway.
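
As a reader-side sketch of the protocol extension (illustrative only,
variable names made up; pq_getmsgint() is the existing message-parsing
helper): when streaming, each data message carries the XID of the
(sub)transaction that produced the change, prefixed before the usual
fields:

	if (in_streamed_transaction)
		xid = pq_getmsgint(s, 4);	/* XID of the (sub)xact */
	relid = pq_getmsgint(s, 4);		/* relation OID, as before */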
---
 doc/src/sgml/ref/alter_subscription.sgml      |    5 +-
 doc/src/sgml/ref/create_subscription.sgml     |   12 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   60 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    8 +-
 src/backend/replication/logical/launcher.c    |    2 +
 src/backend/replication/logical/logical.c     |    4 +-
 src/backend/replication/logical/proto.c       |  157 ++-
 src/backend/replication/logical/worker.c      | 1031 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  310 ++++-
 src/backend/replication/slotfuncs.c           |    7 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 22 files changed, 2074 insertions(+), 42 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 6dfb2e4d3e..e1fb9075e1 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 91790b0c95..d9abf5e64c 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -218,6 +218,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 5cd1daa238..1dc486c0e7 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->workmem = subform->subworkmem;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index c45c2ce212..1ece10d9f5 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
 						   bool *refresh, int *logical_wm,
-						   bool *logical_wm_given)
+						   bool *logical_wm_given, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -92,6 +93,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*refresh = true;
 	if (logical_wm)
 		*logical_wm_given = false;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -186,6 +189,26 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 
 			*logical_wm_given = true;
 			*logical_wm = defGetInt32(defel);
+
+			/*
+			 * Check that the value is not smaller than 64kB (which is
+			 * the minimum value for logical_decoding_work_mem).
+			 */
+			if (*logical_wm < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("%d is outside the valid range for parameter \"work_mem\" (64 .. 2147483647)",
+								*logical_wm)));
+		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
 		}
 		else
 			ereport(ERROR,
@@ -332,6 +355,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	int			logical_wm;
 	bool		logical_wm_given;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -348,7 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL, &logical_wm, &logical_wm_given);
+							   NULL, &logical_wm, &logical_wm_given,
+							   &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -430,7 +456,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 		values[Anum_pg_subscription_subworkmem - 1] =
 			Int32GetDatum(logical_wm);
 	else
-		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(-1);
+
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
 
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
@@ -692,11 +726,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				int			logical_wm;
 				bool		logical_wm_given;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
 										   NULL, &synchronous_commit, NULL,
-										   &logical_wm, &logical_wm_given);
+										   &logical_wm, &logical_wm_given,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -728,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subworkmem - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -740,7 +784,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
 										   NULL, NULL, NULL, NULL, NULL,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -778,7 +822,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh, NULL, NULL);
+										   NULL, &refresh, NULL, NULL,
+										   NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -815,7 +860,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 51c486bebd..03ef76caea 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4102,6 +4102,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 16f9d008fd..61701d0590 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,8 +406,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
-		appendStringInfo(&cmd, ", work_mem '%d'",
-						 options->proto.logical.work_mem);
+		if (options->proto.logical.work_mem != -1)
+			appendStringInfo(&cmd, ", work_mem '%d'",
+							 options->proto.logical.work_mem);
+
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
 
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e987..e80d00c1c3 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>
 
 #include "postgres.h"
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 9c95fc1ed8..61064f392a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1149,7 +1149,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index dcf7c08c18..918a841125 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,7 +139,8 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
@@ -147,6 +148,10 @@ logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -182,8 +187,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -191,6 +196,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -252,13 +261,18 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
+	pq_sendbyte(out, 'D');		/* action DELETE */
+
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
-	pq_sendbyte(out, 'D');		/* action DELETE */
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
 
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
@@ -300,6 +314,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -309,6 +324,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -351,12 +370,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -401,7 +424,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -409,6 +432,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -689,3 +716,119 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
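+/*
+ * Write STREAM START to the output stream.
+ */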
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendint32(out, first_segment ? 1 : 0);
+}
+
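+/*
+ * Read STREAM START from the input stream, setting *first_segment.
+ */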
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgint(in, 4) == 1);
+
+	return xid;
+}
+
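+/*
+ * Write STREAM STOP to the output stream.
+ */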
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're in a stream block, so it must be valid) */
+	pq_sendint32(out, xid);
+}
+
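+/*
+ * Read STREAM STOP from the input stream.
+ */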
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+	TransactionId xid;
+
+	xid = pq_getmsgint(in, 4);
+
+	return xid;
+}
+
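+/*
+ * Write STREAM COMMIT to the output stream.
+ */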
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (the transaction was streamed, so it must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
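+/*
+ * Read STREAM COMMIT from the input stream.
+ */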
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
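+/*
+ * Write STREAM ABORT to the output stream.
+ */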
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* toplevel and subtransaction XIDs (both must be valid when streaming) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
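+/*
+ * Read STREAM ABORT from the input stream.
+ */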
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 48b960c4c9..5c20c0e4c3 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, the apply worker has to handle
+ * aborts of both the toplevel transaction and individual subtransactions.
+ * This is achieved by tracking the file offset of each subtransaction's
+ * first change, which is then used to truncate the file with serialized
+ * changes.
+ *
+ * The files are placed in the temporary-files directory of the default
+ * tablespace, and the filenames include both the XID of the toplevel
+ * transaction and the OID of the subscription. This is necessary so that
+ * different workers processing a remote transaction with the same XID
+ * don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -60,6 +82,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -67,6 +90,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -106,12 +130,59 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -163,6 +234,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the message to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -528,6 +635,318 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * if this is not the first segment, open existing file
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for one of the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found PG_USED_FOR_ASSERTS_ONLY = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* FIXME optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/* We should not receive aborts for unknown subtransactions. */
+		Assert(found);
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -541,6 +960,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -556,6 +978,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -591,6 +1016,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -695,6 +1123,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -830,6 +1261,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -929,6 +1363,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1020,6 +1457,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1116,6 +1569,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+}
+
 /*
  * Apply main loop.
  */
@@ -1132,6 +1601,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1580,6 +2052,564 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure we just ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts.
+	 * We intentionally scan the array from the tail, because we're likely
+	 * adding a change for one of the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
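
If the XXX above were resolved and the subxact XIDs were guaranteed to
arrive in increasing order, the backwards scan could become a binary
search. A minimal sketch (hypothetical, not part of the patch), using
TransactionIdPrecedes for the comparison so XID wraparound is handled:

/* return true iff xid is present in the (sorted) subxacts array */
static bool
subxact_info_lookup_sorted(TransactionId xid)
{
	int64		lo = 0;
	int64		hi = (int64) nsubxacts - 1;

	while (lo <= hi)
	{
		int64		mid = lo + (hi - lo) / 2;

		if (subxacts[mid].xid == xid)
			return true;
		else if (TransactionIdPrecedes(subxacts[mid].xid, xid))
			lo = mid + 1;
		else
			hi = mid - 1;
	}

	return false;
}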
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
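
As a concrete example, assuming the default tablespace's temporary
directory resolves to base/pgsql_tmp, a subscription with OID 16395
streaming toplevel XID 5432 (both values made up for illustration) would
use base/pgsql_tmp/logical-16395-5432.subxacts and
base/pgsql_tmp/logical-16395-5432.changes.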
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Clean up the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into its place. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs full, or not yet allocated */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length field itself), action code (identifying the message type) and
+ * message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
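
Each record on disk is therefore framed as an int32 length (covering the
action byte and the payload), one action character, and the payload
without the XID. A matching reader could look like this minimal sketch
(hypothetical; the apply-side replay code in the patch may differ in
details):

/* read one spooled change from fd; returns false on a clean EOF */
static bool
stream_read_change_sketch(int fd, char *action, StringInfo buf)
{
	int			len;
	ssize_t		nread;

	nread = read(fd, &len, sizeof(len));
	if (nread == 0)
		return false;			/* clean EOF between records */

	if (nread != sizeof(len) ||
		read(fd, action, sizeof(char)) != sizeof(char))
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not read streamed change: %m")));

	/* the remaining (len - 1) bytes are the message payload */
	resetStringInfo(buf);
	enlargeStringInfo(buf, len - 1);

	if (read(fd, buf->data, len - 1) != len - 1)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not read streamed change: %m")));

	buf->len = len - 1;
	buf->data[buf->len] = '\0';

	return true;
}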
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -1746,6 +2776,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.work_mem = MySubscription->workmem;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 536722b32f..ebe0423cc2 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -45,17 +45,45 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order in which the transactions are sent. So streamed transactions
+ * are handled separately, by tracking the toplevel XIDs in streamed_txns.
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
+	List	   *streamed_txns;	/* streamed toplevel transactions with
+								 * this schema */
 	bool		replicate_valid;
 	PublicationActions pubactions;
 } RelationSyncEntry;
@@ -64,11 +92,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -84,16 +118,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, int *logical_decoding_work_mem)
+						List **publication_names, int *logical_decoding_work_mem,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		work_mem_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -162,6 +206,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*logical_decoding_work_mem = (int)parsed;
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
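
With this in place, a subscriber requesting streaming simply passes
another pgoutput option in the START_REPLICATION command, e.g.
(proto_version '2', publication_names '"mypub"', streaming 'on'); the
publication name here is made up for illustration.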
@@ -174,6 +235,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -197,7 +259,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&logical_decoding_work_mem);
+								&logical_decoding_work_mem,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -217,6 +280,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient protocol version, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -284,9 +368,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's the toplevel transaction or a subxact (we've already sent
+	 * the toplevel XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied only later (or not at all,
+	 * if aborted), in a commit order we don't know at this point, and
+	 * regular transactions won't see their effects until then.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to re-send the schema after each catalog change,
+		 * and such a change may occur after streaming has already started,
+		 * so we have to track new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -312,19 +435,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			set_schema_sent_in_streamed_txn(relentry, topxid);
+		else
+			relentry->schema_sent = true;
 	}
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -333,6 +463,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -361,14 +495,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -378,7 +512,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -387,7 +521,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -413,6 +547,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -433,13 +571,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -512,6 +651,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
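
Taken together, these callbacks mean a large transaction reaches the
subscriber as one or more stream_start ... stream_stop blocks of changes
(blocks of different streamed transactions may interleave), terminated by
a single stream_commit or stream_abort, at which point the subscriber
finally applies or discards the spooled changes.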
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -548,6 +772,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions, so a
+ * simple linear search of the list is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  */
@@ -622,6 +874,36 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
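
For example, if the schema for some relation was sent as part of streamed
toplevel XID 1234 (a value made up for illustration) and that transaction
later commits, the XID is dropped from the entry's streamed_txns list and
schema_sent becomes true, so the next non-streamed change for the relation
won't re-send the schema; on abort the XID is dropped without marking the
schema as sent.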
+
 /*
  * Relcache invalidation callback
  */
@@ -657,7 +939,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index bb69683e2a..3085c0f921 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,13 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9c063749b6..63fc2c7ff2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -968,6 +968,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3394379f86..18f416fa78 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -50,6 +50,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	int32		subworkmem;		/* Memory to use to decode changes. */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -76,6 +78,7 @@ typedef struct Subscription
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	int			workmem;		/* Memory to decode changes. */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e5a5d025ba..bbc9112b01 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -945,7 +945,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 2cc2dc4db3..ade4188dd2 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out, TransactionId xid);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 66e89f087b..1e4269ca21 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -163,6 +163,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			int			work_mem;	/* Memory limit to use for decoding */
+			bool		streaming;	/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with subtransactions, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check changes from rolled-back subtransactions are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction containing subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back DDL and DML changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.20.1

v6-0009-Track-statistics-for-streaming.patch
From 31d5e1ed8ced85ea01de18fbd70af0ad38956e55 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Jan 2020 09:45:27 +0530
Subject: [PATCH v6 09/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 25 +++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 13 ++++++++
 src/backend/replication/walsender.c           | 32 ++++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dcb58115af..180ea880a4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1996,6 +1996,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber after
+      the memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+      Streaming only works with toplevel transactions (subtransactions can't
+      be streamed independently), so the counter does not get incremented for
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the subscriber.
+      Transactions may get streamed repeatedly, and this counter gets incremented
+      on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 773edf85e7..cb9e6ee9ea 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -785,7 +785,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
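
Once the view is extended like this, the new counters can be watched next
to the existing spill statistics, e.g.
SELECT application_name, spill_bytes, stream_txns, stream_count, stream_bytes
FROM pg_stat_replication;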
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8e4744f73a..16515d0a12 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3264,6 +3268,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count the transaction again if it has already been streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 63fc2c7ff2..d7f22ae960 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1293,7 +1293,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1314,7 +1314,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or were
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2357,6 +2358,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3196,7 +3200,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3253,6 +3257,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3276,6 +3283,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3362,6 +3372,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stats about streaming of over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3610,11 +3625,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 427faa3c3b..9ef4fbf4f9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5173,9 +5173,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0510d3831f..7259c66c3f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -524,15 +524,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions spilled to disk or streamed to subscriber.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 366828f0a4..3888b0c2f8 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62eaf90a0f..2dcb063912 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1960,9 +1960,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
2.20.1
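
A quick way for reviewers to eyeball the new counters, once the patch
above is applied (a hypothetical psql session; the subscription setup
is assumed, not part of the patch):

    -- on the publisher, while a walsender serves a logical subscription
    SELECT application_name,
           spill_txns, spill_count, spill_bytes,
           stream_txns, stream_count, stream_bytes
    FROM pg_stat_replication;

The stream_* columns stay at zero until some transaction grows large
enough (exceeding logical_work_mem) to be streamed; stream_count may
exceed stream_txns, because a single transaction can be streamed
repeatedly while stream_txns counts it only once.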

v6-0010-Enable-streaming-for-all-subscription-TAP-tests.patch (text/x-diff; charset=us-ascii)
From 7d671806584fff71067c8bde38b2f642ba1331a9 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v6 10/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 77a1560b23..8cd1993393 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -65,7 +65,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e7fe..ad3ed13ffc 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7332..0c9c6b3dd4 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.20.1
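
The change above is mechanical: every CREATE SUBSCRIPTION in the TAP
tests gains the streaming option. For reference, this is the shape of
the statement being toggled (connection string and object names are
just the placeholders the tests use):

    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=... dbname=postgres'
        PUBLICATION tap_pub
        WITH (streaming = on);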

v6-0001-Immediately-WAL-log-assignments.patch (text/x-diff; charset=us-ascii)
From f2cd3a14f9513b880becedd95c933dad53c7d9e3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 09:53:08 +0530
Subject: [PATCH v6 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction a subxact belongs to, in order to decode all the
changes. Until now that knowledge might have been delayed until
commit, due to the caching of assignments (PGPROC_MAX_CACHED_SUBXIDS),
preventing features that require incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT record, as it is still
required to avoid subxid overflow in the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 45 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 39 ++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 017f03b6d8..51557e2951 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -5096,6 +5097,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6002,3 +6004,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment has not yet been written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and the assignment must not have been written to WAL yet */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2fa0a7f667..b11b0c2940 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* if we WAL-logged the toplevel XID, mark the subxact as assigned */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3aa68127a3..8c281e821a 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1165,6 +1165,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1203,6 +1204,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5e1dc8a651..a99fcaf0a1 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, this record belongs to a subxact that
+	 * must be assigned to its toplevel transaction. This applies to all
+	 * records, hence we do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 XLogRecGetXid(record),
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04babc2..8645b3816c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..e23892ab87 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f7cc8c4e1d..4805caec67 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -147,6 +147,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -280,6 +282,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.20.1
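
To illustrate what 0001 changes (a sketch; the table name is invented):
with wal_level = logical, the first WAL record of each subtransaction
now carries the toplevel XID, so the decoder can associate changes with
the toplevel transaction immediately, instead of waiting for the
XLOG_XACT_ASSIGNMENT record written at commit:

    BEGIN;
    SAVEPOINT s1;
    INSERT INTO t VALUES (1);  -- first record of this subxact; the
                               -- toplevel XID is piggy-backed on it
    SAVEPOINT s2;
    INSERT INTO t VALUES (2);  -- likewise for the second subxact
    COMMIT;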

v6-0002-Issue-individual-invalidations-with-wal_level-log.patch (text/x-diff; charset=us-ascii)
From ccc7defbeec7653240c76e6e4dcdb9349b49284d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH v6 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type, XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of the commit record, or executed immediately during decoding
and not added to the reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in
memory and writes them only once, at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c        | 50 +++++++++++++
 src/backend/access/transam/xact.c             |  7 ++
 src/backend/replication/logical/decode.c      | 23 ++++++
 .../replication/logical/reorderbuffer.c       | 55 ++++++++++++--
 src/backend/utils/cache/inval.c               | 75 +++++++++++++++++++
 src/include/access/xact.h                     | 18 ++++-
 src/include/replication/reorderbuffer.h       | 14 ++++
 7 files changed, 234 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index e388cc714a..6e46d19168 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,11 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -397,6 +402,14 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs,
+								xlrec->dbId, xlrec->tsId,
+								xlrec->relcacheInitFileInval);
+	}
 }
 
 const char *
@@ -424,7 +437,44 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval)
+{
+	int			i;
+
+	if (relcacheInitFileInval)
+		appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
+						 dbId, tsId);
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 51557e2951..dd3d36ffb1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6001,6 +6001,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX We ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index a99fcaf0a1..13a11ac782 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,29 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/* XXX for now we're issuing invalidations one by one */
+				Assert(invals->nmsgs == 1);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->dbId, invals->tsId,
+											 invals->relcacheInitFileInval,
+											 invals->msgs[0]);
+
+
+				ReorderBufferXidSetCatalogChanges(reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bbd908af05..531897cf05 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -473,6 +473,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -1822,17 +1823,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2225,6 +2231,38 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Queue an invalidation message as a change in the specified transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn,
+							 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+							 SharedInvalidationMessage msg)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	Assert(xid != InvalidTransactionId);
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.dbId = dbId;
+	change->data.inval.tsId = tsId;
+	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
+	change->data.inval.msg = msg;
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2674,6 +2712,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -2770,6 +2809,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -3055,6 +3095,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..e0d04b9850 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, we also write individual invalidations into
+ *	WAL, to support decoding of in-progress transactions.  Previously it
+ *	was enough to log invalidations only at commit, because transactions
+ *	were only decoded at commit time.  Only catalog cache and relcache
+ *	invalidations need to be logged; there cannot be any active MVCC scan
+ *	in logical decoding, so snapshot invalidations need not be logged.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,9 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -489,6 +499,18 @@ RegisterCatcacheInvalidation(int cacheId,
 {
 	AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   cacheId, hashValue, dbId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cc.id = (int8) cacheId;
+		msg.cc.dbId = dbId;
+		msg.cc.hashValue = hashValue;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -501,6 +523,18 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 {
 	AddCatalogInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								  dbId, catId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cat.id = SHAREDINVALCATALOG_ID;
+		msg.cat.dbId = dbId;
+		msg.cat.catId = catId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -511,6 +545,8 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 static void
 RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 {
+	bool		RelcacheInitFileInval = false;
+
 	AddRelcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
 
@@ -529,7 +565,22 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 	 * as well.  Also zap when we are invalidating whole relcache.
 	 */
 	if (relId == InvalidOid || RelationIdIsInInitFile(relId))
+	{
 		transInvalInfo->RelcacheInitFileInval = true;
+		RelcacheInitFileInval = true;
+	}
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.rc.id = SHAREDINVALRELCACHE_ID;
+		msg.rc.dbId = dbId;
+		msg.rc.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, RelcacheInitFileInval);
+	}
 }
 
 /*
@@ -1501,3 +1552,27 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval)
+{
+	xl_xact_invalidations xlrec;
+
+	/* prepare record */
+	memset(&xlrec, 0, sizeof(xlrec));
+	xlrec.dbId = MyDatabaseId;
+	xlrec.tsId = MyDatabaseTableSpace;
+	xlrec.relcacheInitFileInval = relcacheInitFileInval;
+	xlrec.nmsgs = nmsgs;
+
+	/* perform insertion */
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+	XLogRegisterData((char *) msgs,
+					 nmsgs * sizeof(SharedInvalidationMessage));
+	XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b3816c..6f2a5831ee 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,22 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ *
+ * XXX Currently nmsgs=1 but that might change in the future.
+ */
+typedef struct xl_xact_invalidations
+{
+	Oid			dbId;			/* MyDatabaseId */
+	Oid			tsId;			/* MyDatabaseTableSpace */
+	bool		relcacheInitFileInval;	/* invalidate relcache init file */
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4f6c65d6f4..fa41115db9 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,16 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			Oid			dbId;	/* MyDatabaseId */
+			Oid			tsId;	/* MyDatabaseTableSpace */
+			bool		relcacheInitFileInval;	/* invalidate relcache init
+												 * file */
+			SharedInvalidationMessage msg;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -458,6 +469,9 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void		ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+										 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+										 SharedInvalidationMessage msg);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.20.1
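
As a sketch of when the new records appear (the table name is
invented): with wal_level = logical, catalog changes now emit
XLOG_XACT_INVALIDATIONS immediately, so the invalidations can be
executed while the transaction is still being decoded, rather than
only at commit:

    BEGIN;
    ALTER TABLE t ADD COLUMN c int;  -- relcache/catcache invalidations
                                     -- are WAL-logged right here
    INSERT INTO t (c) VALUES (1);    -- decoded with the new column visible
    COMMIT;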

v6-0003-Extend-the-output-plugin-API-with-stream-methods.patch (text/x-diff; charset=us-ascii)
From 1236dc75d5609f14d359afb66a0a5f8bbfbf353e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v6 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 +++++
 src/include/replication/reorderbuffer.h   |  57 ++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index cd105d91e0..a3efbfdd76 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -542,3 +571,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db968641e..ace21ec8e5 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and one optional callback
+    (<function>stream_message_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point the largest top-level transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bdf4389a57..9c95fc1ed8 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require change/commit/abort callbacks. The
+	 * message callback is optional, similarly to regular output plugins. We
+	 * however consider streaming supported when at least one of the callbacks
+	 * is defined, so that we can easily detect and report missing ones.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -863,6 +911,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f1da..f24e2468ac 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index fa41115db9..79ea33cd26 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -355,6 +355,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -393,6 +439,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.20.1
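
For reference, an output plugin opts into streaming simply by filling in
the new stream_* fields of OutputPluginCallbacks at initialization time.
A minimal sketch, modeled on the test_decoding changes above -- the
pg_decode_stream_* names stand in for plugin-specific handlers, and the
regular (non-streaming) callbacks are elided:

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* ... regular callbacks (begin_cb, change_cb, commit_cb, ...) ... */

	/* required for streaming */
	cb->stream_start_cb = pg_decode_stream_start;
	cb->stream_stop_cb = pg_decode_stream_stop;
	cb->stream_change_cb = pg_decode_stream_change;
	cb->stream_commit_cb = pg_decode_stream_commit;
	cb->stream_abort_cb = pg_decode_stream_abort;

	/* optional */
	cb->stream_message_cb = pg_decode_stream_message;
	cb->stream_truncate_cb = pg_decode_stream_truncate;
}

Note that StartupDecodingContext treats the presence of any one stream_*
callback as "streaming supported", and the wrappers then raise an ERROR
for each required callback that is missing, so a partial registration
fails loudly rather than silently.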

v6-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch (text/x-diff; charset=us-ascii)
From ab09b5f50aa9af74da48148818b1899f907038eb Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH v6 11/12] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 16515d0a12..c0c6a7f86c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1327,6 +1327,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
+		/* make sure subtxn has final_lsn */
+		if (subtxn->final_lsn == InvalidXLogRecPtr)
+			subtxn->final_lsn = txn->final_lsn;
+
 		/*
 		 * Subtransactions are always associated to the toplevel TXN, even if
 		 * they originally were happening inside another subtxn, so we won't
-- 
2.20.1

v6-0012-Add-TAP-test-for-streaming-vs.-DDL.patch (text/x-diff; charset=us-ascii)
From b9bb007d0dccb301acac5c0da13bd1d41e38428a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v6 12/12] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.20.1

#193Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#192)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2020-Jan-10, Alvaro Herrera wrote:

From 7d671806584fff71067c8bde38b2f642ba1331a9 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v6 10/12] Enable streaming for all subscription TAP tests

This patch turns a lot of tests into streamed mode. While it's
great that streaming mode is tested, we should add new tests for it
rather than losing the tests for the non-streamed mode. I suggest
that we add two versions of each test, one for each mode. Maybe the way
to do that is to create some subroutine that can be called twice.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#194Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#185)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have observed one more design issue.

Good observation.

The problem is that when we
get toasted chunks we remember the changes in memory (a hash table)
but don't stream them until we get the actual change on the main table.
Now, the problem is that we might get the changes of the toasted table
and the main table in different streams. So basically, in a stream,
if we have only got the toasted tuples, then even after
ReorderBufferStreamTXN the memory usage will not be reduced.

I think we can't split such changes into different streams (unless we
design an entirely new solution to send partial changes of toast
data), so we need to send them together. We can keep a flag like
data_complete in ReorderBufferTXN and mark it complete only when we
are able to assemble the entire tuple. Now, whenever we try to
stream the changes once we reach the memory threshold, we can check
whether the data_complete flag is true; if so, then only send the
changes, otherwise we can pick the next largest transaction. I think
we can retry it a few times, and if we get incomplete data for
multiple transactions, then we can decide to spill the transaction, or
maybe we can directly spill the first largest transaction which has
incomplete data.

Yeah, we might do something along this line. Basically, we need to mark
the top transaction as data-incomplete if any of its subtransactions
is data-incomplete (it will always be the latest subtransaction of
the top transaction). Also, for streaming we check the largest top
transaction, whereas for spilling we just need the largest
(sub)transaction. So while picking the largest top transaction for
streaming, we also need to decide how to go about the spill if we get
a few transactions with incomplete data. Do we spill all the
subtransactions under this top transaction, or do we again find the
largest (sub)transaction for spilling?

I think it is better to do the latter, as that will lead to spilling
only the required changes (the minimum needed to get memory below the
threshold).

Instead of doing this, can't we just spill the changes which are in
the toast_hash? Basically, at the end of the stream we may have some
toast tuples which we could not stream because we did not have the
insert for the main table; we can then spill only those changes which
are in the toast hash. And in a subsequent stream, whenever we get the
insert for the main table, we can restore those changes and stream
them. We can also maintain a flag saying the data is not complete
(along with the LSN of that change), and after that LSN spill any
toast change to disk until we get the change for the main table;
that way we can avoid building the toast hash until we get the change
for the main table.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#195Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#194)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jan 13, 2020 at 3:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The problem is that when we
get toasted chunks we remember the changes in memory (a hash table)
but don't stream them until we get the actual change on the main table.
Now, the problem is that we might get the changes of the toasted table
and the main table in different streams. So basically, in a stream,
if we have only got the toasted tuples, then even after
ReorderBufferStreamTXN the memory usage will not be reduced.

I think we can't split such changes into different streams (unless we
design an entirely new solution to send partial changes of toast
data), so we need to send them together. We can keep a flag like
data_complete in ReorderBufferTXN and mark it complete only when we
are able to assemble the entire tuple. Now, whenever we try to
stream the changes once we reach the memory threshold, we can check
whether the data_complete flag is true

Here, we can also consider streaming the changes when data_complete is
false but additional changes have since been added to the same txn, as
the new changes might complete the tuple.

; if so, then only send the
changes, otherwise we can pick the next largest transaction. I think
we can retry it a few times, and if we get incomplete data for
multiple transactions, then we can decide to spill the transaction, or
maybe we can directly spill the first largest transaction which has
incomplete data.

Yeah, we might do something along this line. Basically, we need to mark
the top transaction as data-incomplete if any of its subtransactions
is data-incomplete (it will always be the latest subtransaction of
the top transaction). Also, for streaming we check the largest top
transaction, whereas for spilling we just need the largest
(sub)transaction. So while picking the largest top transaction for
streaming, we also need to decide how to go about the spill if we get
a few transactions with incomplete data. Do we spill all the
subtransactions under this top transaction, or do we again find the
largest (sub)transaction for spilling?

I think it is better to do the latter, as that will lead to spilling
only the required changes (the minimum needed to get memory below the
threshold).

Instead of doing this, can't we just spill the changes which are in
the toast_hash? Basically, at the end of the stream we may have some
toast tuples which we could not stream because we did not have the
insert for the main table; we can then spill only those changes which
are in the toast hash.

Hmm, I think this can turn out to be inefficient, because we can easily
end up spilling the data even when we don't need to do so. Consider
cases where part of the streamed changes are for toast, and the rest
are changes which we would have streamed and hence can be removed.
In such cases, we could have easily consumed the remaining changes for
toast without spilling. Also, I am not sure spilling changes from
the hash table is a good idea, as they are no longer in the same order
as they were in the ReorderBuffer, which means the order in which we
would normally serialize the changes would differ, and that might have
some impact; so we would need some more study if we want to pursue this idea.
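
To make the data_complete idea concrete, here is a rough sketch of the
picking logic. The data_complete flag and the helper name are
hypothetical (they are not in the posted patches), and txn->size assumes
the per-transaction memory accounting added earlier in this series:

/*
 * Sketch: find the largest toplevel transaction whose tuple data is
 * fully assembled (no pending toast chunks).  If this returns NULL,
 * the caller would fall back to spilling to disk instead.
 */
static ReorderBufferTXN *
ReorderBufferLargestStreamableTXN(ReorderBuffer *rb)
{
	dlist_iter	iter;
	ReorderBufferTXN *largest = NULL;

	dlist_foreach(iter, &rb->toplevel_by_lsn)
	{
		ReorderBufferTXN *txn;

		txn = dlist_container(ReorderBufferTXN, node, iter.cur);

		/* skip transactions still waiting for toast chunks */
		if (!txn->data_complete)
			continue;

		if (largest == NULL || txn->size > largest->size)
			largest = txn;
	}

	return largest;
}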

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#196Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#192)
12 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Jan 11, 2020 at 3:07 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2020-Jan-10, Alvaro Herrera wrote:

Here's a rebase of this patch series. I didn't change anything except

... this time with attachments ...

The patch set failed to apply on head, so I have rebased it. (Rebased on
commit cebf9d6e6ee13cbf9f1a91ec633cf96780ffc985.)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v7-0002-Issue-individual-invalidations-with-wal_level-log.patch (application/octet-stream)
From 6d920bb1d396c18fd0308a01d03ea105684bf388 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH v7 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager.
See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in
memory and writes them out only once, at commit time, which reduces
the performance impact by amortizing the overhead and deduplicating
the invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c          | 50 +++++++++++++++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 23 ++++++++
 src/backend/replication/logical/reorderbuffer.c | 55 +++++++++++++++---
 src/backend/utils/cache/inval.c                 | 75 +++++++++++++++++++++++++
 src/include/access/xact.h                       | 18 +++++-
 src/include/replication/reorderbuffer.h         | 14 +++++
 7 files changed, 234 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index e388cc7..6e46d19 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,11 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -397,6 +402,14 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs,
+								xlrec->dbId, xlrec->tsId,
+								xlrec->relcacheInitFileInval);
+	}
 }
 
 const char *
@@ -424,7 +437,44 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval)
+{
+	int			i;
+
+	if (relcacheInitFileInval)
+		appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
+						 dbId, tsId);
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 51557e2..dd3d36f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6001,6 +6001,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index a99fcaf..13a11ac 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,29 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/* XXX for now we're issuing invalidations one by one */
+				Assert(invals->nmsgs == 1);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->dbId, invals->tsId,
+											 invals->relcacheInitFileInval,
+											 invals->msgs[0]);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bbd908a..531897c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -473,6 +473,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -1822,17 +1823,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2227,6 +2233,38 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn,
+							 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+							 SharedInvalidationMessage msg)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	Assert(xid != InvalidTransactionId);
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.dbId = dbId;
+	change->data.inval.tsId = tsId;
+	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
+	change->data.inval.msg = msg;
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2674,6 +2712,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -2770,6 +2809,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -3055,6 +3095,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..e0d04b9 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write individual invalidations into WAL to support
+ *	decoding of in-progress transactions.  Until now it was enough to log
+ *	invalidations only at commit, because we only decode the transaction at
+ *	commit time.  We only need to log the catalog cache and relcache
+ *	invalidations.  There cannot be any active MVCC scan in logical decoding,
+ *	so we don't need to log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,9 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -489,6 +499,18 @@ RegisterCatcacheInvalidation(int cacheId,
 {
 	AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   cacheId, hashValue, dbId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cc.id = (int8) cacheId;
+		msg.cc.dbId = dbId;
+		msg.cc.hashValue = hashValue;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -501,6 +523,18 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 {
 	AddCatalogInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								  dbId, catId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cat.id = SHAREDINVALCATALOG_ID;
+		msg.cat.dbId = dbId;
+		msg.cat.catId = catId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -511,6 +545,8 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 static void
 RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 {
+	bool		RelcacheInitFileInval = false;
+
 	AddRelcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
 
@@ -529,7 +565,22 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 	 * as well.  Also zap when we are invalidating whole relcache.
 	 */
 	if (relId == InvalidOid || RelationIdIsInInitFile(relId))
+	{
 		transInvalInfo->RelcacheInitFileInval = true;
+		RelcacheInitFileInval = true;
+	}
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.rc.id = SHAREDINVALRELCACHE_ID;
+		msg.rc.dbId = dbId;
+		msg.rc.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, RelcacheInitFileInval);
+	}
 }
 
 /*
@@ -1501,3 +1552,27 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval)
+{
+	xl_xact_invalidations xlrec;
+
+	/* prepare record */
+	memset(&xlrec, 0, sizeof(xlrec));
+	xlrec.dbId = MyDatabaseId;
+	xlrec.tsId = MyDatabaseTableSpace;
+	xlrec.relcacheInitFileInval = relcacheInitFileInval;
+	xlrec.nmsgs = nmsgs;
+
+	/* perform insertion */
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+	XLogRegisterData((char *) msgs,
+					 nmsgs * sizeof(SharedInvalidationMessage));
+	XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b38..6f2a583 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,22 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ *
+ * XXX Currently nmsgs=1 but that might change in the future.
+ */
+typedef struct xl_xact_invalidations
+{
+	Oid			dbId;			/* MyDatabaseId */
+	Oid			tsId;			/* MyDatabaseTableSpace */
+	bool		relcacheInitFileInval;	/* invalidate relcache init file */
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4f6c65d..fa41115 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,16 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			Oid			dbId;	/* MyDatabaseId */
+			Oid			tsId;	/* MyDatabaseTableSpace */
+			bool		relcacheInitFileInval;	/* invalidate relcache init
+												 * file */
+			SharedInvalidationMessage msg;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -458,6 +469,9 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void		ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+										 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+										 SharedInvalidationMessage msg);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1

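To make the new record concrete, here is a minimal sketch of how the
decoding side could consume XLOG_XACT_INVALIDATIONS, queueing each
message via the new ReorderBufferAddInvalidation() API. The handler
name and its wiring into DecodeXactOp() are assumptions here, as the
corresponding decode.c hunk is not shown above.

/*
 * Hypothetical decode.c handler for XLOG_XACT_INVALIDATIONS.
 */
static void
DecodeInvalidations(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
	XLogReaderState *r = buf->record;
	xl_xact_invalidations *xlrec;
	int			i;

	xlrec = (xl_xact_invalidations *) XLogRecGetData(r);

	/* queue each message under the xid that emitted the record */
	for (i = 0; i < xlrec->nmsgs; i++)
		ReorderBufferAddInvalidation(ctx->reorder, XLogRecGetXid(r),
									 buf->origptr,
									 xlrec->dbId, xlrec->tsId,
									 xlrec->relcacheInitFileInval,
									 xlrec->msgs[i]);
}

Since the record currently carries nmsgs=1, the loop is mostly
future-proofing, matching the XXX note on the struct.
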
Attachment: v7-0001-Immediately-WAL-log-assignments.patch (application/octet-stream)
From 6d97e11e93fc4de09af087c884014d54b21fb15e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 09:53:08 +0530
Subject: [PATCH v7 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that knowledge might only have been available at
commit time, due to the subxid caching (PGPROC_MAX_CACHED_SUBXIDS),
preventing features that require incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is still
required to avoid overflowing the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 45 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 +++++++++++++--------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 017f03b..51557e2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -5096,6 +5097,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6002,3 +6004,44 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and the assignment must not have been written to WAL yet */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2fa0a7f..b11b0c2 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3aa6812..8c281e8 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1165,6 +1165,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1203,6 +1204,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5e1dc8a..a99fcaf 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 XLogRecGetXid(record),
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033f..e23892a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f7cc8c4..4805cae 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -147,6 +147,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -280,6 +282,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

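One detail in the xloginsert.c hunk above deserves a note:
XLogSetRecordFlags() now ORs into curinsert_flags instead of
overwriting it, so the internally-set XLOG_INCLUDE_XID cannot clobber
(or be clobbered by) flags requested by the caller. A small
illustrative sketch of the resulting behavior; the caller shown is
hypothetical:

#include "access/xlog.h"
#include "access/xloginsert.h"

/* hypothetical caller, for illustration only */
static void
log_example_record(void)
{
	XLogBeginInsert();
	XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);	/* request origin info */
	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);	/* both flags now stick */

	/*
	 * XLOG_INCLUDE_XID is never set by callers: XLogRecordAssemble()
	 * ORs it in when a subxact assignment is pending, and
	 * XLogResetInsertion() then marks the assignment as written.
	 */
}
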
Attachment: v7-0005-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From e043a337a1b6bd390350e12792c8b632c52915f8 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 10 Jan 2020 18:03:27 -0300
Subject: [PATCH v7 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke new stream API methods. This
happens in ReorderBufferStreamTXN() using about the same logic as
in ReorderBufferCommit() logic.

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c     |  38 +-
 src/backend/replication/logical/reorderbuffer.c | 691 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  36 ++
 3 files changed, 672 insertions(+), 93 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..160b167 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, we have not
+		 * decoded the combocid yet. That means the cmin is definitely
+		 * in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, we have not
+		 * decoded the combocid yet. That means the cmax is definitely
+		 * in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2da0a23..50341a6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -769,6 +782,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -864,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -987,7 +1035,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1023,6 +1071,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1037,6 +1088,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1320,6 +1374,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1345,8 +1408,93 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally happened inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with a CID we have not decoded yet. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. Building the hash table ensures that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding a transaction at commit time (at which point we
+ * are guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1354,9 +1502,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1495,63 +1640,48 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * build data to be able to lookup the CommandIds of catalog tuples
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
-	}
-
-	snapshot_now = txn->base_snapshot;
-
-	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1567,15 +1697,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1583,6 +1718,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			if (streaming)
+			{
+				/*
+				 * While streaming an in-progress transaction, there is a
+				 * possibility that the (sub)transaction might get aborted
+				 * concurrently.  In that case, if the (sub)transaction has
+				 * made catalog updates, we might decode tuples using the
+				 * wrong catalog version.  To detect a concurrent abort, we
+				 * set CheckXidAlive to the xid of the (sub)transaction this
+				 * change belongs to.  During a catalog scan we can then
+				 * check the status of that xid, and if it has aborted, we
+				 * report a specific error which we can ignore.  We might
+				 * have already streamed some of the changes for the aborted
+				 * (sub)transaction, but that is fine: when we decode the
+				 * abort, we will send a stream-abort message that truncates
+				 * the changes on the subscriber.
+				 */
+				CheckXidAlive = change->txn->xid;
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1592,8 +1756,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1659,7 +1821,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1680,8 +1850,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1699,7 +1867,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1757,7 +1925,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1766,10 +1942,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+									change->data.msg.prefix,
+									change->data.msg.message_size,
+									change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1800,9 +1982,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +2004,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
 					}
 
 					break;
@@ -1860,14 +2043,40 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call the stream_stop callback for a
+		 * streaming transaction, or the commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last LSN of the stream as the final_lsn before calling
+			 * stream_stop.
+			 */
+			txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+			txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+													  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1885,14 +2094,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,18 +2128,117 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+				/*
+				 * Set the last LSN of the stream as the final_lsn before
+				 * calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+
+				FlushErrorState();
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
 /*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Access the main routine to decode the changes and send to output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -1946,6 +2262,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2030,6 +2353,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2165,8 +2495,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2174,6 +2513,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2185,19 +2525,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2315,6 +2665,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2419,6 +2776,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting of subtransactions when streaming, so their size is always 0).
+ * iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2438,15 +2827,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2739,6 +3159,101 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (it may have been streamed right before the commit, in which case
+ * the commit would attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that has not been called yet as the
+	 * transaction is still in progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* this must be the first time we stream this transaction */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because new
+		 * sub-transactions may have started after the last streaming run,
+		 * and we need to add their XIDs to the snapshot's subxip array.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Access the main routine to decode the changes and send to output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 79ea33c..629eeca 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -192,6 +193,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -226,6 +245,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Have we sent any changes for this transaction in output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -256,6 +285,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1

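Stepping back from the diff: the reorderbuffer now drives a set of
rb->stream_* callbacks. Below is a rough skeleton of what an output
plugin might provide for them; the callback signatures are an
assumption, since the output plugin API patch of the series is not
included in this excerpt.

#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"
#include "replication/reorderbuffer.h"

/* open a streamed block of changes for txn->xid on the downstream */
static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "stream start: xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* close the current streamed block; more blocks may follow later */
static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
}

/* the downstream discards everything streamed for this (sub)xact */
static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
}

/* the downstream applies all streamed blocks atomically at commit */
static void
my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 XLogRecPtr commit_lsn)
{
}

This matches the protocol implied by ReorderBufferProcessTXN() above:
stream_start/stream_stop demarcate each run of decoded changes,
stream_abort is sent for a (sub)xact only if data was already sent for
it, and stream_commit concludes the toplevel transaction.
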
Attachment: v7-0004-Gracefully-handle-concurrent-aborts-of-uncommitte.patch (application/octet-stream)
From c6b0c427af35c8829a20f5ecc2f4ec6d93b3aaac Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:32:34 +0530
Subject: [PATCH v7 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of this sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 41 +++++++++++++++++++++
 src/backend/access/index/genam.c                | 49 +++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  8 ++--
 src/backend/utils/time/snapmgr.c                | 25 ++++++++++++-
 src/include/utils/snapmgr.h                     |  4 +-
 6 files changed, 124 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index ace21ec..319349a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7b8490d..2d4ef48 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1303,6 +1303,15 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1422,6 +1431,14 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1535,6 +1552,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_hot_search_buffer call during logical decoding");
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1683,6 +1708,14 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_get_latest_tid call during logical decoding");
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5481,6 +5514,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with a
+	 * valid CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_finish_speculative call during logical decoding");
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05..5644b8d 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,22 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  Instead of directly checking the abort status, we check
+	 * whether the transaction is neither in progress nor committed, because
+	 * after a system crash the status of transactions that were running at
+	 * that time might not have been marked.  So we need to consider them as
+	 * aborted.  Refer to the detailed comments in snapmgr.c where the
+	 * variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +530,22 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  Instead of directly checking the abort status, we check
+	 * whether the transaction is neither in progress nor committed, because
+	 * after a system crash the status of transactions that were running at
+	 * that time might not have been marked.  So we need to consider them as
+	 * aborted.  Refer to the detailed comments in snapmgr.c where the
+	 * variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +672,22 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  Instead of directly checking the abort status, we check
+	 * whether the transaction is neither in progress nor committed, because
+	 * after a system crash the status of transactions that were running at
+	 * that time might not have been marked.  So we need to consider them as
+	 * aborted.  Refer to the detailed comments in snapmgr.c where the
+	 * variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 531897c..2da0a23 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -692,7 +692,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1551,7 +1551,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1802,7 +1802,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +1822,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c5..93a0c04 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding; such a transaction can get aborted
+ * while the decoding is still ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive, so that the XID status can be re-checked during catalog access.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set CheckXidAlive if the transaction has not committed yet. We don't
+	 * check here whether the xid aborted; that happens during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13c..12f737b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1
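
A note for reviewers: the recheck above is deliberately open-coded at each of
the three systable_* call sites. As a minimal sketch, the pattern boils down
to the following helper (the helper name is mine and does not exist in the
patch; the call sites simply inline this check):

#include "postgres.h"

#include "access/transam.h"		/* TransactionIdDidCommit */
#include "storage/procarray.h"	/* TransactionIdIsInProgress */
#include "utils/snapmgr.h"		/* CheckXidAlive */

/*
 * Sketch only: treat the transaction being decoded as aborted (and bail
 * out of the catalog scan) if it is neither in progress nor committed.
 */
static void
recheck_decoded_xid_alive(void)
{
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}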

v7-0003-Extend-the-output-plugin-API-with-stream-methods.patch
From 7ef911c973808a9c6426a83d723040b6b862762d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v7 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index cd105d9..a3efbfd 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -542,3 +571,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db9686..ace21ec 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname> setting.
+    At that point the largest toplevel transaction (measured by the amount of memory
+    currently used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bdf4389..9c95fc1 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/change/commit/abort
+	 * callbacks. The message and truncate callbacks are optional, similarly
+	 * to regular output plugins. We however enable streaming when at least
+	 * one of the methods is defined, so we can easily identify missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional, so we
+	 * do not fail with an ERROR when they are missing; their wrappers simply
+	 * do nothing. We must still set the ReorderBuffer callbacks to something,
+	 * otherwise the calls from there would crash (we don't want to move the
+	 * checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -863,6 +911,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f..f24e246 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..0d0a94a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to the remote node from an
+ * in-progress transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to the remote node from an
+ * in-progress transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if the transaction is streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when done streaming a block of changes from an in-progress
+ * transaction to the remote node (may be called repeatedly, if the
+ * transaction is streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index fa41115..79ea33c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -355,6 +355,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -394,6 +440,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
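
For context, here is a minimal sketch of how an output plugin would opt in
to the new API, mirroring the test_decoding changes above (the my_stream_*
names are placeholders, not part of the patch; assigning any one of the
stream callbacks sets ctx->streaming, and the wrappers then insist on the
five required ones):

#include "postgres.h"

#include "fmgr.h"
#include "replication/logical.h"

PG_MODULE_MAGIC;

extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);

/* Placeholder callbacks; a real plugin would emit protocol output here. */

static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* a block of streamed changes for txn->xid is opening */
}

static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* the current block of streamed changes is closing */
}

static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)
{
	/* one change from an in-progress transaction */
}

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	/* discard whatever was streamed for this (sub)transaction */
}

static void
my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 XLogRecPtr commit_lsn)
{
	/* the streamed transaction committed; apply it for real */
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* regular callbacks (begin_cb, change_cb, commit_cb, ...) omitted */

	/* the five callbacks required once streaming kicks in */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_change_cb = my_stream_change;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;

	/* stream_message_cb and stream_truncate_cb may be left unset */
}

Note that the optional callbacks can simply be left NULL; the wrappers
treat a missing stream_message_cb or stream_truncate_cb as a no-op.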

v7-0006-Fix-speculative-insert-bug.patch
From 719fb75d324938c2e397079776f437b6e00b765f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 10 Jan 2020 09:01:35 +0530
Subject: [PATCH v7 06/12] Fix speculative insert bug.

---
 src/backend/replication/logical/reorderbuffer.c | 23 +++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  6 ++++++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 50341a6..8e4744f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1701,6 +1701,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
+		/*
+		 * Restore any previously speculatively inserted tuple if we are
+		 * running in streaming mode.
+		 */
+		if (streaming && txn->specinsert != NULL)
+		{
+			specinsert = txn->specinsert;
+			txn->specinsert = NULL;
+		}
+
 		if (using_subtxn)
 			BeginInternalSubTransaction("stream");
 		else
@@ -2029,13 +2039,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		}
 
 		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
+		 * In non-streaming mode, if there's a speculative insertion remaining,
+		 * just clean it up; it can't have been successful, otherwise we'd have
+		 * gotten a confirmation record.  In streaming mode, remember the tuple
+		 * so that if we get the confirmation in the next stream we can stream
+		 * it then.
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			if (streaming)
+				txn->specinsert = specinsert;
+			else
+				ReorderBufferReturnChange(rb, specinsert);
 			specinsert = NULL;
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 629eeca..0510d38 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -343,6 +343,12 @@ typedef struct ReorderBufferTXN
 	uint32		ninvalidations;
 	SharedInvalidationMessage *invalidations;
 
+	/*
+	 * Speculative insert saved from the last streamed run, in case the
+	 * speculative confirm was not received in the same stream.
+	 */
+	ReorderBufferChange *specinsert;
+
 	/* ---
 	 * Position in one of three lists:
 	 * * list of subtransactions if we are *known* to be subxact
-- 
1.8.3.1
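
In short, the fix parks the unconfirmed speculative insert on the toplevel
ReorderBufferTXN between streamed runs. A condensed sketch of the resulting
flow (process_streamed_run is not a real function, it merely collapses the
two hunks above; the per-change loop is elided):

#include "postgres.h"

#include "replication/reorderbuffer.h"

static void
process_streamed_run(ReorderBuffer *rb, ReorderBufferTXN *txn, bool streaming)
{
	ReorderBufferChange *specinsert = NULL;

	/* pick up a speculative insert left unconfirmed by the previous run */
	if (streaming && txn->specinsert != NULL)
	{
		specinsert = txn->specinsert;
		txn->specinsert = NULL;
	}

	/*
	 * ... iterate over this run's changes: a speculative insert change
	 * sets specinsert, and a later speculative confirm streams the tuple
	 * and resets the pointer ...
	 */

	if (specinsert != NULL)
	{
		if (streaming)
			txn->specinsert = specinsert;	/* confirm may arrive next run */
		else
			ReorderBufferReturnChange(rb, specinsert);	/* must have failed */
	}
}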

v7-0008-Add-support-for-streaming-to-built-in-replication.patch
From 38e572e314fd8f55e9359403568113bd642ee4ab Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 13 Jan 2020 14:24:39 +0530
Subject: [PATCH v7 08/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and allow adding additional bits of information (e.g.
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We however must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    5 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   60 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    8 +-
 src/backend/replication/logical/launcher.c         |    2 +
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  157 ++-
 src/backend/replication/logical/worker.c           | 1031 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  310 +++++-
 src/backend/replication/slotfuncs.c                |    7 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2074 insertions(+), 42 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 6dfb2e4..e1fb907 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 91790b0..d9abf5e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -218,6 +218,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 5cd1daa..1dc486c 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->workmem = subform->subworkmem;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index c50e854..a486bd3 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
 						   bool *refresh, int *logical_wm,
-						   bool *logical_wm_given)
+						   bool *logical_wm_given, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -92,6 +93,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*refresh = true;
 	if (logical_wm)
 		*logical_wm_given = false;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -186,6 +189,26 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 
 			*logical_wm_given = true;
 			*logical_wm = defGetInt32(defel);
+
+			/*
+			 * Check that the value is not smaller than 64kB (which is
+			 * the minimum value for logical_work_mem).
+			 */
+			if (*logical_wm < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("%d is outside the valid range for parameter \"work_mem\" (64 .. 2147483647)",
+								*logical_wm)));
+		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
 		}
 		else
 			ereport(ERROR,
@@ -332,6 +355,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	int			logical_wm;
 	bool		logical_wm_given;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -348,7 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL, &logical_wm, &logical_wm_given);
+							   NULL, &logical_wm, &logical_wm_given,
+							   &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -430,7 +456,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 		values[Anum_pg_subscription_subworkmem - 1] =
 			Int32GetDatum(logical_wm);
 	else
-		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(-1);
+
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
 
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
@@ -691,11 +725,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				int			logical_wm;
 				bool		logical_wm_given;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
 										   NULL, &synchronous_commit, NULL,
-										   &logical_wm, &logical_wm_given);
+										   &logical_wm, &logical_wm_given,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -727,6 +764,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subworkmem - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -739,7 +783,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
 										   NULL, NULL, NULL, NULL, NULL,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -777,7 +821,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh, NULL, NULL);
+										   NULL, &refresh, NULL, NULL,
+										   NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -814,7 +859,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 51c486b..03ef76c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4102,6 +4102,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 099d21b..470600a 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,8 +406,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
-		appendStringInfo(&cmd, ", work_mem '%d'",
-						 options->proto.logical.work_mem);
+		if (options->proto.logical.work_mem != -1)
+			appendStringInfo(&cmd, ", work_mem '%d'",
+							 options->proto.logical.work_mem);
+
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
 
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e..e80d00c 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,9 @@
  *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/types.h>
+#include <unistd.h>
+
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 9c95fc1..61064f3 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1149,7 +1149,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..93780b2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,13 +257,18 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
+	pq_sendbyte(out, 'D');		/* action DELETE */
+
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
-	pq_sendbyte(out, 'D');		/* action DELETE */
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
 
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,119 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
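+/*
+ * Write STREAM START to the output stream.
+ */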
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendint32(out, first_segment ? 1 : 0);
+}
+
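+/*
+ * Read STREAM START from the input stream.
+ */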
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgint(in, 4) == 1);
+
+	return xid;
+}
+
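+/*
+ * Write STREAM STOP to the output stream.
+ */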
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (the stream is in progress, so it must be valid) */
+	pq_sendint32(out, xid);
+}
+
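+/*
+ * Read STREAM STOP from the input stream.
+ */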
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+	TransactionId xid;
+
+	xid = pq_getmsgint(in, 4);
+
+	return xid;
+}
+
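+/*
+ * Write STREAM COMMIT to the output stream.
+ */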
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (the transaction was streamed, so it must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
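+/*
+ * Read STREAM COMMIT from the input stream.
+ */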
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
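+/*
+ * Write STREAM ABORT to the output stream.
+ */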
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID and subtransaction ID (both must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
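+/*
+ * Read STREAM ABORT from the input stream.
+ */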
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 48b960c..5c20c0e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions has
+ * to handle aborts of both the toplevel transaction and subtransactions. This
+ * is achieved by tracking offsets for subtransactions, which is then used
+ * to truncate the file with serialized changes.
+ *
+ * The files are placed in the default tablespace's temporary-file directory
+ * (see changes_filename), and the filenames include both the OID of the
+ * subscription and the XID of the toplevel transaction. This is necessary
+ * so that workers of different subscriptions processing a remote
+ * transaction with the same XID don't interfere with each other.
+ *
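+ * A streamed transaction thus arrives as one or more blocks of
+ * stream-start / changes / stream-stop messages, followed by either a
+ * stream-commit (replay the spooled changes) or a stream-abort (discard
+ * the spooled changes of the aborted (sub)transaction).
+ *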
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -60,6 +82,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -67,6 +90,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -106,12 +130,59 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -163,6 +234,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -529,6 +636,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the subxact info serialized
+	 * by the previous stream_stop.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found PG_USED_FOR_ASSERTS_ONLY = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* FIXME optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/* We should not receive aborts for unknown subtransactions. */
+		Assert(found);
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the handle apply_dispatch methods are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
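+	 *
+	 * Each entry was written by stream_write_change: an int32 length,
+	 * followed by the action byte and the rest of the message contents.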
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -541,6 +960,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -556,6 +978,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -591,6 +1016,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -695,6 +1123,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -830,6 +1261,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -929,6 +1363,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1020,6 +1457,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1117,6 +1570,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleaning up files for %d transactions", nxids);
+
+	for (i = nxids - 1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1132,6 +1601,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1580,6 +2052,564 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main
+ * changes file. The file is always over-written as a whole, and we also
+ * include a CRC32C checksum of the information.
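+ *
+ * On-disk layout (matching the writes below):
+ *   uint32        CRC32C checksum of the two following items
+ *   uint32        number of subxact entries (nsubxacts)
+ *   SubXactInfo   array of nsubxacts {xid, offset} entries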
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we free the memory allocated for the subxact info. There might be
+	 * one exceptional transaction with many subxacts, and we don't want to
+	 * keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so we can simply ignore it (only the offset of the
+	 * first change for each subxact matters).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry in the array into the now-free slot. We don't
+	 * keep the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts),
+	 * so a linear search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: the length (not including
+ * the length field itself), the action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so would not be as
+ * straightforward, because we write the file in chunks.
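+ *
+ * On-disk record layout (as written below):
+ *   int32   total size of the record (action byte plus message contents)
+ *   char    action code identifying the message type
+ *   ...     message contents, minus the already-consumed subxact XID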
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -1746,6 +2776,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.work_mem = MySubscription->workmem;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 536722b..ebe0423 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -45,17 +45,45 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order the transactions are sent in. So streamed transactions are
+ * tracked separately, via the streamed_txns list in each entry.
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
+	List	   *streamed_txns;	/* streamed toplevel transactions with
+								 * this schema */
 	bool		replicate_valid;
 	PublicationActions pubactions;
 } RelationSyncEntry;
@@ -64,11 +92,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -84,16 +118,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, int *logical_decoding_work_mem)
+						List **publication_names, int *logical_decoding_work_mem,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		work_mem_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -162,6 +206,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*logical_decoding_work_mem = (int)parsed;
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -174,6 +235,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -197,7 +259,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&logical_decoding_work_mem);
+								&logical_decoding_work_mem,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -217,6 +280,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -284,9 +368,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's the top-level transaction or a subtransaction (the XID of
+	 * the top-level transaction was already sent at the start of the current
+	 * streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send schema after each catalog change and it may
+		 * occur when streaming already started, so we have to track new catalog
+		 * changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -312,19 +435,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			set_schema_sent_in_streamed_txn(relentry, topxid);
+		else
+			relentry->schema_sent = true;
 	}
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called in both streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -333,6 +463,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -361,14 +495,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -378,7 +512,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -387,7 +521,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -413,6 +547,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -433,13 +571,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -513,6 +652,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
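+/*
+ * Send the stream-start message for a chunk of an in-progress transaction.
+ */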
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
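+/*
+ * Send the stream-stop message, closing the current chunk of the
+ * in-progress transaction.
+ */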
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -549,6 +773,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check if the schema was already sent for the given streamed transaction.
+ * We expect a relatively small number of streamed transactions, so a plain
+ * list search is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
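+/*
+ * Remember that the schema was sent within the given streamed (toplevel)
+ * transaction.
+ */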
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  */
 static RelationSyncEntry *
@@ -623,6 +875,36 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -657,7 +939,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index bb69683..3085c0f 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,13 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9c06374..63fc2c7 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -969,6 +969,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3394379..18f416f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -50,6 +50,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	int32		subworkmem;		/* Memory to use to decode changes. */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -76,6 +78,7 @@ typedef struct Subscription
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	int			workmem;		/* Memory to decode changes. */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e5a5d02..bbc9112 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -945,7 +945,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 2cc2dc4..ade4188 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out, TransactionId xid);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 66e89f0..1e4269c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -163,6 +163,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			int			work_mem;	/* Memory limit to use for decoding */
+			bool		streaming;	/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransactions are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction containing DDL, subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1
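
For context, this is roughly how the feature is meant to be exercised
end-to-end. A minimal sketch only - the connection string is just a
placeholder, and the option spelling matches what the TAP-test patch
later in this series uses:

    -- publisher: keep the decoding memory limit small, so that even
    -- modest transactions exceed it and get streamed (the TAP tests
    -- above do the same via postgresql.conf)
    ALTER SYSTEM SET logical_decoding_work_mem = '64kB';
    SELECT pg_reload_conf();

    -- subscriber: request streaming of large in-progress transactions
    CREATE SUBSCRIPTION tap_sub
      CONNECTION 'host=publisher dbname=postgres'
      PUBLICATION tap_pub
      WITH (streaming = on);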

Attachment: v7-0009-Track-statistics-for-streaming.patch (application/octet-stream)
From b1a405a2f2408015cb402fd6604579662ea58a4e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Jan 2020 09:45:27 +0530
Subject: [PATCH v7 09/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dcb5811..180ea88 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1996,6 +1996,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber after
+      the memory used by logical decoding exceeds
+      <literal>logical_decoding_work_mem</literal>. Streaming only works with
+      toplevel transactions (subtransactions can't be streamed independently),
+      so the counter does not get incremented for subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the
+      subscriber. Transactions may get streamed repeatedly, and this counter
+      gets incremented on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the
+      subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 773edf8..cb9e6ee 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -785,7 +785,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8e4744f..16515d0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3264,6 +3268,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count transactions that have already been streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 63fc2c7..d7f22ae 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1293,7 +1293,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1314,7 +1314,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that were spilled to disk or
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2357,6 +2358,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3196,7 +3200,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3253,6 +3257,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3276,6 +3283,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3362,6 +3372,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3610,11 +3625,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 427faa3..9ef4fbf 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5173,9 +5173,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0510d38..7259c66 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -524,15 +524,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 366828f..3888b0c 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62eaf90..2dcb063 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1960,9 +1960,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1
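
With the patch above applied, the new counters show up next to the
existing spill counters in pg_stat_replication. A quick way to compare
how much decoded data was streamed vs. spilled per walsender (just an
illustrative query, not part of the patch):

    SELECT application_name,
           spill_txns, spill_count, pg_size_pretty(spill_bytes) AS spilled,
           stream_txns, stream_count, pg_size_pretty(stream_bytes) AS streamed
      FROM pg_stat_replication;

Note that a single large transaction may bump stream_count many times
(once per streaming invocation), while stream_txns counts each toplevel
transaction only once.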

Attachment: v7-0007-Support-logical_decoding_work_mem-set-from-create.patch (application/octet-stream)
From 307c8bf57e04bf93faadc1c7f2bef6368e5dbd0b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:51:04 +0530
Subject: [PATCH v7 07/12] Support logical_decoding_work_mem set from create
 subscription command

---
 doc/src/sgml/config.sgml                           | 21 +++++++++++
 doc/src/sgml/ref/create_subscription.sgml          | 12 ++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/commands/subscriptioncmds.c            | 44 +++++++++++++++++++---
 .../libpqwalreceiver/libpqwalreceiver.c            |  3 ++
 src/backend/replication/logical/worker.c           |  1 +
 src/backend/replication/pgoutput/pgoutput.c        | 30 ++++++++++++++-
 src/include/catalog/pg_subscription.h              |  3 ++
 src/include/replication/walreceiver.h              |  1 +
 9 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c902..8b1923c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are spilled to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..91790b0 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index f77a83b..5cd1daa 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9bfe142..c50e854 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm_given)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -668,10 +689,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -696,6 +720,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -707,7 +738,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -745,7 +777,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -782,7 +814,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 658af71..099d21b 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 7a5471f..48b960c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1745,6 +1745,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7525082..536722b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -18,6 +18,7 @@
 #include "replication/logicalproto.h"
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
+#include "utils/guc.h"
 #include "utils/int8.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
@@ -87,11 +88,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +139,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
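+			/* value is in kB, same minimum (64kB) as logical_decoding_work_mem */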
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -171,7 +196,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..3394379 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a276237..66e89f0 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -162,6 +162,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
1.8.3.1
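
To illustrate how the option from the patch above is meant to be used -
a sketch only; the subscription name and connection string are
placeholders, and the value is interpreted in kB, with anything below
64 rejected on the pgoutput side:

    CREATE SUBSCRIPTION big_tx_sub
      CONNECTION 'host=publisher dbname=postgres'
      PUBLICATION tap_pub
      WITH (work_mem = 65536);  -- 64MB, overriding logical_decoding_work_mem

    -- the limit can be changed later without recreating the subscription
    ALTER SUBSCRIPTION big_tx_sub SET (work_mem = 131072);  -- 128MB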

Attachment: v7-0010-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From dfc942975061012a931debb4fccb466b38d7c029 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v7 10/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71..086d0c7 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e..ad3ed13 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v7-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
From 4bd9c9773941c5b645e793f55d3063d06c9ce7ac Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH v7 11/12] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 16515d0..c0c6a7f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1327,6 +1327,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
+		/* make sure subtxn has final_lsn */
+		if (subtxn->final_lsn == InvalidXLogRecPtr)
+			subtxn->final_lsn = txn->final_lsn;
+
 		/*
 		 * Subtransactions are always associated to the toplevel TXN, even if
 		 * they originally were happening inside another subtxn, so we won't
-- 
1.8.3.1

v7-0012-Add-TAP-test-for-streaming-vs.-DDL.patch
From ed4c3d83ea72c05f949772d6b379838536bb14df Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v7 12/12] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

#197Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dilip Kumar (#196)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jan 14, 2020 at 10:56:37AM +0530, Dilip Kumar wrote:

On Sat, Jan 11, 2020 at 3:07 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2020-Jan-10, Alvaro Herrera wrote:

Here's a rebase of this patch series. I didn't change anything except

... this time with attachments ...

The patch set fails to apply on HEAD, so I have rebased it. (Rebased on
commit cebf9d6e6ee13cbf9f1a91ec633cf96780ffc985)

I've noticed the patch has been in WoA (Waiting on Author) state since
2019/12/01, but there's been quite a lot of traffic on this thread and a
bunch of new patch versions. So I've switched it to "needs review" - if
that's not the right status, let me know.

Also, the patch was moved forward mostly by Amit and Dilip, so I've
added them as authors in the CF app (well, what matters is the commit
message, of course, but let's keep this up to date too).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#198Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#195)
13 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 13, 2020 at 3:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The problem is that when we
get toasted chunks, we remember the changes in memory (in a hash table)
but don't stream them until we get the actual change on the main table.
Now, the problem is that we might get the changes for the toast table
and the main table in different streams. So basically, within a stream,
if we have only got the toasted tuples, then even after
ReorderBufferStreamTXN the memory usage will not be reduced.

I think we can't split such changes across different streams (unless we
design an entirely new solution to send partial changes of toast
data), so we need to send them together. We can keep a flag like
data_complete in ReorderBufferTxn and mark it complete only when we
are able to assemble the entire tuple. Now, whenever we try to
stream the changes once we reach the memory threshold, we can check
whether the data_complete flag is true

Here, we can also consider streaming the changes when data_complete is
false but some additional changes have been added to the same txn, as
the new changes might complete the tuple.

, and if so, only then send the
changes; otherwise, we can pick the next largest transaction. I think
we can retry a few times, and if we get incomplete data for
multiple transactions, then we can decide to spill the transaction, or
maybe we can directly spill the first largest transaction which has
incomplete data.

Yeah, we might do something along this line. Basically, we need to mark
the top-transaction as data-incomplete if any of its subtransactions
has incomplete data (it will always be the latest sub-transaction
of the top transaction). Also, for streaming, we are checking the
largest top transaction, whereas for spilling we just need the largest
(sub)transaction. So, while picking the largest top transaction for
streaming, we also need to decide how we will handle the spill if we
get a few transactions with incomplete data. Do we spill all the
sub-transactions under this top transaction, or do we again find the
largest (sub)transaction for spilling?

I think it is better to do the latter, as that will spill only the
required changes (the minimum needed to get the memory below the
threshold).

I think instead of doing this, can't we just spill the changes which
are in the toast_hash? Basically, at the end of the stream, if we have
some toast tuples which we could not stream because we did not have the
insert for the main table, then we can spill only those changes which
are in the toast hash.

Hmm, I think this can turn out to be inefficient because we can easily
end up spilling the data even when we don't need to do so. Consider
cases where part of the streamed changes are for toast, and the
remaining are changes which we would have streamed and hence can be
removed. In such cases, we could have easily consumed the remaining
changes for toast without spilling. Also, I am not sure if spilling
changes from the hash table is a good idea, as they are no longer in
the same order as they were in the ReorderBuffer, which means the order
in which we serialize the changes would differ from normal and that
might have some impact, so we would need some more study if we want to
pursue this idea.

I have fixed this bug and attached it as a separate patch. I will
merge it into the main patch after we agree on the idea and after some
more testing.

The idea is that whenever we get a toasted chunk, instead of directly
inserting it into the toast hash, I insert it into a local list, so
that if we don't get the change for the main table we can insert these
changes back into txn->changes. Once we get the change for the main
table, I prepare the hash table to merge the chunks. If the stream is
over and we haven't got the changes for the main table, we mark the txn
as having pending toast changes, so that next time we will not pick the
same transaction for streaming. This flag is cleared whenever we get
any further change for the txn (insert or update). There is also a
possibility that even after we stream the changes, rb->size is not
below logical_decoding_work_mem, because we could not stream all the
changes; to handle this, after streaming we recheck the size, and if it
is still not under control then we pick another transaction. In some
cases, we might not get any transaction to stream because every
candidate has the pending toast change flag set; in this case, we
will go for the spill.
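
To make the control flow concrete, here is a rough sketch of the
streaming loop with this approach. This is a sketch only; names like
has_pending_toast and ReorderBufferLargestTopTXN are illustrative and
may differ from what the attached patch actually uses:

	/*
	 * Sketch: stream transactions until memory drops below the
	 * limit, falling back to spilling when the only candidates
	 * still have unresolved toast chunks.
	 */
	while (rb->size >= logical_decoding_work_mem * 1024L)
	{
		ReorderBufferTXN *txn = ReorderBufferLargestTopTXN(rb);

		/* no streamable candidate => spill the largest (sub)transaction */
		if (txn == NULL || txn->has_pending_toast)
		{
			ReorderBufferSerializeTXN(rb, ReorderBufferLargestTXN(rb));
			break;
		}

		/*
		 * Stream what we can; toast chunks without the main-table
		 * change stay behind, so rb->size may still be over the
		 * limit and the loop will try another transaction.
		 */
		ReorderBufferStreamTXN(rb, txn);
	}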

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v8-0003-Extend-the-output-plugin-API-with-stream-methods.patch
From e8bbd48b3d309041b1205e46aa697ac0c72c5f46 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v8 03/13] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes for large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index cd105d9..a3efbfd 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -542,3 +571,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db9686..ace21ec 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname> setting.
+    At that point the largest toplevel transaction (measured by amount of memory
+    currently used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bdf4389..9c95fc1 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require change/commit/abort callbacks. The
+	 * message callback is optional, similarly to regular output plugins. We
+	 * however enable streaming when at least one of the methods is enabled,
+	 * so that we can easily identify missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -863,6 +911,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f..f24e246 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..0d0a94a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9a3f045..15bb5ed 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -356,6 +356,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -395,6 +441,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
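
For illustration, here is a minimal sketch of how an output plugin would
wire up the new streaming callbacks shown above, assuming the usual
_PG_output_plugin_init() entry point (the my_* handlers are hypothetical;
only the struct fields come from the patch):

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* existing callbacks */
	cb->startup_cb = my_startup;
	cb->begin_cb = my_begin;
	cb->change_cb = my_change;
	cb->commit_cb = my_commit;
	cb->shutdown_cb = my_shutdown;

	/* new streaming callbacks */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_change_cb = my_stream_change;
	cb->stream_message_cb = my_stream_message;
	cb->stream_truncate_cb = my_stream_truncate;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
}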

Attachment: v8-0002-Issue-individual-invalidations-with-wal_level-log.patch (application/octet-stream)
From 65adaeebc3127ae732a0927e22768594d0ee0d12 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH v8 02/13] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager.
See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in
memory and writes them only once, at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c          | 50 +++++++++++++++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 23 ++++++++
 src/backend/replication/logical/reorderbuffer.c | 55 +++++++++++++++---
 src/backend/utils/cache/inval.c                 | 75 +++++++++++++++++++++++++
 src/include/access/xact.h                       | 18 +++++-
 src/include/replication/reorderbuffer.h         | 14 +++++
 7 files changed, 234 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index e388cc7..6e46d19 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,11 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -397,6 +402,14 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs,
+								xlrec->dbId, xlrec->tsId,
+								xlrec->relcacheInitFileInval);
+	}
 }
 
 const char *
@@ -424,7 +437,44 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval)
+{
+	int			i;
+
+	if (relcacheInitFileInval)
+		appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
+						 dbId, tsId);
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 51557e2..dd3d36f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6001,6 +6001,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX We ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index a99fcaf..13a11ac 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,29 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/* XXX for now we're issuing invalidations one by one */
+				Assert(invals->nmsgs == 1);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->dbId, invals->tsId,
+											 invals->relcacheInitFileInval,
+											 invals->msgs[0]);
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 73ca4f7..78443e2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -473,6 +473,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -1822,17 +1823,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2212,6 +2218,38 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn,
+							 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+							 SharedInvalidationMessage msg)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without a valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.dbId = dbId;
+	change->data.inval.tsId = tsId;
+	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
+	change->data.inval.msg = msg;
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2658,6 +2696,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -2765,6 +2804,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -3050,6 +3090,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..e0d04b9 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, we write individual invalidations into WAL to
+ *	support decoding of in-progress transactions.  Until now it was enough
+ *	to log invalidations only at commit time, because we only decoded
+ *	transactions once committed.  We only need to log catalog cache and
+ *	relcache invalidations; there cannot be any active MVCC scan in logical
+ *	decoding, so snapshot invalidations need not be logged.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,9 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -489,6 +499,18 @@ RegisterCatcacheInvalidation(int cacheId,
 {
 	AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   cacheId, hashValue, dbId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cc.id = (int8) cacheId;
+		msg.cc.dbId = dbId;
+		msg.cc.hashValue = hashValue;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -501,6 +523,18 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 {
 	AddCatalogInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								  dbId, catId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cat.id = SHAREDINVALCATALOG_ID;
+		msg.cat.dbId = dbId;
+		msg.cat.catId = catId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -511,6 +545,8 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 static void
 RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 {
+	bool		RelcacheInitFileInval = false;
+
 	AddRelcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
 
@@ -529,7 +565,22 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 	 * as well.  Also zap when we are invalidating whole relcache.
 	 */
 	if (relId == InvalidOid || RelationIdIsInInitFile(relId))
+	{
 		transInvalInfo->RelcacheInitFileInval = true;
+		RelcacheInitFileInval = true;
+	}
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.rc.id = SHAREDINVALRELCACHE_ID;
+		msg.rc.dbId = dbId;
+		msg.rc.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, RelcacheInitFileInval);
+	}
 }
 
 /*
@@ -1501,3 +1552,27 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval)
+{
+	xl_xact_invalidations xlrec;
+
+	/* prepare record */
+	memset(&xlrec, 0, sizeof(xlrec));
+	xlrec.dbId = MyDatabaseId;
+	xlrec.tsId = MyDatabaseTableSpace;
+	xlrec.relcacheInitFileInval = relcacheInitFileInval;
+	xlrec.nmsgs = nmsgs;
+
+	/* perform insertion */
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+	XLogRegisterData((char *) msgs,
+					 nmsgs * sizeof(SharedInvalidationMessage));
+	XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b38..6f2a583 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,22 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ *
+ * XXX Currently nmsgs=1 but that might change in the future.
+ */
+typedef struct xl_xact_invalidations
+{
+	Oid			dbId;			/* MyDatabaseId */
+	Oid			tsId;			/* MyDatabaseTableSpace */
+	bool		relcacheInitFileInval;	/* invalidate relcache init file */
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..9a3f045 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,16 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			Oid			dbId;	/* MyDatabaseId */
+			Oid			tsId;	/* MyDatabaseTableSpace */
+			bool		relcacheInitFileInval;	/* invalidate relcache init
+												 * file */
+			SharedInvalidationMessage msg;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +470,9 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  Oid dbId, Oid tsId, bool relcacheInitFileInval,
+								  SharedInvalidationMessage msg);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1
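
Regarding the possible command-level caching mentioned at the end of the
commit message, a rough sketch of what batching might look like
(hypothetical, not part of the patch; the decode-side
Assert(invals->nmsgs == 1) would need to be relaxed, and something like
CommandEndInvalidationMessages() would have to flush the queue):

#define MAX_PENDING_INVAL_MSGS	32

static SharedInvalidationMessage pendingMsgs[MAX_PENDING_INVAL_MSGS];
static int	nPendingMsgs = 0;
static bool pendingRelcacheInitFileInval = false;

/* write one XLOG_XACT_INVALIDATIONS record for all queued messages */
static void
FlushPendingLogicalInvalidations(void)
{
	if (nPendingMsgs == 0)
		return;

	LogLogicalInvalidations(nPendingMsgs, pendingMsgs,
							pendingRelcacheInitFileInval);
	nPendingMsgs = 0;
	pendingRelcacheInitFileInval = false;
}

/* queue a message instead of logging it immediately */
static void
QueueLogicalInvalidation(const SharedInvalidationMessage *msg,
						 bool relcacheInitFileInval)
{
	if (nPendingMsgs >= MAX_PENDING_INVAL_MSGS)
		FlushPendingLogicalInvalidations();

	pendingMsgs[nPendingMsgs++] = *msg;
	pendingRelcacheInitFileInval |= relcacheInitFileInval;
}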

Attachment: v8-0004-Gracefully-handle-concurrent-aborts-of-uncommitte.patch (application/octet-stream)
From 3134014b6fe09daafbbc52b39dead9332683f4b0 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:32:34 +0530
Subject: [PATCH v8 04/13] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such a sqlerrcode,
the decoding logic aborts the ongoing decoding and returns
gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 41 +++++++++++++++++++++
 src/backend/access/index/genam.c                | 49 +++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  8 ++--
 src/backend/utils/time/snapmgr.c                | 25 ++++++++++++-
 src/include/utils/snapmgr.h                     |  4 +-
 6 files changed, 124 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index ace21ec..319349a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5ddb6e8..b6c95c3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1303,6 +1303,15 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with a valid
+	 * CheckXidAlive for regular (non-catalog) tables; error out if so.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1422,6 +1431,14 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with a valid
+	 * CheckXidAlive for regular (non-catalog) tables; error out if so.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1535,6 +1552,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with a valid
+	 * CheckXidAlive for regular (non-catalog) tables; error out if so.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_hot_search_buffer call during logical decoding");
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1683,6 +1708,14 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with a valid
+	 * CheckXidAlive for regular (non-catalog) tables; error out if so.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_get_latest_tid call during logical decoding");
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5481,6 +5514,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with a valid
+	 * CheckXidAlive for regular (non-catalog) tables; error out if so.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_finish_speculative call during logical decoding");
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05..5644b8d 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,22 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, check whether that transaction aborted,
+	 * and error out if so.  Instead of checking the abort status directly,
+	 * we check that the transaction is neither in progress nor committed:
+	 * after a system crash, the status of transactions running at that
+	 * time may never have been recorded, so they must be considered
+	 * aborted.  See the detailed comments in snapmgr.c, where the variable
+	 * is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +530,22 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, check whether that transaction aborted,
+	 * and error out if so.  Instead of checking the abort status directly,
+	 * we check that the transaction is neither in progress nor committed:
+	 * after a system crash, the status of transactions running at that
+	 * time may never have been recorded, so they must be considered
+	 * aborted.  See the detailed comments in snapmgr.c, where the variable
+	 * is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +672,22 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, check whether that transaction aborted,
+	 * and error out if so.  Instead of checking the abort status directly,
+	 * we check that the transaction is neither in progress nor committed:
+	 * after a system crash, the status of transactions running at that
+	 * time may never have been recorded, so they must be considered
+	 * aborted.  See the detailed comments in snapmgr.c, where the variable
+	 * is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 78443e2..e7aa004 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -692,7 +692,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1551,7 +1551,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1802,7 +1802,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +1822,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c5..93a0c04 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check whether it is uncommitted and track
+ * it in CheckXidAlive, so that XID status can be re-checked during catalog
+ * access.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't
+	 * check whether the xid aborted; that happens during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13c..12f737b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1
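
Since the same check (and comment) is repeated three times in genam.c, it
might be worth factoring it into a small helper along these lines
(hypothetical name, not in the patch):

static inline void
HandleConcurrentAbort(void)
{
	/*
	 * A transaction that is neither in progress nor committed must be
	 * treated as aborted; after a crash its status may never have been
	 * recorded.
	 */
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}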

Attachment: v8-0005-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From 30d5fdb4efcae52cd00eee4ea979d9c5a22e7b82 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 10 Jan 2020 18:03:27 -0300
Subject: [PATCH v8 05/13] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
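
For reviewers, the callback sequence for a transaction streamed in two
chunks and then committed would be roughly as follows (my reading of the
patch, illustration only):

	stream_start(txn);
	stream_change(txn, relation, change);	/* first chunk */
	...
	stream_stop(txn);

	stream_start(txn);
	stream_change(txn, relation, change);	/* second chunk */
	...
	stream_stop(txn);

	stream_commit(txn, commit_lsn);		/* or stream_abort() on rollback */
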
---
 src/backend/access/heap/heapam_visibility.c     |  38 +-
 src/backend/replication/logical/reorderbuffer.c | 692 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  36 ++
 3 files changed, 673 insertions(+), 93 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..160b167 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e7aa004..c77aabe 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -769,6 +782,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -864,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -987,7 +1035,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1023,6 +1071,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1037,6 +1088,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1320,6 +1374,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1345,8 +1408,93 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1354,9 +1502,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1495,63 +1640,48 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send the data of a transaction (and its subtransactions) to the output
+ * plugin.  If streaming is true, the data is sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * build data to be able to lookup the CommandIds of catalog tuples
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
-	}
-
-	snapshot_now = txn->base_snapshot;
-
-	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1567,15 +1697,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1583,6 +1718,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			if (streaming)
+			{
+				/*
+				 * While streaming an in-progress transaction, the
+				 * (sub)transaction might get aborted concurrently.  If the
+				 * (sub)transaction made catalog updates, we might then decode
+				 * tuples using the wrong catalog version.  To detect such a
+				 * concurrent abort, we set CheckXidAlive to the xid of the
+				 * (sub)transaction this change belongs to.  During catalog
+				 * scans we check the status of that xid, and if it aborted,
+				 * we report a specific error that we can ignore.  We might
+				 * have already streamed some of the changes for the aborted
+				 * (sub)transaction, but that is fine: when we decode the
+				 * abort, we stream an abort message that truncates the
+				 * changes on the subscriber.
+				 */
+				CheckXidAlive = change->txn->xid;
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1592,8 +1756,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1659,7 +1821,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1680,8 +1850,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1699,7 +1867,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1757,7 +1925,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1766,10 +1942,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1800,9 +1982,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +2004,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
 					}
 
 					break;
@@ -1860,14 +2043,41 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last LSN of the stream as the final lsn before
+			 * calling stream_stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+			txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+													  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1885,14 +2095,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction, discard the
+		 * changes that we just streamed and mark the transactions as streamed
+		 * (if they contained changes).  Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,18 +2129,117 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+				/*
+				 * Set the last LSN of the stream as the final lsn before
+				 * calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+
+				FlushErrorState();
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
 /*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -1946,6 +2263,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2339,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2481,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we update the counters of the toplevel
+ * transaction instead - subtransactions can't be streamed individually
+ * anyway, and we only ever pick toplevel transactions for eviction, so
+ * theirs are the only counters that matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2499,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2511,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2561,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2300,6 +2651,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2404,6 +2762,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (we don't update the memory accounting
+ * of subtransactions when streaming is enabled, so their size is always 0),
+ * but we can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2423,15 +2813,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2734,6 +3155,101 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes to stream?
+ * (It may have been streamed right before the commit, in which case the
+ * commit would attempt to stream it again with nothing left to send.)
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that has not been called yet, as the
+	 * transaction is still in progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there's no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We can't use txn->snapshot_now directly, because new
+		 * subtransactions may have been added since the last streaming run,
+		 * and we need to include their XIDs in the snapshot's subxip array.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 15bb5ed..adb8f9d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -192,6 +193,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -227,6 +246,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Have we sent any changes for this transaction in output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -257,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
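
As a side note for reviewers: the accounting rule above (charge subxact
changes to the toplevel transaction when the plugin can stream) is what
makes ReorderBufferLargestTopTXN equivalent to ReorderBufferLargestTXN.
A minimal self-contained sketch of that invariant follows - all the names
here (ToyTxn, ToyBuffer, toy_update, ...) are made up for illustration,
this is not patch code:

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* toy model of a reorder-buffer transaction; illustrative names only */
typedef struct ToyTxn
{
	struct ToyTxn *toptxn;		/* NULL for a toplevel transaction */
	size_t		size;			/* accounted change bytes */
} ToyTxn;

typedef struct ToyBuffer
{
	size_t		size;			/* total accounted bytes */
	int			ntop;			/* number of toplevel transactions */
	ToyTxn	   *top[8];			/* toplevel transactions */
	int			can_stream;		/* plugin supports streaming? */
} ToyBuffer;

/* with streaming, charge the toplevel transaction for subxact changes */
static void
toy_update(ToyBuffer *rb, ToyTxn *txn, size_t sz, int addition)
{
	if (txn->toptxn && rb->can_stream)
		txn = txn->toptxn;

	if (addition)
	{
		txn->size += sz;
		rb->size += sz;
	}
	else
	{
		assert(rb->size >= sz && txn->size >= sz);
		txn->size -= sz;
		rb->size -= sz;
	}
}

/* pick the largest toplevel transaction, as the streaming eviction does */
static ToyTxn *
toy_largest_top(ToyBuffer *rb)
{
	ToyTxn	   *largest = NULL;

	for (int i = 0; i < rb->ntop; i++)
		if (!largest || rb->top[i]->size > largest->size)
			largest = rb->top[i];
	return largest;
}

int
main(void)
{
	ToyTxn		top1 = {NULL, 0};
	ToyTxn		top2 = {NULL, 0};
	ToyTxn		sub1 = {&top1, 0};	/* subxact of top1 */
	ToyBuffer	rb = {0, 2, {&top1, &top2}, 1};

	toy_update(&rb, &sub1, 100, 1);	/* charged to top1, not sub1 */
	toy_update(&rb, &top2, 40, 1);

	/* prints "evict txn of size 100 (total 140)" */
	printf("evict txn of size %zu (total %zu)\n",
		   toy_largest_top(&rb)->size, rb.size);
	return 0;
}

Because subxact sizes stay at 0 under this rule, iterating only over
toplevel_by_lsn is sufficient to find the overall largest transaction.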

v8-0006-Fix-speculative-insert-bug.patch
From d26729956f5ff279b7c25f4ec1f2b671702c7df7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 10 Jan 2020 09:01:35 +0530
Subject: [PATCH v8 06/13] Fix speculative insert bug.

---
 src/backend/replication/logical/reorderbuffer.c | 23 +++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  6 ++++++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c77aabe..070ad1f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1701,6 +1701,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
+		/*
+		 * Restore any speculative insertion saved from a previous run if we
+		 * are running in streaming mode.
+		 */
+		if (streaming && txn->specinsert != NULL)
+		{
+			specinsert = txn->specinsert;
+			txn->specinsert = NULL;
+		}
+
 		if (using_subtxn)
 			BeginInternalSubTransaction("stream");
 		else
@@ -2029,13 +2039,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		}
 
 		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
+		 * In non-streaming mode, if there's a speculative insertion
+		 * remaining, just clean it up - it can't have been successful,
+		 * otherwise we'd have gotten a confirmation record.  In streaming
+		 * mode, remember the tuple so that if we get the confirmation in the
+		 * next stream we can stream it then.
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			if (streaming)
+				txn->specinsert = specinsert;
+			else
+				ReorderBufferReturnChange(rb, specinsert);
 			specinsert = NULL;
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index adb8f9d..2680ac3 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -344,6 +344,12 @@ typedef struct ReorderBufferTXN
 	uint32		ninvalidations;
 	SharedInvalidationMessage *invalidations;
 
+	/*
+	 * Speculative insertion saved from the last streaming run, in case the
+	 * confirmation record was not received in the same stream.
+	 */
+	ReorderBufferChange *specinsert;
+
 	/* ---
 	 * Position in one of three lists:
 	 * * list of subtransactions if we are *known* to be subxact
-- 
1.8.3.1
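
The fix is easiest to see in isolation: in streaming mode an unconfirmed
speculative insertion must survive across streaming runs instead of being
thrown away at the end of each run. A stand-alone sketch of that hand-over
follows (Txn, Change and process_run are made-up names, not the patch's
types):

#include <stddef.h>
#include <stdio.h>

typedef struct Change
{
	const char *what;			/* stand-in for a decoded change */
} Change;

typedef struct Txn
{
	Change	   *specinsert;		/* pending speculative insert, if any */
} Txn;

/*
 * One replay run: restore a speculative insert left over from the
 * previous run, replay changes, and either keep the still-unconfirmed
 * insert for the next run (streaming) or drop it (non-streaming).
 */
static void
process_run(Txn *txn, Change *unconfirmed, int streaming)
{
	Change	   *specinsert = NULL;

	if (streaming && txn->specinsert != NULL)
	{
		specinsert = txn->specinsert;
		txn->specinsert = NULL;
		printf("restored pending insert: %s\n", specinsert->what);
	}

	/*
	 * ... replay the changes; 'unconfirmed' models an insert whose
	 * confirmation record has not arrived within this run ...
	 */
	if (unconfirmed)
		specinsert = unconfirmed;

	if (specinsert)
	{
		if (streaming)
			txn->specinsert = specinsert;	/* keep for the next run */
		/* else: it can't have succeeded, so it would be freed here */
	}
}

int
main(void)
{
	Txn			txn = {NULL};
	Change		ins = {"speculative INSERT"};

	process_run(&txn, &ins, 1);		/* run 1: insert seen, not confirmed */
	process_run(&txn, NULL, 1);		/* run 2: restores it for replay */
	return 0;
}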

v8-0007-Support-logical_decoding_work_mem-set-from-create.patch
From c85c9bf20abf9c5f4037cb5fdbacfc96fea2cb11 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:51:04 +0530
Subject: [PATCH v8 07/13] Support logical_decoding_work_mem set from create
 subscription command

---
 doc/src/sgml/config.sgml                           | 21 +++++++++++
 doc/src/sgml/ref/create_subscription.sgml          | 12 ++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/commands/subscriptioncmds.c            | 44 +++++++++++++++++++---
 .../libpqwalreceiver/libpqwalreceiver.c            |  3 ++
 src/backend/replication/logical/worker.c           |  1 +
 src/backend/replication/pgoutput/pgoutput.c        | 30 ++++++++++++++-
 src/include/catalog/pg_subscription.h              |  3 ++
 src/include/replication/walreceiver.h              |  1 +
 9 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e07dc01..6d1a25b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..91790b0 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index f77a83b..5cd1daa 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9bfe142..c50e854 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -668,10 +689,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -696,6 +720,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -707,7 +738,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -745,7 +777,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -782,7 +814,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..896ddab 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 7a5471f..48b960c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1745,6 +1745,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7525082..536722b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -18,6 +18,7 @@
 #include "replication/logicalproto.h"
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
+#include "utils/guc.h"
 #include "utils/int8.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
@@ -87,11 +88,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +139,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -171,7 +196,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..3394379 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6..4c7acfb 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -169,6 +169,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
1.8.3.1
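
To make the validation rule above easier to eyeball: the pgoutput parsing
rejects anything that does not parse as an integer, or that falls below 64
or above PG_INT32_MAX. A stand-alone sketch of just that step (check_work_mem
is a made-up name; the patch itself uses scanint8() and ereport()):

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* validate a work_mem option value the way parse_output_parameters does */
static int
check_work_mem(const char *value, int *result)
{
	char	   *end;
	long long	parsed = strtoll(value, &end, 10);

	if (*value == '\0' || *end != '\0')
		return 0;				/* "invalid work_mem" */

	if (parsed > INT_MAX || parsed < 64)
		return 0;				/* "work_mem out of range" */

	*result = (int) parsed;
	return 1;
}

int
main(void)
{
	int			wm;

	/* prints "ok" then "rejected" */
	printf("%s\n", check_work_mem("65536", &wm) ? "ok" : "rejected");
	printf("%s\n", check_work_mem("32", &wm) ? "ok" : "rejected");
	return 0;
}

On the subscriber side the option travels with the subscription, e.g.
CREATE SUBSCRIPTION ... WITH (work_mem = 65536), and libpqwalreceiver then
forwards it as the work_mem option of START_REPLICATION, per the hunks above.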

v8-0009-Track-statistics-for-streaming.patch
From 99d8c2c35690e2e802536a116e2f528ebd49bcd3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Jan 2020 09:45:27 +0530
Subject: [PATCH v8 09/13] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0bfd615..512d843 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2004,6 +2004,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to subscriber after
+      memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+      Streaming only works with toplevel transactions (subtransactions can't
+      be streamed independently), so the counter does not get incremented for
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to subscriber.
+      Transactions may get streamed repeatedly, and this counter gets incremented
+      on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c9e75f4..e406856 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -785,7 +785,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 070ad1f..fe4e57c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3260,6 +3264,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count the transaction again if it was already streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 63fc2c7..d7f22ae 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1293,7 +1293,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1314,7 +1314,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that were spilled to disk or
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2357,6 +2358,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3196,7 +3200,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3253,6 +3257,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3276,6 +3283,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3362,6 +3372,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3610,11 +3625,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fcf2a12..75124c8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5173,9 +5173,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 2680ac3..02650c3 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -525,15 +525,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 366828f..3888b0c 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 70e1e2f..9dc739e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1982,9 +1982,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1
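
To make the intended semantics of the three counters explicit: stream_count
and stream_bytes grow on every streaming invocation, while stream_txns only
advances the first time a given toplevel transaction is streamed. A minimal
sketch of that rule (ToyStats and account_stream are illustrative names; the
patch derives the "first time" test from rbtxn_is_streamed()):

#include <stdio.h>

typedef struct ToyStats
{
	long long	streamTxns;		/* distinct toplevel txns streamed */
	long long	streamCount;	/* streaming invocations */
	long long	streamBytes;	/* decoded bytes streamed */
} ToyStats;

/* account one streaming run of a transaction of 'size' bytes */
static void
account_stream(ToyStats *s, long long size, int already_streamed)
{
	s->streamCount += 1;
	s->streamBytes += size;

	/* don't count the same transaction twice */
	if (!already_streamed)
		s->streamTxns += 1;
}

int
main(void)
{
	ToyStats	s = {0, 0, 0};

	account_stream(&s, 1000, 0);	/* first run of transaction A */
	account_stream(&s, 500, 1);		/* second run of transaction A */

	/* prints "txns=1 count=2 bytes=1500" */
	printf("txns=%lld count=%lld bytes=%lld\n",
		   s.streamTxns, s.streamCount, s.streamBytes);
	return 0;
}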

v8-0010-Enable-streaming-for-all-subscription-TAP-tests.patch
From dd06dab3c58def3943d2622047cd03b76cc467ca Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v8 10/13] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71..086d0c7 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e..ad3ed13 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v8-0001-Immediately-WAL-log-assignments.patch
From 003bd22dd11b032cbfecb7cd8785ce5ecbaa60e5 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 09:53:08 +0530
Subject: [PATCH v8 01/13] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we can not
remove the existing XLOG_XACT_ASSIGNMENT wal as that is required
for avoiding overflow in the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 45 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 +++++++++++++--------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 017f03b..51557e2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -5096,6 +5097,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6002,3 +6004,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and the assignment must not have been written to WAL yet */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2fa0a7f..b11b0c2 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* mark the subxact assignment as logged, if the XID was included */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3aa6812..8c281e8 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1165,6 +1165,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1203,6 +1204,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5e1dc8a..a99fcaf 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 record->toplevel_xid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033f..e23892a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f7cc8c4..4805cae 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -147,6 +147,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -280,6 +282,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1
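
A side note on the xloginsert.c hunk above: XLogSetRecordFlags() now
ORs the flag bits in instead of overwriting them, so that the origin
and toplevel-XID flags can coexist on a single record. A hedged sketch
of a caller (rmid, info and the registered data are placeholders):

    XLogRecPtr  recptr;

    XLogBeginInsert();
    XLogRegisterData(data, len);              /* payload of the record */
    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);  /* ORed, not overwritten */
    /* XLogRecordAssemble() may additionally set XLOG_INCLUDE_XID */
    recptr = XLogInsert(rmid, info);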

v8-0008-Add-support-for-streaming-to-built-in-replication.patch (application/octet-stream)
From 574b34383c118b0921000e5ae4db8d73238ed691 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 13 Jan 2020 14:24:39 +0530
Subject: [PATCH v8 08/13] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow carrying additional bits of information
(e.g. the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions, by spilling the data to disk and then
replaying it on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere
to send the data anyway.
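
To illustrate the apply-side spooling, a hedged sketch of the on-disk
record layout (the struct name SpooledChange is ours; the layout follows
stream_write_change in the patch below):

    /*
     * One spooled change in the per-transaction file: a length field
     * (which does not count itself), the action byte, and the message
     * body with the subxact XID already stripped by the apply worker.
     */
    typedef struct SpooledChange
    {
        int         len;    /* size of action + data, excludes this field */
        char        action; /* 'I', 'U', 'D', 'T', 'R', 'Y', ... */
        char        data[FLEXIBLE_ARRAY_MEMBER];
    } SpooledChange;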
---
 doc/src/sgml/ref/alter_subscription.sgml           |    5 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   60 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    8 +-
 src/backend/replication/logical/launcher.c         |    2 +
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  157 ++-
 src/backend/replication/logical/worker.c           | 1031 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  310 +++++-
 src/backend/replication/slotfuncs.c                |    7 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2074 insertions(+), 42 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 6dfb2e4..e1fb907 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 91790b0..d9abf5e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -218,6 +218,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 5cd1daa..1dc486c 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->workmem = subform->subworkmem;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index c50e854..a486bd3 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
 						   bool *refresh, int *logical_wm,
-						   bool *logical_wm_given)
+						   bool *logical_wm_given, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -92,6 +93,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*refresh = true;
 	if (logical_wm)
 		*logical_wm_given = false;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -186,6 +189,26 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 
 			*logical_wm_given = true;
 			*logical_wm = defGetInt32(defel);
+
+			/*
+			 * Check that the value is not smaller than 64kB (which is
+			 * the minimum value for logical_work_mem).
+			 */
+			if (*logical_wm < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("%d is outside the valid range for parameter \"work_mem\" (64 .. 2147483647)",
+								*logical_wm)));
+		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
 		}
 		else
 			ereport(ERROR,
@@ -332,6 +355,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	int			logical_wm;
 	bool		logical_wm_given;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -348,7 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL, &logical_wm, &logical_wm_given);
+							   NULL, &logical_wm, &logical_wm_given,
+							   &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -430,7 +456,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 		values[Anum_pg_subscription_subworkmem - 1] =
 			Int32GetDatum(logical_wm);
 	else
-		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(-1);
+
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
 
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
@@ -691,11 +725,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				int			logical_wm;
 				bool		logical_wm_given;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
 										   NULL, &synchronous_commit, NULL,
-										   &logical_wm, &logical_wm_given);
+										   &logical_wm, &logical_wm_given,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -727,6 +764,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subworkmem - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -739,7 +783,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
 										   NULL, NULL, NULL, NULL, NULL,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -777,7 +821,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh, NULL, NULL);
+										   NULL, &refresh, NULL, NULL,
+										   NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -814,7 +859,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 51c486b..03ef76c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4102,6 +4102,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 896ddab..1b8303c 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,8 +408,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
-		appendStringInfo(&cmd, ", work_mem '%d'",
-						 options->proto.logical.work_mem);
+		if (options->proto.logical.work_mem != -1)
+			appendStringInfo(&cmd, ", work_mem '%d'",
+							 options->proto.logical.work_mem);
+
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
 
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e..e80d00c 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>
 
 #include "postgres.h"
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 9c95fc1..61064f3 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1149,7 +1149,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..93780b2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,13 +257,18 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
+	pq_sendbyte(out, 'D');		/* action DELETE */
+
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
-	pq_sendbyte(out, 'D');		/* action DELETE */
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
 
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,119 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgint(in, 4) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're inside a streamed block, so it must be valid) */
+	pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+	TransactionId xid;
+
+	xid = pq_getmsgint(in, 4);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID of the streamed transaction (must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* toplevel and subtransaction IDs (both must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 48b960c..5c20c0e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, the apply logic also has to
+ * handle aborts of both the toplevel transaction and subtransactions. This
+ * is achieved by tracking offsets for subtransactions, which are then used
+ * to truncate the file with serialized changes.
+ *
+ * The files are placed in the default tablespace's temporary directory,
+ * and the filenames include both the XID of the toplevel transaction and
+ * the OID of the subscription. This is necessary so that different
+ * workers processing a remote transaction with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -60,6 +82,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -67,6 +90,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -106,12 +130,59 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing a streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because apply_handle_stream_commit() calls apply_dispatch() */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -163,6 +234,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a chunk of a streamed transaction), we
+ * simply redirect the message to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -529,6 +636,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, restore the subxact info
+	 * serialized at the previous stream stop.
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive an abort
+		 * for a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts arriving in reverse
+		 * order, i.e. from the inner-most subxact (when nested) first, in
+		 * which case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found PG_USED_FOR_ASSERTS_ONLY = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* FIXME optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/* We should not receive aborts for unknown subtransactions. */
+		Assert(found);
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the
+	 * correct position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -541,6 +960,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -556,6 +978,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -591,6 +1016,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -695,6 +1123,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -830,6 +1261,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -929,6 +1363,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1020,6 +1457,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1117,6 +1570,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1132,6 +1601,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1580,6 +2052,564 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we do free the memory allocated for the subxact info. There
+	 * might be one exceptional transaction with many subxacts, and we
+	 * don't want to keep the memory allocated forever.
+	 *
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen
+	 * in the last call, so we can simply ignore it (it's already tracked).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Clean up the XID from the array - find the XID in the array and
+	 * remove it by moving the last element into its place. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions), so simply loop
+	 * through the array to find the index of the XID, and then fill
+	 * the hole with the last element.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * the stream_stop message arrives from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: length (not including
+ * the length field itself), action code (identifying the message type)
+ * and message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
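(For reference, reading a record back just reverses this framing; a minimal
sketch of a reader - a hypothetical helper, not necessarily the patch's
actual apply-side code, and relying on the surrounding file's includes:

    static bool
    stream_read_change_sketch(int fd, char *action, StringInfo s)
    {
        int     len;

        /* the length covers the action byte and the payload, not itself */
        if (read(fd, &len, sizeof(len)) != sizeof(len))
            return false;       /* EOF */

        if (read(fd, action, sizeof(char)) != sizeof(char))
            return false;

        /* read the remaining payload into the caller's buffer */
        len -= sizeof(char);
        resetStringInfo(s);
        enlargeStringInfo(s, len);
        if (read(fd, s->data, len) != len)
            return false;
        s->len = len;
        s->data[len] = '\0';

        return true;
    }
)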
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -1746,6 +2776,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.work_mem = MySubscription->workmem;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 536722b..ebe0423 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -45,17 +45,45 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag indicates whether the current schema record was
+ * already sent to the subscriber (in which case we don't need to send
+ * it again).
+ *
+ * The schema cache on the downstream side, however, is updated only at
+ * commit time, and with streamed transactions the commit order may differ
+ * from the order the transactions are sent in. So streamed transactions
+ * are tracked separately, by remembering the toplevel XIDs for which the
+ * schema was already sent (the streamed_txns list below).
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
+	List	   *streamed_txns;	/* streamed toplevel transactions with
+								 * this schema */
 	bool		replicate_valid;
 	PublicationActions pubactions;
 } RelationSyncEntry;
@@ -64,11 +92,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -84,16 +118,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, int *logical_decoding_work_mem)
+						List **publication_names, int *logical_decoding_work_mem,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		work_mem_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -162,6 +206,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*logical_decoding_work_mem = (int)parsed;
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
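(For context, a downstream that wants streamed transactions passes the new
option next to the existing ones when starting replication, along the lines
of - illustrative only:

    START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
        (proto_version '2', publication_names '"tap_pub"', streaming 'on')
)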
@@ -174,6 +235,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -197,7 +259,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&logical_decoding_work_mem);
+								&logical_decoding_work_mem,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -217,6 +280,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -284,9 +368,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't
+	 * care whether it's a top-level transaction or not (we have already
+	 * sent the toplevel XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those are applied only later (and the regular
+	 * transactions won't see their effects until then), possibly in an
+	 * order we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change,
+		 * which may happen after streaming has already started, so we
+		 * have to track new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -312,19 +435,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			set_schema_sent_in_streamed_txn(relentry, topxid);
+		else
+			relentry->schema_sent = true;
 	}
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called in both streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -333,6 +463,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -361,14 +495,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -378,7 +512,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -387,7 +521,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -413,6 +547,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -433,13 +571,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -513,6 +652,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed (sub)transaction, along with
+ * all its subtransactions (if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
+ * Send the start of a block of streamed changes for the given toplevel
+ * transaction to the downstream.
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * Send the end of the current block of streamed changes to the downstream.
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
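(Taken together, the stream callbacks produce a flow like this for one large
transaction - a sketch only; the actual interleaving depends on when the
memory limit is hit:

    stream_start(xid)           -- first_segment = true
      ... changes of xid and its subxacts ...
    stream_stop()
    stream_start(xid)           -- first_segment = false
      ... more changes ...
    stream_stop()
    stream_commit(xid)          -- or stream_abort(xid, subxid)
)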
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -549,6 +773,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema was already sent within the given streamed
+ * toplevel transaction. We expect a relatively small number of streamed
+ * transactions, so the linear search is fine.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  */
 static RelationSyncEntry *
@@ -623,6 +875,36 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -657,7 +939,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index bb69683..3085c0f 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,13 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9c06374..63fc2c7 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -969,6 +969,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3394379..18f416f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -50,6 +50,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	int32		subworkmem;		/* Memory to use to decode changes. */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -76,6 +78,7 @@ typedef struct Subscription
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	int			workmem;		/* Memory to decode changes. */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index aecb601..146d7c4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -945,7 +945,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
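(These enum values need matching name strings in pgstat_get_wait_io(); that
hunk is not shown here, but presumably it adds cases roughly like the
following - the event names are assumed, not quoted from the patch:

    case WAIT_EVENT_LOGICAL_CHANGES_READ:
        event_name = "LogicalChangesRead";
        break;
    case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
        event_name = "LogicalChangesWrite";
        break;
    case WAIT_EVENT_LOGICAL_SUBXACT_READ:
        event_name = "LogicalSubxactRead";
        break;
    case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
        event_name = "LogicalSubxactWrite";
        break;
)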
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 2cc2dc4..ade4188 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out, TransactionId xid);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
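(The implementations of these functions live in proto.c, which is not part
of this excerpt. For illustration, stream_start presumably serializes
something like the sketch below; the message type byte and exact layout are
assumptions here, not the patch's definitive format:

    static void
    logicalrep_write_stream_start_sketch(StringInfo out, TransactionId xid,
                                         bool first_segment)
    {
        pq_sendbyte(out, 'S');      /* assumed message type byte */

        Assert(TransactionIdIsValid(xid));
        pq_sendint32(out, xid);     /* toplevel transaction */

        /* 1 if this is the first streamed segment of the transaction */
        pq_sendbyte(out, first_segment ? 1 : 0);
    }
)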
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4c7acfb..54054a4 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -170,6 +170,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			int			work_mem;	/* Memory limit to use for decoding */
+			bool		streaming;	/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
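(The walreceiver side then has to forward this flag as an option string when
starting replication; a sketch of what libpqwalreceiver - not shown in this
excerpt - presumably does, mirroring how the existing options are appended:

    if (options->proto.logical.streaming)
        appendStringInfoString(&cmd, ", streaming 'on'");
)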
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes were not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction containing subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check only committed changes were replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch (application/octet-stream)
From bb44ee6ad6ff0c99a8f759fa7efa5b242bd19898 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH v8 11/13] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fe4e57c..beb6cd2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1327,6 +1327,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
+		/* make sure subtxn has final_lsn */
+		if (subtxn->final_lsn == InvalidXLogRecPtr)
+			subtxn->final_lsn = txn->final_lsn;
+
 		/*
 		 * Subtransactions are always associated to the toplevel TXN, even if
 		 * they originally were happening inside another subtxn, so we won't
-- 
1.8.3.1

Attachment: v8-0012-Add-TAP-test-for-streaming-vs.-DDL.patch (application/octet-stream)
From e32d888e9b1de0c25843eae0d299b7f9e8dc45fb Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v8 12/13] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check data replicated correctly through multiple schema changes');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: v8-0013-Bugfix-handling-of-incomplete-toast-tuple.patch (application/octet-stream)
From 4fd60ba2487e6f57d435cc57449ad2a8b5287632 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 21 Jan 2020 10:57:10 +0530
Subject: [PATCH v8 13/13] Bugfix handling of incomplete toast tuple

---
 contrib/test_decoding/logical.conf              |   1 +
 src/backend/replication/logical/reorderbuffer.c | 159 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |   8 ++
 3 files changed, 152 insertions(+), 16 deletions(-)

diff --git a/contrib/test_decoding/logical.conf b/contrib/test_decoding/logical.conf
index 07c4d3d..f748994 100644
--- a/contrib/test_decoding/logical.conf
+++ b/contrib/test_decoding/logical.conf
@@ -1,3 +1,4 @@
 wal_level = logical
 max_replication_slots = 4
 logical_decoding_work_mem = 64kB
+logging_collector=on
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index beb6cd2..acede78 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -664,6 +664,16 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If we detected a toast chunk without the corresponding main table
+	 * change while sending the previous stream, reset the flag once we get
+	 * any insert/update, so that we can retry streaming this transaction.
+	 */
+	if (txn->incomplte_toast_chunks &&
+		(change->action == REORDER_BUFFER_CHANGE_INSERT ||
+		 change->action == REORDER_BUFFER_CHANGE_UPDATE))
+		txn->incomplte_toast_chunks = false;
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
@@ -1682,6 +1692,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	MemoryContext ccxt = CurrentMemoryContext;
 	ReorderBufferIterTXNState *volatile iterstate = NULL;
 	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	dlist_head	toast_change;
+	Oid			toastrelid = InvalidOid;
 
 	/*
 	 * build data to be able to lookup the CommandIds of catalog tuples
@@ -1691,6 +1703,9 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* setup the initial snapshot */
 	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
+	/* Initialize the local toast change list. */
+	dlist_init(&toast_change);
+
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
 	 * heavyweight locks and such. Thus we need to have enough state around to
@@ -1734,6 +1749,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
+			Relation	toastrel = NULL;
 			Oid			reloid;
 
 			/*
@@ -1838,6 +1854,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					/* user-triggered change */
 					if (!IsToastRelation(relation))
 					{
+						/*
+						 * If we have accumulated toast changes, reassemble
+						 * the toast chunks into the hash table now.
+						 */
+						if (!dlist_is_empty(&toast_change))
+						{
+							dlist_mutable_iter iter;
+
+							Assert(streaming);
+
+							/* open the toast relation. */
+							toastrel = RelationIdGetRelation(toastrelid);
+
+							dlist_foreach_modify(iter, &toast_change)
+							{
+								ReorderBufferChange *change;
+
+								change = dlist_container(ReorderBufferChange,
+														 node, iter.cur);
+								dlist_delete(&change->node);
+								ReorderBufferToastAppendChunk(rb, txn, toastrel,
+															  change);
+							}
+							RelationClose(toastrel);
+						}
+
 						ReorderBufferToastReplace(rb, txn, relation, change);
 						if (streaming)
 						{
@@ -1869,8 +1911,25 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
+
+						/*
+						 * For a streaming run, don't reassemble the chunks
+						 * into the hash table directly. Instead, collect the
+						 * changes in a local list, because we might not get
+						 * the change for the main table in this stream. We
+						 * assemble the chunks once we actually see the main
+						 * table change; otherwise we attach the list back to
+						 * the main changes list.
+						 */
+						if (streaming)
+						{
+							dlist_push_tail(&toast_change, &change->node);
+							toastrelid = reloid;
+						}
+						else
+							ReorderBufferToastAppendChunk(rb, txn, relation,
+														  change);
 					}
 
 			change_done:
@@ -2125,7 +2184,36 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * deallocate the ReorderBufferTXN.
 		 */
 		if (streaming)
+		{
 			ReorderBufferTruncateTXN(rb, txn);
+
+			/*
+			 * If we could not stream the toast chunks, append them back
+			 * to the main txn changes list.
+			 */
+			if (!dlist_is_empty(&toast_change))
+			{
+				dlist_mutable_iter iter;
+
+				Assert(streaming);
+				dlist_foreach_modify(iter, &toast_change)
+				{
+					ReorderBufferChange *change;
+
+					change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+					/*
+					 * Remove from temp list and add it back to the txn changes
+					 * list.
+					 */
+					dlist_delete(&change->node);
+					dlist_push_tail(&txn->changes, &change->node);
+					txn->nentries_mem++;
+					txn->nentries++;
+				}
+				txn->incomplete_toast_chunks = true;
+			}
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2177,6 +2265,33 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				rb->stream_stop(rb, txn);
 
 				FlushErrorState();
+
+				/*
+				 * If we could not stream the toast chunks, append them
+				 * back to the main txn changes list.
+				 */
+				if (!dlist_is_empty(&toast_change))
+				{
+					dlist_mutable_iter iter;
+
+					Assert(streaming);
+					dlist_foreach_modify(iter, &toast_change)
+					{
+						ReorderBufferChange *change;
+
+						change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+						/*
+						 * Remove from temp list and add it back to the txn
+						 * changes list.
+						 */
+						dlist_delete(&change->node);
+						dlist_push_tail(&txn->changes, &change->node);
+						txn->nentries_mem++;
+						txn->nentries++;
+					}
+					txn->incomplete_toast_chunks = true;
+				}
 			}
 		}
 		else
@@ -2523,6 +2638,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2538,7 +2654,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 	/* if subxact, and streaming supported, use the toplevel instead */
 	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+		toptxn = txn->toptxn;
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2546,12 +2662,16 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+		if (toptxn)
+			toptxn->size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+		if (toptxn)
+			toptxn->size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2805,14 +2925,11 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
 		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		if (((!largest) || (txn->size > largest->size)) &&
+			(!txn->incomplete_toast_chunks) && (txn->size > 0))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2834,11 +2951,13 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	if (rb->size < logical_decoding_work_mem * 1024L)
 		return;
 
+retry:
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
 	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	if (ReorderBufferCanStream(rb))
+	if (ReorderBufferCanStream(rb) &&
+		(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
 	{
 		/*
 		 * Pick the largest toplevel transaction and evict it from memory by
@@ -2877,8 +2996,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	 * streaming. But for streaming we should really check nentries_mem for
 	 * all subtransactions too.
 	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
+	Assert(txn->incomplete_toast_chunks || txn->specinsert || txn->size == 0);
+	Assert(txn->incomplete_toast_chunks || txn->nentries_mem == 0);
 
 	/*
 	 * And furthermore, evicting the transaction should get us below the
@@ -2890,7 +3009,16 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	 * the memory limit). So by evicting it we're definitely back below the
 	 * memory limit.
 	 */
-	Assert(rb->size < logical_decoding_work_mem * 1024L);
+	Assert(txn->incomplete_toast_chunks ||
+		   rb->size < logical_decoding_work_mem * 1024L);
+
+	/*
+	 * For a streaming transaction, it's possible that we are not yet below
+	 * the memory limit due to an incomplete toast tuple, so we retry with
+	 * some other transaction.
+	 */
+	if (rb->size >= logical_decoding_work_mem * 1024L)
+		goto retry;
 }
 
 /*
@@ -3277,9 +3405,8 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/* Don't consider already streamed transaction. */
 	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
 
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
+	Assert(txn->incomplete_toast_chunks || txn->nentries == 0);
+	Assert(txn->incomplete_toast_chunks || txn->nentries_mem == 0);
 }
 
 /*
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 02650c3..1348cde 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -251,6 +251,14 @@ typedef struct ReorderBufferTXN
 	bool		any_data_sent;
 
 	/*
+	 * While sending the last stream we found that the transaction has some
+	 * toast chunks but we haven't yet got the change for the main table, so
+	 * we could not stream it.  Don't try to stream it again until we get a
+	 * new change for the transaction.
+	 */
+	bool		incomplete_toast_chunks;
+
+	/*
 	 * Toplevel transaction for this subxact (NULL for top-level).
 	 */
 	struct ReorderBufferTXN *toptxn;
-- 
1.8.3.1

#199Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#176)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Update on the open items

As per my understanding, apart from the above comments, the known
pending work for this patchset is as follows:
a. The two open items agreed with you in the email [3]. -> The first part is done and the second part is an improvement, not a bugfix. I will try to work on this part in the next patch set.
b. Complete the handling of schema_sent as discussed above [4]. -> Done
c. Few comments by Vignesh and the response on the same by me [5][6]. -> Done
d. WAL overhead and performance testing for additional WAL logging by
this patchset. -> Pending
e. Some way to see the tuple for streamed transactions by decoding API
as speculated by you [7]. -> Pending

f. Bug in the toast table handling -> Submitted as a separate POC
patch, which can be merged to the main after review and more testing.

[3] - /messages/by-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb=FMPpr9_hEB7hozQ-Q@mail.gmail.com
[4] - /messages/by-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV+ZcGb3BH6U3x2uxew@mail.gmail.com
[5] - /messages/by-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA@mail.gmail.com
[6] - /messages/by-id/CAA4eK1+ZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ@mail.gmail.com
[7] - /messages/by-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#200Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#199)
2 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

I looked at this patchset and it seemed natural to apply 0008 next
(adding work_mem to subscriptions). Attached is Dilip's latest version,
plus my review changes. This will break the patch tester's logic; sorry
about that.

Which part of this change actually sets the process's
logical_decoding_work_mem to the given value? I was unable to figure
that out. Is it missing or am I just stupid?

Changes:
* the patch adds logical_decoding_work_mem SGML, but that has already
been applied (cec2edfa7859); remove dupe.

* parse_subscription_options() comment says that it will raise an error if a
caller does not pass the pointer for an option but the option list
specifies that option. It does not really implement that behavior (an
existing problem): instead, if the pointer is not passed, the option
is ignored. Moreover, this new patch continued to fail to handle
things as the comment says. I decided to implement the documented
behavior instead; it's now inconsistent with how the other options are
implemented. I think we should fix the other options to behave as the
comment says, because it's a more convenient API; if we instead opted
to update the code comment to match the code, each caller would have
to be checked to verify that the correct options are passed, which is
pointless and error prone.

* the parse_subscription_options API is a mess. I reordered the
arguments a little bit; also changed the argument layout in callers so
that each caller is grouped more sensibly. Also added comments to
simplify reading the argument lists. I think this could be fixed by
using an ad-hoc struct to pass in and out. Didn't get around to doing
that, seems an unrelated potential improvement.

* trying to do own range checking in pgoutput and subscriptioncmds.c
seems pointless and likely to get out of sync with guc.c. Simpler is
to call set_config_option() to verify that the argument is in range.
(Note a further problem in the patch series: the range check in
subscriptioncmds.c is only added in patch 0009).

* parsing integers using scanint8() seemed weird (error messages there
do not correspond to what we want). After a couple of false starts, I
decided to rely on guc.c's set_config_option() followed by parse_int().
That also has the benefit that you can give it units.
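
For illustration, a minimal sketch of that combination (standalone, not
part of the attached patches; it assumes the guc.h declarations are in
scope):

	int			result_kb;

	/* Validate the value against the GUC's constraints without setting it. */
	(void) set_config_option("logical_decoding_work_mem", "64MB",
							 PGC_BACKEND, PGC_S_TEST, GUC_ACTION_SET,
							 false, 0, false);

	/* parse_int() then normalizes any unit suffix into the base unit (kB). */
	if (!parse_int("64MB", &result_kb, GUC_UNIT_KB, NULL))
		elog(ERROR, "parse_int failed");	/* shouldn't happen after the test */
	/* result_kb is now 65536, i.e. 64MB expressed in kB */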

* psql \dRs+ should display the work_mem; patch failed to do that.
Added. Unit display is done by pg_size_pretty(), which might be
different from what guc.c does, but I think it works OK.
It's the first place where we use pg_size_pretty to show a memory
limit, however.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Dilip-s-original.patch (text/x-diff; charset=us-ascii)
From a31b4ebd90dd7a4c94a35f2b3452258078c30e37 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 22 Jan 2020 12:44:13 -0300
Subject: [PATCH 1/2] Dilip's original

---
 doc/src/sgml/config.sgml                      | 21 +++++++++
 doc/src/sgml/ref/create_subscription.sgml     | 12 +++++
 src/backend/catalog/pg_subscription.c         |  1 +
 src/backend/commands/subscriptioncmds.c       | 44 ++++++++++++++++---
 .../libpqwalreceiver/libpqwalreceiver.c       |  3 ++
 src/backend/replication/logical/worker.c      |  1 +
 src/backend/replication/pgoutput/pgoutput.c   | 30 ++++++++++++-
 src/include/catalog/pg_subscription.h         |  3 ++
 src/include/replication/walreceiver.h         |  1 +
 9 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3ccacd528b..163cc77d1d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..91790b0c95 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index f77a83bd2e..5cd1daa238 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9bfe142ada..c50e854e96 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -668,10 +689,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -696,6 +720,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -707,7 +738,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -745,7 +777,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -782,7 +814,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..896ddab2b1 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 7a5471f95c..48b960c4c9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1745,6 +1745,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 752508213a..536722b32f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -18,6 +18,7 @@
 #include "replication/logicalproto.h"
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
+#include "utils/guc.h"
 #include "utils/int8.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
@@ -87,11 +88,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +139,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -171,7 +196,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..3394379f86 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6548..4c7acfb7d3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -169,6 +169,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
2.20.1

0002-Changes-by-Álvaro.patch (text/x-diff; charset=us-ascii)
From 848ad7383cede7600ae3fca07440e3f2441ac934 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 22 Jan 2020 12:51:28 -0300
Subject: [PATCH 2/2] =?UTF-8?q?Changes=20by=20=C3=81lvaro?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 doc/src/sgml/config.sgml                    | 21 -------
 src/backend/commands/subscriptioncmds.c     | 62 +++++++++++++--------
 src/backend/replication/pgoutput/pgoutput.c | 23 +++-----
 src/bin/psql/describe.c                     |  4 +-
 src/include/catalog/pg_subscription.h       |  1 -
 5 files changed, 52 insertions(+), 59 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 163cc77d1d..3ccacd528b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,27 +1751,6 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
-      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
-      <indexterm>
-       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Specifies the maximum amount of memory to be used by logical decoding,
-        before some of the decoded changes are written to local disk.
-        This limits the amount of memory used by logical streaming replication
-        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
-        Since each replication connection only uses a single buffer of this size,
-        and an installation normally doesn't have many such connections
-        concurrently (as limited by <varname>max_wal_senders</varname>), it's
-        safe to set this value significantly higher than <varname>work_mem</varname>,
-        reducing the amount of decoded changes written to disk.
-       </para>
-      </listitem>
-     </varlistentry>
-
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index c50e854e96..7920e75bfa 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -54,12 +54,13 @@ static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
  * accommodate that.
  */
 static void
-parse_subscription_options(List *options, bool *connect, bool *enabled_given,
-						   bool *enabled, bool *create_slot,
+parse_subscription_options(List *options, bool *connect,
+						   bool *enabled_given, bool *enabled,
+						   bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
+						   bool *logical_wm_given, int *logical_wm,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh, int *logical_wm,
-						   bool *logical_wm_given)
+						   bool *refresh)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -177,15 +178,25 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
-		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		else if (strcmp(defel->defname, "work_mem") == 0)
 		{
+			if (!logical_wm)
+				elog(ERROR, "option \"work_mem\" not valid in this context");
+
 			if (*logical_wm_given)
 				ereport(ERROR,
 						(errcode(ERRCODE_SYNTAX_ERROR),
 						 errmsg("conflicting or redundant options")));
 
+			/* Test if the value is valid for logical_decoding_work_mem */
+			(void) set_config_option("logical_decoding_work_mem", defGetString(defel),
+									 PGC_BACKEND, PGC_S_TEST, GUC_ACTION_SET,
+									 false, 0, false);
+			if (!parse_int(defGetString(defel), logical_wm,
+						   GUC_UNIT_KB, NULL))
+				elog(ERROR, "parse_int failed");	/* shouldn't happen */
 			*logical_wm_given = true;
-			*logical_wm = defGetInt32(defel);
+
 		}
 		else
 			ereport(ERROR,
@@ -345,10 +356,11 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	 *
 	 * Connection and publication should not be specified here.
 	 */
-	parse_subscription_options(stmt->options, &connect, &enabled_given,
-							   &enabled, &create_slot, &slotname_given,
-							   &slotname, &copy_data, &synchronous_commit,
-							   NULL, &logical_wm, &logical_wm_given);
+	parse_subscription_options(stmt->options, &connect,
+							   &enabled_given, &enabled,
+							   &create_slot, &slotname_given, &slotname,
+							   &logical_wm_given, &logical_wm,
+							   &copy_data, &synchronous_commit, NULL);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -692,10 +704,11 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				int			logical_wm;
 				bool		logical_wm_given;
 
-				parse_subscription_options(stmt->options, NULL, NULL, NULL,
+				parse_subscription_options(stmt->options, NULL,
+										   NULL, NULL,	/* enabled */
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL,
-										   &logical_wm, &logical_wm_given);
+										   &logical_wm_given, &logical_wm,
+										   NULL, &synchronous_commit, NULL);
 
 				if (slotname_given)
 				{
@@ -737,9 +750,10 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 							enabled_given;
 
 				parse_subscription_options(stmt->options, NULL,
-										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL,
-										   NULL, NULL);
+										   &enabled_given, &enabled,
+										   NULL, NULL, NULL,	/* slot */
+										   NULL, NULL,	/* logical wm */
+										   NULL, NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -775,9 +789,11 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				bool		copy_data;
 				bool		refresh;
 
-				parse_subscription_options(stmt->options, NULL, NULL, NULL,
-										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh, NULL, NULL);
+				parse_subscription_options(stmt->options, NULL,
+										   NULL, NULL,	/* enabled */
+										   NULL, NULL, NULL,	/* slot */
+										   NULL, NULL,	/* logical wm */
+										   &copy_data, NULL, &refresh);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -812,9 +828,11 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 							(errcode(ERRCODE_SYNTAX_ERROR),
 							 errmsg("ALTER SUBSCRIPTION ... REFRESH is not allowed for disabled subscriptions")));
 
-				parse_subscription_options(stmt->options, NULL, NULL, NULL,
-										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL, NULL, NULL);
+				parse_subscription_options(stmt->options, NULL,
+										   NULL, NULL,	/* enabled */
+										   NULL, NULL, NULL,	/* slot */
+										   NULL, NULL,	/* logical wm */
+										   &copy_data, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 536722b32f..d243d90821 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -13,6 +13,7 @@
 #include "postgres.h"
 
 #include "catalog/pg_publication.h"
+#include "commands/defrem.h"
 #include "fmgr.h"
 #include "replication/logical.h"
 #include "replication/logicalproto.h"
@@ -141,26 +142,20 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 		}
 		else if (strcmp(defel->defname, "work_mem") == 0)
 		{
-			int64	parsed;
-
 			if (work_mem_given)
 				ereport(ERROR,
 						(errcode(ERRCODE_SYNTAX_ERROR),
 						 errmsg("conflicting or redundant options")));
 			work_mem_given = true;
+			/* Test if the value is valid for logical_decoding_work_mem */
+			(void) set_config_option("logical_decoding_work_mem", defGetString(defel),
+									 PGC_BACKEND, PGC_S_TEST, GUC_ACTION_SET,
+									 false, 0, false);
 
-			if (!scanint8(strVal(defel->arg), true, &parsed))
-				ereport(ERROR,
-						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-						 errmsg("invalid work_mem")));
-
-			if (parsed > PG_INT32_MAX || parsed < 64)
-				ereport(ERROR,
-						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-						 errmsg("work_mem \"%s\" out of range",
-								strVal(defel->arg))));
-
-			*logical_decoding_work_mem = (int)parsed;
+			/* by here it must be valid, so this shouldn't fail */
+			if (!parse_int(defGetString(defel), logical_decoding_work_mem,
+						   GUC_UNIT_KB, NULL))
+				elog(ERROR, "parse_int failed");	/* shouldn't happen */
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index f3c7eb96fa..956ad41f56 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5933,7 +5933,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false};
+	false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -5961,8 +5961,10 @@ describeSubscriptions(const char *pattern, bool verbose)
 	{
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
+						  ",  pg_catalog.pg_size_pretty(subworkmem::bigint * 1024) AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
 						  gettext_noop("Synchronous commit"),
+						  gettext_noop("Working Memory"),
 						  gettext_noop("Conninfo"));
 	}
 
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3394379f86..eef585b0e5 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -47,7 +47,6 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
-
 	int32		subworkmem;		/* Memory to use to decode changes. */
 
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
-- 
2.20.1

#201Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#200)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jan 22, 2020 at 10:07 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

I looked at this patchset and it seemed natural to apply 0008 next
(adding work_mem to subscriptions).

I am not so sure whether we need this patch, as the exact scenario
where it can help is not very clear to me, and neither did anyone
explain it. I have raised this concern earlier as well [1]. The point
is that 'logical_decoding_work_mem' applies to the entire
ReorderBuffer on the publisher's side, so how will a parameter from a
particular subscription help with that?

[1]: /messages/by-id/CAA4eK1J+3kab6RSZrgj0YiQV1r+H3FWVaNjKhWvpEe5-bpZiBw@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#202Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#198)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Hmm, I think this can turn out to be inefficient because we can easily
end up spilling the data even when we don't need to do so. Consider
cases where part of the streamed changes are for toast, and the rest
are changes which we would have streamed and hence can be removed.
In such cases, we could have easily consumed the remaining changes for
toast without spilling. Also, I am not sure if spilling changes from
the hash table is a good idea, as they are no longer in the same order as
they were in the ReorderBuffer, which means the order in which we serialize
the changes normally would change, and that might have some impact, so
we would need some more study if we want to pursue this idea.

I have fixed this bug and attached it as a separate patch. I will
merge it to the main patch after we agree with the idea and after some
more testing.

The idea is that whenever we get a toast chunk, instead of directly
inserting it into the toast hash, I insert it into a local list, so
that if we don't get the change for the main table we can insert these
changes back into txn->changes. Once we do get the change for the main
table, I prepare the hash table to merge the chunks.

I think this idea will work but appears to be quite costly, because (a)
you might need to serialize/deserialize the changes multiple times and
might attempt streaming multiple times even though you can't do so, and (b)
you need to remove/add the same set of changes from the main list
multiple times.

It seems to me that we need to add all of this new handling because,
while taking the decision whether to stream or not, we don't know
whether the txn has changes that can't be streamed. One idea to make
it work is that we identify it while decoding the WAL. I think we
need to set a bit in the insert/delete WAL record to identify if the
tuple belongs to a toast relation. This won't add any additional
overhead to WAL, will remove a lot of complexity from logical decoding,
and will also make decoding efficient. If this is feasible, then we can
do the same for speculative insertions.
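
For concreteness, a rough sketch of what the WAL-logging side could look
like (the flag name and bit value here are assumptions for illustration,
not taken from any posted patch):

	/* heapam_xlog.h: hypothetical flag bit in xl_heap_insert.flags */
	#define XLH_INSERT_ON_TOAST_RELATION	(1 << 4)

	/* heapam.c, while assembling the insert WAL record: */
	if (IsToastRelation(relation))
		xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;

The decoder could then test this flag in DecodeInsert() and know, without
any relcache access, that the change belongs to a toast relation.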

In patch v8-0013-Bugfix-handling-of-incomplete-toast-tuple, why is
the below change required?

--- a/contrib/test_decoding/logical.conf
+++ b/contrib/test_decoding/logical.conf
@@ -1,3 +1,4 @@
 wal_level = logical
 max_replication_slots = 4
 logical_decoding_work_mem = 64kB
+logging_collector=on

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#203Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#202)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Hmm, I think this can turn out to be inefficient because we can easily
end up spilling the data even when we don't need to do so. Consider
cases where part of the streamed changes are for toast, and the rest
are changes which we would have streamed and hence can be removed.
In such cases, we could have easily consumed the remaining changes for
toast without spilling. Also, I am not sure if spilling changes from
the hash table is a good idea, as they are no longer in the same order as
they were in the ReorderBuffer, which means the order in which we serialize
the changes normally would change, and that might have some impact, so
we would need some more study if we want to pursue this idea.

I have fixed this bug and attached it as a separate patch. I will
merge it to the main patch after we agree with the idea and after some
more testing.

The idea is that whenever we get a toast chunk, instead of directly
inserting it into the toast hash, I insert it into a local list, so
that if we don't get the change for the main table we can insert these
changes back into txn->changes. Once we do get the change for the main
table, I prepare the hash table to merge the chunks.

I think this idea will work but appears to be quite costly, because (a)
you might need to serialize/deserialize the changes multiple times and
might attempt streaming multiple times even though you can't do so, and (b)
you need to remove/add the same set of changes from the main list
multiple times.

I agree with this.

It seems to me that we need to add all of this new handling because,
while taking the decision whether to stream or not, we don't know
whether the txn has changes that can't be streamed. One idea to make
it work is that we identify it while decoding the WAL. I think we
need to set a bit in the insert/delete WAL record to identify if the
tuple belongs to a toast relation. This won't add any additional
overhead to WAL, will remove a lot of complexity from logical decoding,
and will also make decoding efficient. If this is feasible, then we can
do the same for speculative insertions.

The idea looks good to me. I will work on this.

In patch v8-0013-Bugfix-handling-of-incomplete-toast-tuple, why is
the below change required?

--- a/contrib/test_decoding/logical.conf
+++ b/contrib/test_decoding/logical.conf
@@ -1,3 +1,4 @@
wal_level = logical
max_replication_slots = 4
logical_decoding_work_mem = 64kB
+logging_collector=on

Sorry, these are some local changes which got included in the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#204Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#203)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Hmm, I think this can turn out to be inefficient because we can easily
end up spilling the data even when we don't need to do so. Consider
cases where part of the streamed changes are for toast, and the rest
are changes which we would have streamed and hence can be removed.
In such cases, we could have easily consumed the remaining changes for
toast without spilling. Also, I am not sure if spilling changes from
the hash table is a good idea, as they are no longer in the same order as
they were in the ReorderBuffer, which means the order in which we serialize
the changes normally would change, and that might have some impact, so
we would need some more study if we want to pursue this idea.

I have fixed this bug and attached it as a separate patch. I will
merge it to the main patch after we agree with the idea and after some
more testing.

The idea is that whenever we get a toast chunk, instead of directly
inserting it into the toast hash, I insert it into a local list, so
that if we don't get the change for the main table we can insert these
changes back into txn->changes. Once we do get the change for the main
table, I prepare the hash table to merge the chunks.

I think this idea will work but appears to be quite costly, because (a)
you might need to serialize/deserialize the changes multiple times and
might attempt streaming multiple times even though you can't do so, and (b)
you need to remove/add the same set of changes from the main list
multiple times.

I agree with this.

It seems to me that we need to add all of this new handling because,
while taking the decision whether to stream or not, we don't know
whether the txn has changes that can't be streamed. One idea to make
it work is that we identify it while decoding the WAL. I think we
need to set a bit in the insert/delete WAL record to identify if the
tuple belongs to a toast relation. This won't add any additional
overhead to WAL, will remove a lot of complexity from logical decoding,
and will also make decoding efficient. If this is feasible, then we can
do the same for speculative insertions.

The idea looks good to me. I will work on this.

One more thing we can do is to identify whether the tuple belongs to
a toast relation while decoding it. However, I think to do that we need
to have access to the relcache at that time, and that might add some
overhead as we need to do it for each tuple. Can we investigate
what it will take to do that and whether it is better than setting a bit
during WAL logging?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#205Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#204)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Hmm, I think this can turn out to be inefficient because we can easily
end up spilling the data even when we don't need to do so. Consider
cases where part of the streamed changes are for toast, and the rest
are changes which we would have streamed and hence can be removed.
In such cases, we could have easily consumed the remaining changes for
toast without spilling. Also, I am not sure if spilling changes from
the hash table is a good idea, as they are no longer in the same order as
they were in the ReorderBuffer, which means the order in which we serialize
the changes normally would change, and that might have some impact, so
we would need some more study if we want to pursue this idea.

I have fixed this bug and attached it as a separate patch. I will
merge it to the main patch after we agree with the idea and after some
more testing.

The idea is that whenever we get a toast chunk, instead of directly
inserting it into the toast hash, I insert it into a local list, so
that if we don't get the change for the main table we can insert these
changes back into txn->changes. Once we do get the change for the main
table, I prepare the hash table to merge the chunks.

I think this idea will work but appears to be quite costly, because (a)
you might need to serialize/deserialize the changes multiple times and
might attempt streaming multiple times even though you can't do so, and (b)
you need to remove/add the same set of changes from the main list
multiple times.

I agree with this.

It seems to me that we need to add all of this new handling because,
while taking the decision whether to stream or not, we don't know
whether the txn has changes that can't be streamed. One idea to make
it work is that we identify it while decoding the WAL. I think we
need to set a bit in the insert/delete WAL record to identify if the
tuple belongs to a toast relation. This won't add any additional
overhead to WAL, will remove a lot of complexity from logical decoding,
and will also make decoding efficient. If this is feasible, then we can
do the same for speculative insertions.

The idea looks good to me. I will work on this.

One more thing we can do is to identify whether the tuple belongs to
a toast relation while decoding it. However, I think to do that we need
to have access to the relcache at that time, and that might add some
overhead as we need to do it for each tuple. Can we investigate
what it will take to do that and whether it is better than setting a bit
during WAL logging?

IMHO, for the catalog scan, we will have to start/stop the transaction
for each change. So do you want us to evaluate its
performance? Also, at the time we get the change we might not have the
complete historic snapshot ready to fetch the relcache entry.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#206Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#205)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

It seems to me that we need to add all of this new handling because,
while taking the decision whether to stream or not, we don't know
whether the txn has changes that can't be streamed. One idea to make
it work is that we identify it while decoding the WAL. I think we
need to set a bit in the insert/delete WAL record to identify if the
tuple belongs to a toast relation. This won't add any additional
overhead to WAL, will remove a lot of complexity from logical decoding,
and will also make decoding efficient. If this is feasible, then we can
do the same for speculative insertions.

The idea looks good to me. I will work on this.

One more thing we can do is to identify whether the tuple belongs to
a toast relation while decoding it. However, I think to do that we need
to have access to the relcache at that time, and that might add some
overhead as we need to do it for each tuple. Can we investigate
what it will take to do that and whether it is better than setting a bit
during WAL logging?

IMHO, for the catalog scan, we will have to start/stop the transaction
for each change. So do you want us to evaluate its
performance?

No, I was not thinking about each change, but at the level of ReorderBufferTXN.

Also, at the time we get the change we might not have the
complete historic snapshot ready to fetch the relcache entry.

Before decoding each change (say DecodeInsert), we call
SnapBuildProcessChange. Isn't that sufficient?

Even if the above is possible, I am not sure how good it is to fetch
the relcache entry for each change; that is the point I was worried about.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#207Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#206)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jan 28, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

It seems to me that we need to add all of this new handling because,
while taking the decision whether to stream or not, we don't know
whether the txn has changes that can't be streamed. One idea to make
it work is that we identify it while decoding the WAL. I think we
need to set a bit in the insert/delete WAL record to identify if the
tuple belongs to a toast relation. This won't add any additional
overhead to WAL, will remove a lot of complexity from logical decoding,
and will also make decoding efficient. If this is feasible, then we can
do the same for speculative insertions.

The idea looks good to me. I will work on this.

One more thing we can do is to identify whether the tuple belongs to
a toast relation while decoding it. However, I think to do that we need
to have access to the relcache at that time, and that might add some
overhead as we need to do it for each tuple. Can we investigate
what it will take to do that and whether it is better than setting a bit
during WAL logging?

IMHO, for the catalog scan, we will have to start/stop the transaction
for each change. So do you want us to evaluate its
performance?

No, I was not thinking about each change, but at the level of ReorderBufferTXN.

That means we will have to keep that transaction open until we decode
the commit WAL for that ReorderBufferTXN, or do you have anything else in
mind?

Also, at the time we get the change we might not have the
complete historic snapshot ready to fetch the relcache entry.

Before decoding each change (say DecodeInsert), we call
SnapBuildProcessChange. Isn't that sufficient?

Yeah, right, we can get the relcache entry based on the base snapshot,
and that might be sufficient to know whether it's a toast relation or
not.

Even if the above is possible, I am not sure how good it is to fetch
the relcache entry for each change; that is the point I was worried about.

We might not need to scan the catalog every time; we might get it from
the cache itself.
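
As a sketch, assuming the historic snapshot is already set up, the
per-change check might look like this (reloid stands for whatever
relation OID the decoder has resolved for the change):

	/* Hypothetical check inside the decoding path. */
	Relation	relation = RelationIdGetRelation(reloid);

	if (RelationIsValid(relation))
	{
		bool		is_toast = IsToastRelation(relation);

		RelationClose(relation);
		/* remember is_toast when deciding whether to stream */
	}

A repeated lookup hits the relcache, so the catalog is only scanned on a
cache miss.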

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#208Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#207)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jan 28, 2020 at 1:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 28, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

It seems to me that we need to add all of this new handling because,
while taking the decision whether to stream or not, we don't know
whether the txn has changes that can't be streamed. One idea to make
it work is that we identify it while decoding the WAL. I think we
need to set a bit in the insert/delete WAL record to identify if the
tuple belongs to a toast relation. This won't add any additional
overhead to WAL, will remove a lot of complexity from logical decoding,
and will also make decoding efficient. If this is feasible, then we can
do the same for speculative insertions.

The idea looks good to me. I will work on this.

One more thing we can do is to identify whether the tuple belongs to
a toast relation while decoding it. However, I think to do that we need
to have access to the relcache at that time, and that might add some
overhead as we need to do it for each tuple. Can we investigate
what it will take to do that and whether it is better than setting a bit
during WAL logging?

IMHO, for the catalog scan, we will have to start/stop the transaction
for each change. So do you want us to evaluate its
performance?

No, I was not thinking about each change, but at the level of ReorderBufferTXN.

That means we will have to keep that transaction open until we decode
the commit WAL for that ReorderBufferTXN, or do you have anything else in
mind?

or probably till we start streaming.

Also, at the time we get the change we might not have the
complete historic snapshot ready to fetch the relcache entry.

Before decoding each change (say DecodeInsert), we call
SnapBuildProcessChange. Isn't that sufficient?

Yeah, right, we can get the relcache entry based on the base snapshot,
and that might be sufficient to know whether it's a toast relation or
not.

Even if the above is possible, I am not sure how good it is to fetch
the relcache entry for each change; that is the point I was worried about.

We might not need to scan the catalog every time; we might get it from
the cache itself.

Right, but I am not completely sure if that is better than setting a
bit in the WAL record for toast tuples.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#209Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#187)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few more comments:
--------------------------------
v4-0007-Implement-streaming-mode-in-ReorderBuffer
1.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+ * information about subtransactions, which could arrive after streaming start.
+ */
+ if (!txn->is_schema_sent)
+     snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+                                          txn, command_id);
..
}

Why are we using base snapshot here instead of the snapshot we saved
the first time streaming has happened? And as mentioned in comments,
won't we need to consider the snapshots for subtransactions that
arrived after the last time we have streamed the changes?

Fixed

+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * We can not use txn->snapshot_now directly because after we there
+ * might be some new sub-transaction which after the last streaming run
+ * so we need to add those sub-xip in the snapshot.
+ */
+ snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+ txn, command_id);

"because after we there", you seem to forget a word between 'we' and
'there'. So as we are copying it now, does this mean it will consider
the snapshots for subtransactions that arrived after the last time we
have streamed the changes? If so, have you tested it, and can we add
the same in the comments?

Also, if we need to copy the snapshot here, then do we need to copy it
again in ReorderBufferProcessTXN (in the code below and in the catch
block of the same function)?

{
..
+ /*
+ * Remember the command ID and snapshot if transaction is streaming
+ * otherwise free the snapshot if we have copied it.
+ */
+ if (streaming)
+ {
+ txn->command_id = command_id;
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+   txn, command_id);
+ }
+ else if (snapshot_now->copied)
+ ReorderBufferFreeSnap(rb, snapshot_now);
..
}

4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
fields like origin_id, origin_lsn as we do in ReorderBufferCommit(),
especially to cover the case when it gets called due to memory
overflow (aka via ReorderBufferCheckMemoryLimit)?

We get origin_lsn at commit time, so I am not sure how we can do
that. I have also noticed that currently we are not using origin_lsn
on the subscriber side. I think this needs more investigation: if we
want this, do we need to log it early?

Have you done any investigation of this point? You might want to look
at the pg_replication_origin* APIs. Today, again looking at this code, I
think with the current coding it won't be used even when we encounter the
commit record, because ReorderBufferCommit calls
ReorderBufferStreamCommit, which will make sure that origin_id and
origin_lsn are never sent. I think at least that should be fixed; if
not, we probably need a comment explaining why we think it is
okay not to do so in this case.

+ /*
+ * If we are streaming the in-progress transaction then Discard the

/Discard/discard

v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
1.
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));

Why here we can't use TransactionIdDidAbort? If we can't use it, then
can you add comments stating the reason of the same.

Done

+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out.  Instead of directly checking the abort status we do check
+ * if it is not in progress transaction and no committed. Because if there
+ * were a system crash then status of the the transaction which were running
+ * at that time might not have marked.  So we need to consider them as
+ * aborted.  Refer detailed comments at snapmgr.c where the variable is
+ * declared.

How about replacing the above comment with below one:

If CheckXidAlive is valid, then we check if it aborted. If it did, we
error out. We can't directly use TransactionIdDidAbort as after crash
such transaction might not have been marked as aborted. See detailed
comments at snapmgr.c where the variable is declared.

I am not able to understand the change in
v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup. Do you have
any explanation for the same?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#210Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#209)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few more comments:
--------------------------------
v4-0007-Implement-streaming-mode-in-ReorderBuffer
1.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+ * information about subtransactions, which could arrive after streaming start.
+ */
+ if (!txn->is_schema_sent)
+     snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+                                          txn, command_id);
..
}

Why are we using base snapshot here instead of the snapshot we saved
the first time streaming has happened? And as mentioned in comments,
won't we need to consider the snapshots for subtransactions that
arrived after the last time we have streamed the changes?

Fixed

+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * We can not use txn->snapshot_now directly because after we there
+ * might be some new sub-transaction which after the last streaming run
+ * so we need to add those sub-xip in the snapshot.
+ */
+ snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+ txn, command_id);

"because after we there", you seem to forget a word between 'we' and
'there'. So as we are copying it now, does this mean it will consider
the snapshots for subtransactions that arrived after the last time we
have streamed the changes? If so, have you tested it and can we add
the same in comments.

Ok

Also, if we need to copy the snapshot here, then do we need to again
copy it in ReorderBufferProcessTXN(in below code and in catch block in
the same function).

I think so, because as part of a
"REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change we might point
directly to the snapshot, and that will be freed when we truncate all
the changes of the ReorderBufferTXN. So I think we can check
snapshot_now->copied: if it is true then we can avoid copying,
otherwise we copy?

Other comments look fine to me so I will reply to them along with the
next version of the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#211Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#210)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Also, if we need to copy the snapshot here, then do we need to again
copy it in ReorderBufferProcessTXN(in below code and in catch block in
the same function).

I think so, because as part of a
"REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change we might point
directly to the snapshot, and that will be freed when we truncate all
the changes of the ReorderBufferTXN. So I think we can check
snapshot_now->copied: if it is true then we can avoid copying,
otherwise we copy?

Yeah, that makes sense, but then I think we also need to ensure that
ReorderBufferStreamTXN frees the snapshot only when it is copied. It
seems to me it should always be copied at the place where we are trying
to free it, so we should probably have an Assert there.
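
A hedged sketch of both pieces of that idea, using the names from the
patch:

/* in ReorderBufferStreamTXN: avoid re-copying an already-private copy */
if (txn->snapshot_now->copied)
	snapshot_now = txn->snapshot_now;
else
	snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
										 txn, command_id);

/* ... stream the changes ... */

/* by the time we free it, the snapshot should always be a copy */
Assert(snapshot_now->copied);
ReorderBufferFreeSnap(rb, snapshot_now);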

One more thing:
ReorderBufferProcessTXN()
{
..
+ if (streaming)
+ {
+ /*
+ * While streaming an in-progress transaction there is a
+ * possibility that the (sub)transaction might get aborted
+ * concurrently.  In such case if the (sub)transaction has
+ * catalog update then we might decode the tuple using wrong
+ * catalog version.  So for detecting the concurrent abort we
+ * set CheckXidAlive to the current (sub)transaction's xid for
+ * which this change belongs to.  And, during catalog scan we
+ * can check the status of the xid and if it is aborted we will
+ * report a specific error which we can ignore.  We might have
+ * already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the
+ * abort we will stream abort message to truncate the changes in
+ * the subscriber.
+ */
+ CheckXidAlive = change->txn->xid;
+ }
..
}

I think it is better to move the above code into an inline function
(something like SetXidAlive). It will make the code in function
ReorderBufferProcessTXN look cleaner and easier to understand.
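
As a sketch, the helper could look like this (the name follows the
suggestion above; its exact shape is an assumption):

/*
 * Remember the xid of the (sub)transaction whose change we are about to
 * apply, so that catalog scans can detect a concurrent abort.
 */
static inline void
SetXidAlive(TransactionId xid)
{
	/* nothing to do if we are already watching this xid */
	if (TransactionIdEquals(CheckXidAlive, xid))
		return;

	/*
	 * Track the xid only while it is uncommitted; the abort check itself
	 * happens during catalog access.
	 */
	CheckXidAlive = TransactionIdDidCommit(xid) ? InvalidTransactionId : xid;
}

ReorderBufferProcessTXN would then simply call
SetXidAlive(change->txn->xid) in the streaming branch.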

Other comments look fine to me so I will reply to them along with the
next version of the patch.

Okay, thanks.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#212Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#210)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Other comments look fine to me so I will reply to them along with the
next version of the patch.

This still needs more work, so I have moved this to the next CF.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#213Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#189)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jan 10, 2020 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

2. During commit time, in DecodeCommit, we check whether we need to
skip the changes of the transaction by calling SnapBuildXactNeedsSkip.
But since we now support streaming, it's possible that before we
decode the commit WAL we have already sent the changes to the output
plugin, even though we could have skipped them. So my question is:
instead of checking at commit time, can't we check before adding to
the ReorderBuffer itself?

I think if we can do that then the same will be true for current code
irrespective of this patch. I think it is possible that we can't take
that decision while decoding because we haven't assembled a consistent
snapshot yet. I think we might be able to do that while we try to
stream the changes. I think we need to take care of all the
conditions during streaming (when the logical_decoding_work_mem limit
is reached) as we do in DecodeCommit. This needs a bit more study.

I have analyzed this further and I think we can not decide all the
conditions even while streaming, because IMHO once we reach
SNAPBUILD_FULL_SNAPSHOT we add the changes to the reorder buffer, so
that they can be replayed if the commit of the transaction arrives
after we reach SNAPBUILD_CONSISTENT. However, if we get the commit
before we reach SNAPBUILD_CONSISTENT then we need to ignore this
transaction. So even with SNAPBUILD_FULL_SNAPSHOT we may stream
changes which later get dropped, and that we can not decide while
streaming.

This makes sense to me, but we should add a comment where we stream,
saying we can't skip there the way we do at commit time, for the
reason you describe above. Also, what about the other conditions where
we can skip the transaction, basically cases like (a) when the
transaction happened in another database, (b) when the output plugin
is not interested in the origin, and (c) when we are doing
fast-forwarding?
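
For reference, the commit-time test in DecodeCommit that covers these
cases looks roughly like this (condensed from decode.c):

/* condensed from DecodeCommit() in decode.c */
if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
	(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
	ctx->fast_forward || FilterByOrigin(ctx, origin_id))
{
	/* forget the transaction: changes from before the start point,
	 * another database (a), a filtered origin (b), or fast-forward
	 * mode (c) */
}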

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#214Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#213)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Feb 3, 2020 at 9:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 10, 2020 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

2. During commit time, in DecodeCommit, we check whether we need to
skip the changes of the transaction by calling SnapBuildXactNeedsSkip.
But since we now support streaming, it's possible that before we
decode the commit WAL we have already sent the changes to the output
plugin, even though we could have skipped them. So my question is:
instead of checking at commit time, can't we check before adding to
the ReorderBuffer itself?

I think if we can do that then the same will be true for current code
irrespective of this patch. I think it is possible that we can't take
that decision while decoding because we haven't assembled a consistent
snapshot yet. I think we might be able to do that while we try to
stream the changes. I think we need to take care of all the
conditions during streaming (when the logical_decoding_work_mem limit
is reached) as we do in DecodeCommit. This needs a bit more study.

I have analyzed this further and I think we can not decide all the
conditions even while streaming, because IMHO once we reach
SNAPBUILD_FULL_SNAPSHOT we add the changes to the reorder buffer, so
that they can be replayed if the commit of the transaction arrives
after we reach SNAPBUILD_CONSISTENT. However, if we get the commit
before we reach SNAPBUILD_CONSISTENT then we need to ignore this
transaction. So even with SNAPBUILD_FULL_SNAPSHOT we may stream
changes which later get dropped, and that we can not decide while
streaming.

This makes sense to me, but we should add a comment where we stream,
saying we can't skip there the way we do at commit time, for the
reason you describe above. Also, what about the other conditions where
we can skip the transaction, basically cases like (a) when the
transaction happened in another database, (b) when the output plugin
is not interested in the origin, and (c) when we are doing
fast-forwarding?

I will analyze those and fix in my next version of the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#215Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#204)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Hmm, I think this can turn out to be inefficient because we can easily
end up spilling the data even when we don't need to do so. Consider
cases where part of the streamed changes are for toast, and the
remaining are changes which we would have streamed and hence can be
removed. In such cases, we could have easily consumed the remaining
changes for toast without spilling. Also, I am not sure if spilling
changes from the hash table is a good idea, as they are no longer in
the same order as they were in the ReorderBuffer, which means the
order in which we would normally serialize the changes would change,
and that might have some impact; so we would need some more study if
we want to pursue this idea.

I have fixed this bug and attached it as a separate patch. I will
merge it into the main patch after we agree on the idea and after some
more testing.

The idea is that whenever we get a toasted chunk, instead of directly
inserting it into the toast hash I insert it into a local list, so
that if we don't get the change for the main table we can insert these
changes back into txn->changes. Once we get the change for the main
table, I prepare the hash table to merge the chunks.

I think this idea will work but appears to be quite costly because (a)
you might need to serialize/deserialize the changes multiple times and
might attempt streaming multiple times even though you can't do so,
and (b) you need to remove/add the same set of changes from the main
list multiple times.

I agree with this.

It seems to me that we need to add all of this new handling because,
while taking the decision whether to stream or not, we don't know
whether the txn has changes that can't be streamed. One idea to make
it work is that we identify it while decoding the WAL. I think we need
to set a bit in the insert/delete WAL record to identify if the tuple
belongs to a toast relation. This won't add any additional overhead in
WAL, will reduce a lot of complexity in the logical decoding, and
decoding will also be efficient. If this is feasible, then we can do
the same for speculative insertions.

The idea looks good to me. I will work on this.

One more thing we can do is to identify whether the tuple belongs to a
toast relation while decoding it. However, I think to do that we need
to have access to the relcache at that time, and that might add some
overhead as we need to do it for each tuple. Can we investigate what
it would take to do that, and whether it is better than setting a bit
during WAL logging?

I have done some more analysis on this and it appears that there are a
few problems in doing this. Basically, once we get the confirmed flush
location, we advance the replication_slot_catalog_xmin so that vacuum
can garbage collect the old tuples. So the problem is that while we
are collecting the changes in the ReorderBuffer, our catalog version
might have been removed, and we might not find any relation entry with
that relfilenodeid (because it is dropped or altered in the future).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#216Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#215)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Feb 4, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

One more thing we can do is to identify whether the tuple belongs to a
toast relation while decoding it. However, I think to do that we need
to have access to the relcache at that time, and that might add some
overhead as we need to do it for each tuple. Can we investigate what
it would take to do that, and whether it is better than setting a bit
during WAL logging?

I have done some more analysis on this and it appears that there are a
few problems in doing this. Basically, once we get the confirmed flush
location, we advance the replication_slot_catalog_xmin so that vacuum
can garbage collect the old tuples. So the problem is that while we
are collecting the changes in the ReorderBuffer, our catalog version
might have been removed, and we might not find any relation entry with
that relfilenodeid (because it is dropped or altered in the future).

Hmm, this means this can also occur while streaming the changes. The
main reason, as I understand it, is that before decoding the commit we
don't know whether these changes have already been sent to the
subscriber (based on confirmed_flush_location/start_decoding_at). I
think it is better to skip streaming such transactions, as we can't
make the right decision about them; and since this generally happens
only for the first few transactions after a crash, it shouldn't matter
much if we serialize such transactions instead of streaming them.
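
Purely as a hypothetical sketch of that fallback (CanStreamSafely is an
invented predicate; the real condition would need to be worked out):

/* hypothetical sketch, inside ReorderBufferCheckMemoryLimit */
if (CanStreamSafely(rb, txn))		/* invented predicate */
	ReorderBufferStreamTXN(rb, txn);
else
	ReorderBufferSerializeTXN(rb, txn);	/* spill to disk instead */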

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#217Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#209)
12 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few more comments:
--------------------------------
v4-0007-Implement-streaming-mode-in-ReorderBuffer
1.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+ * information about subtransactions, which could arrive after streaming start.
+ */
+ if (!txn->is_schema_sent)
+     snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+                                          txn, command_id);
..
}

Why are we using base snapshot here instead of the snapshot we saved
the first time streaming has happened? And as mentioned in comments,
won't we need to consider the snapshots for subtransactions that
arrived after the last time we have streamed the changes?

Fixed

+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * We can not use txn->snapshot_now directly because after we there
+ * might be some new sub-transaction which after the last streaming run
+ * so we need to add those sub-xip in the snapshot.
+ */
+ snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+ txn, command_id);

"because after we there", you seem to forget a word between 'we' and
'there'.

Fixed

So as we are copying it now, does this mean it will consider the
snapshots for subtransactions that arrived after the last time we have
streamed the changes? If so, have you tested it and can we add the
same in comments.

Yes, I have tested it. Comment added.

Also, if we need to copy the snapshot here, then do we need to again
copy it in ReorderBufferProcessTXN(in below code and in catch block in
the same function).

{
..
+ /*
+ * Remember the command ID and snapshot if transaction is streaming
+ * otherwise free the snapshot if we have copied it.
+ */
+ if (streaming)
+ {
+ txn->command_id = command_id;
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+   txn, command_id);
+ }
+ else if (snapshot_now->copied)
+ ReorderBufferFreeSnap(rb, snapshot_now);
..
}

Fixed

4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
fields like origin_id, origin_lsn as we do in ReorderBufferCommit()
especially to cover the case when it gets called due to memory
overflow (aka via ReorderBufferCheckMemoryLimit).

We get origin_lsn at commit time, so I am not sure how we can do that.
I have also noticed that currently we are not using origin_lsn on the
subscriber side. I think this needs more investigation: if we want
this, do we need to log it earlier?

Have you done any investigation of this point? You might want to look
at the pg_replication_origin* APIs. Today, looking at this code again,
I think with the current coding it won't be used even when we
encounter the commit record, because ReorderBufferCommit calls
ReorderBufferStreamCommit, which will make sure that origin_id and
origin_lsn are never sent. I think at least that should be fixed; if
not, we probably need a comment with the reasoning why we think it is
okay not to do it in this case.

Still, the problem is the same because currently we send origin_lsn as
part of the "pgoutput_begin" message, and for a streaming transaction
we have already sent the stream start. We might send it during the
stream commit instead, but I am not completely sure, because the
current consumer of this message, "apply_handle_origin", just ignores
it. I have also looked into the pg_replication_origin* APIs; they are
used for setting the origin id and tracking the progress, but they do
not consume the origin_lsn we send in pgoutput_begin, so this is not
directly related.
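
For reference, the subscriber-side handler indeed only validates the
position of the message and never reads the origin payload (paraphrased
from src/backend/replication/logical/worker.c):

static void
apply_handle_origin(StringInfo s)
{
	/*
	 * ORIGIN message can only come inside a remote transaction and before
	 * any actual writes; the origin name and lsn carried in 's' are
	 * otherwise ignored.
	 */
	if (!in_remote_transaction ||
		(IsTransactionState() && !am_tablesync_worker()))
		ereport(ERROR,
				(errcode(ERRCODE_PROTOCOL_VIOLATION),
				 errmsg("ORIGIN message sent out of order")));
}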

+ /*
+ * If we are streaming the in-progress transaction then Discard the

/Discard/discard

Done

v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
1.
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));

Why here we can't use TransactionIdDidAbort? If we can't use it, then
can you add comments stating the reason of the same.

Done

+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out.  Instead of directly checking the abort status we do check
+ * if it is not in progress transaction and no committed. Because if there
+ * were a system crash then status of the the transaction which were running
+ * at that time might not have marked.  So we need to consider them as
+ * aborted.  Refer detailed comments at snapmgr.c where the variable is
+ * declared.

How about replacing the above comment with below one:

If CheckXidAlive is valid, then we check if it aborted. If it did, we
error out. We can't directly use TransactionIdDidAbort as after crash
such transaction might not have been marked as aborted. See detailed
comments at snapmgr.c where the variable is declared.

Done

I am not able to understand the change in
v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup. Do you have
any explanation for the same?

It appears that in ReorderBufferCommitChild we always set the
final_lsn of the subxacts, so it should not be invalid. For testing, I
changed this to an assert and checked, but it never hit. So maybe we
can remove this change.
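
That matches the code; condensed from ReorderBufferCommitChild() in
reorderbuffer.c:

subtxn = ReorderBufferTXNByXid(rb, subxid, false, NULL,
							   InvalidXLogRecPtr, false);
if (!subtxn)
	return;					/* subxact contained no changes */

subtxn->final_lsn = commit_lsn;
subtxn->end_lsn = end_lsn;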

Apart from that, I have fixed the toast tuple streaming bug by setting
the flag bit in the WAL (attached as 0012). I have also extended this
solution to handle the speculative insert bug, so the old patch for
the speculative insert bug fix is removed. I am also exploring how we
can do this without setting the flag in the WAL, as we discussed
upthread.
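
As a sketch of the flag-bit approach (the flag name here is an
assumption about what 0012 does; the surrounding lines are the existing
logical-decoding block in heap_insert):

if (RelationIsLogicallyLogged(relation) &&
	!(options & HEAP_INSERT_NO_LOGICAL))
{
	xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
	bufflags |= REGBUF_KEEP_DATA;

	/* assumed flag: mark tuples of toast relations so the decoder can
	 * tell without a relcache lookup */
	if (IsToastRelation(relation))
		xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
}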

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v9-0002-Issue-individual-invalidations-with-wal_level-log.patch (application/octet-stream)
From 615734e42767e840360c2524dad52defeb0e9aa9 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH v9 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record type
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations was accumulating all the invalidations in
memory, and then only wrote them once at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c          | 50 +++++++++++++++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 23 ++++++++
 src/backend/replication/logical/reorderbuffer.c | 55 +++++++++++++++---
 src/backend/utils/cache/inval.c                 | 75 +++++++++++++++++++++++++
 src/include/access/xact.h                       | 18 +++++-
 src/include/replication/reorderbuffer.h         | 14 +++++
 7 files changed, 234 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index fbc5942..6191060 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,11 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +401,14 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs,
+								xlrec->dbId, xlrec->tsId,
+								xlrec->relcacheInitFileInval);
+	}
 }
 
 const char *
@@ -423,7 +436,44 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval)
+{
+	int			i;
+
+	if (relcacheInitFileInval)
+		appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
+						 dbId, tsId);
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index da32a4f..c9a64bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5997,6 +5997,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index a99fcaf..13a11ac 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,29 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/* XXX for now we're issuing invalidations one by one */
+				Assert(invals->nmsgs == 1);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->dbId, invals->tsId,
+											 invals->relcacheInitFileInval,
+											 invals->msgs[0]);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index aeebbf2..3d6cbcf 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -473,6 +473,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -1822,17 +1823,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2212,6 +2218,38 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn,
+							 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+							 SharedInvalidationMessage msg)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	Assert(xid != InvalidTransactionId);
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.dbId = dbId;
+	change->data.inval.tsId = tsId;
+	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
+	change->data.inval.msg = msg;
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2658,6 +2696,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -2765,6 +2804,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -3050,6 +3090,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..e0d04b9 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write individual invalidations into WAL to support
+ *	the decoding of the in-progress transaction.  As of now it was enough to
+ *	log invalidation only at commit because we are only decoding the transaction
+ *	at the commit time.   We only need to log the catalog cache and relcache
+ *	invalidation.  There can not be any active MVCC scan in logical decoding so
+ *	we don't need to log the snapshot invalidation.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,9 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -489,6 +499,18 @@ RegisterCatcacheInvalidation(int cacheId,
 {
 	AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   cacheId, hashValue, dbId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cc.id = (int8) cacheId;
+		msg.cc.dbId = dbId;
+		msg.cc.hashValue = hashValue;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -501,6 +523,18 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 {
 	AddCatalogInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								  dbId, catId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cat.id = SHAREDINVALCATALOG_ID;
+		msg.cat.dbId = dbId;
+		msg.cat.catId = catId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -511,6 +545,8 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 static void
 RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 {
+	bool		RelcacheInitFileInval = false;
+
 	AddRelcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
 
@@ -529,7 +565,22 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 	 * as well.  Also zap when we are invalidating whole relcache.
 	 */
 	if (relId == InvalidOid || RelationIdIsInInitFile(relId))
+	{
 		transInvalInfo->RelcacheInitFileInval = true;
+		RelcacheInitFileInval = true;
+	}
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.rc.id = SHAREDINVALRELCACHE_ID;
+		msg.rc.dbId = dbId;
+		msg.rc.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, RelcacheInitFileInval);
+	}
 }
 
 /*
@@ -1501,3 +1552,27 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval)
+{
+	xl_xact_invalidations xlrec;
+
+	/* prepare record */
+	memset(&xlrec, 0, sizeof(xlrec));
+	xlrec.dbId = MyDatabaseId;
+	xlrec.tsId = MyDatabaseTableSpace;
+	xlrec.relcacheInitFileInval = relcacheInitFileInval;
+	xlrec.nmsgs = nmsgs;
+
+	/* perform insertion */
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+	XLogRegisterData((char *) msgs,
+					 nmsgs * sizeof(SharedInvalidationMessage));
+	XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b38..6f2a583 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,22 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ *
+ * XXX Currently nmsgs=1 but that might change in the future.
+ */
+typedef struct xl_xact_invalidations
+{
+	Oid			dbId;			/* MyDatabaseId */
+	Oid			tsId;			/* MyDatabaseTableSpace */
+	bool		relcacheInitFileInval;	/* invalidate relcache init file */
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..9a3f045 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,16 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			Oid			dbId;	/* MyDatabaseId */
+			Oid			tsId;	/* MyDatabaseTableSpace */
+			bool		relcacheInitFileInval;	/* invalidate relcache init
+												 * file */
+			SharedInvalidationMessage msg;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +470,9 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  Oid dbId, Oid tsId, bool relcacheInitFileInval,
+								  SharedInvalidationMessage msg);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1

v9-0004-Gracefully-handle-concurrent-aborts-of-uncommitte.patch (application/octet-stream)
From b2fb494992211c457fd4b1508b4b501fe7d0a9c7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:32:34 +0530
Subject: [PATCH v9 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from the system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such a sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 41 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 40 ++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  8 ++---
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++++--
 src/include/utils/snapmgr.h                     |  4 ++-
 6 files changed, 115 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index ace21ec..319349a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index db6fad7..24d0d7a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1304,6 +1304,15 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1423,6 +1432,14 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1537,6 +1554,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_hot_search_buffer call during logical decoding");
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1686,6 +1711,14 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_get_latest_tid call during logical decoding");
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5483,6 +5516,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_finish_speculative call during logical decoding");
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05..5b0ef72 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,19 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  We can't directly use TransactionIdDidAbort as after crash
+	 * such transaction might not have been marked as aborted.  See detailed
+	 * comments at snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +527,19 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  We can't directly use TransactionIdDidAbort as after crash
+	 * such transaction might not have been marked as aborted.  See detailed
+	 * comments at snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +666,19 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  We can't directly use TransactionIdDidAbort as after crash
+	 * such transaction might not have been marked as aborted.  See detailed
+	 * comments at snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3d6cbcf..3ca960c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -692,7 +692,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1551,7 +1551,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1802,7 +1802,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +1822,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c5..93a0c04 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This is to re-check XID status while accessing catalog.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * setup CheckXidAlive if it's not committed yet. We don't check
+	 * if the xid aborted. That will happen during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13c..12f737b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1

v9-0001-Immediately-WAL-log-assignments.patch (application/octet-stream)
From 09c846e3b5a04e56046866b3421c5887503e58e9 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 09:53:08 +0530
Subject: [PATCH v9 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we can not
remove the existing XLOG_XACT_ASSIGNMENT wal as that is required
for avoiding overflow in the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 45 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 +++++++++++++--------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e3c60f2..da32a4f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -5096,6 +5097,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -5998,3 +6000,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid substransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2fa0a7f..b11b0c2 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 32f0225..51b6485 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1186,6 +1186,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1224,6 +1225,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5e1dc8a..a99fcaf 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. This must happen for all records, hence we do
+	 * it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 record->toplevel_xid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033f..e23892a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196..756f6df 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -150,6 +150,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -285,6 +287,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1
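
To make the mechanism above concrete: on the decoding side every record may
now carry the toplevel XID, and the decoder registers the subxact
assignment before dispatching the record. The following condensed sketch
restates the decode.c hunk from this patch (the helper name is
hypothetical; the calls are the ones added above):

    /* hypothetical helper condensing the decode.c change in this patch */
    static void
    AssignSubxactFromRecord(LogicalDecodingContext *ctx,
                            XLogReaderState *record, XLogRecPtr origptr)
    {
        /* valid only when the record piggybacked a pending assignment */
        TransactionId top_xid = XLogRecGetTopXid(record);

        if (TransactionIdIsValid(top_xid))
            ReorderBufferAssignChild(ctx->reorder,
                                     top_xid,               /* toplevel xact */
                                     XLogRecGetXid(record), /* the subxact */
                                     origptr);
    }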

v9-0005-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From 2771f016d8c329a72c91a38bffac05878456b172 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 10 Jan 2020 18:03:27 -0300
Subject: [PATCH v9 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds a ReorderBufferTXN pointer in two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets aborted).
---
 src/backend/access/heap/heapam_visibility.c     |  38 +-
 src/backend/replication/logical/reorderbuffer.c | 710 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  36 ++
 3 files changed, 693 insertions(+), 91 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..160b167 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3ca960c..f68b2e4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -769,6 +782,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -864,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -987,7 +1035,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1023,6 +1071,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1037,6 +1088,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1320,6 +1374,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1345,8 +1408,93 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * Subtransactions are only marked as streamed when they actually
+	 * contain changes.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We build the hash table even if there are no CIDs. That's because
+ * when streaming in-progress transactions we may run into tuples whose
+ * CIDs have not been decoded yet. Think e.g. of an INSERT followed by
+ * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+ * INSERT. Building the hash table unconditionally ensures that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1354,9 +1502,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1495,63 +1640,75 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, the (sub)transaction might get
+ * aborted concurrently.  If the (sub)transaction made catalog changes, we
+ * might then decode tuples using the wrong catalog version.  To detect a
+ * concurrent abort, we set CheckXidAlive to the xid of the (sub)transaction
+ * the current change belongs to.  During catalog scans we check the status
+ * of that xid, and if it has aborted we report a specific error that we can
+ * ignore.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine: when we decode the abort we
+ * will send a stream abort message to truncate the changes on the
+ * subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Set CheckXidAlive if the transaction is not committed yet. We don't
+	 * check whether the xid aborted; that happens during catalog access.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
-	}
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send the data of a transaction (and its subtransactions) to the output
+ * plugin. If streaming is true, the data is sent using the streaming API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 
-	/* build data to be able to lookup the CommandIds of catalog tuples */
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1567,15 +1724,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1583,6 +1745,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1592,8 +1767,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1659,7 +1832,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1680,8 +1861,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1699,7 +1878,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1757,7 +1936,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1766,10 +1953,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+									change->data.msg.prefix,
+									change->data.msg.message_size,
+									change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1800,9 +1993,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +2015,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
 					}
 
 					break;
@@ -1860,14 +2054,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last LSN of the stream as the final_lsn before
+			 * calling stream_stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+
+			/* Avoid copying if it's already copied. */
+			if (snapshot_now->copied)
+				txn->snapshot_now = snapshot_now;
+			else
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1885,14 +2111,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,18 +2145,122 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+
+				/* Avoid copying if it's already copied. */
+				if (snapshot_now->copied)
+					txn->snapshot_now = snapshot_now;
+				else
+					txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+															  txn, command_id);
+				/*
+				 * Set the last LSN of the stream as the final_lsn before
+				 * calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+
+				FlushErrorState();
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
 /*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2360,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2502,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we update the toplevel transaction's
+ * counters instead - we can't stream subtransactions individually
+ * anyway, and we only pick toplevel transactions for eviction.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2520,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2532,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
 }
 
 /*
@@ -2211,6 +2582,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2300,6 +2672,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2404,6 +2783,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (with streaming we don't update the
+ * memory accounting for subtransactions, so their size is always 0). But
+ * it can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2423,15 +2834,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2734,6 +3176,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check whether the transaction has any changes left to
+ * stream (it may have been streamed just before the commit, in which case
+ * the commit would attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because new
+		 * sub-transactions might have appeared after the last streaming
+		 * run, and we need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 15bb5ed..adb8f9d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -192,6 +193,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -227,6 +246,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Have we sent any changes for this transaction to the output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -257,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
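
The key control-flow change in this patch is in
ReorderBufferCheckMemoryLimit: once the memory limit is exceeded, the
largest toplevel transaction is streamed when the plugin supports it, and
the largest (sub)transaction is spilled to disk otherwise. A condensed
sketch of that decision, using only functions shown in the hunks above
(the helper name is hypothetical):

    /* hypothetical helper condensing ReorderBufferCheckMemoryLimit */
    static void
    EvictIfOverLimit(ReorderBuffer *rb)
    {
        ReorderBufferTXN *txn;

        /* bail out quickly if we are still below the limit */
        if (rb->size < logical_decoding_work_mem * 1024L)
            return;

        if (ReorderBufferCanStream(rb))
        {
            /* stream the already-decoded part of the largest toplevel xact */
            txn = ReorderBufferLargestTopTXN(rb);
            ReorderBufferStreamTXN(rb, txn);
        }
        else
        {
            /* spill the largest (sub)transaction to disk */
            txn = ReorderBufferLargestTXN(rb);
            ReorderBufferSerializeTXN(rb, txn);
        }

        /* either way, the evicted xact should hold no changes in memory */
        Assert(txn->size == 0);
    }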

v9-0003-Extend-the-output-plugin-API-with-stream-methods.patch (application/octet-stream)
From 82a6796f85a3a6ed341f250af3e4c4ff4edd8444 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v9 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..64f651f 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db9686..ace21ec 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
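+   <para>
+    For illustration, here is a minimal sketch of how a plugin might register
+    these callbacks from its <function>_PG_output_plugin_init</function>
+    function (the <literal>my_*</literal> names are hypothetical):
+<programlisting>
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+    /* regular callbacks, as before */
+    cb->startup_cb = my_startup;
+    cb->begin_cb = my_begin;
+    cb->change_cb = my_change;
+    cb->commit_cb = my_commit;
+
+    /* streaming callbacks */
+    cb->stream_start_cb = my_stream_start;
+    cb->stream_stop_cb = my_stream_stop;
+    cb->stream_change_cb = my_stream_change;
+    cb->stream_commit_cb = my_stream_commit;
+    cb->stream_abort_cb = my_stream_abort;
+}
+</programlisting>
+   </para>
+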
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
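+   <para>
+    A sketch of what an individual streaming callback might do (hypothetical
+    plugin code; a real plugin would emit the actual tuple data, much like
+    its regular <function>change_cb</function>):
+<programlisting>
+static void
+my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+                 Relation relation, ReorderBufferChange *change)
+{
+    OutputPluginPrepareWrite(ctx, true);
+    appendStringInfo(ctx->out, "streamed change for transaction %u",
+                     txn->xid);
+    OutputPluginWrite(ctx, true);
+}
+</programlisting>
+   </para>
+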
+   <para>
+    Similar to the spill-to-disk behavior, streaming is triggered when the
+    total amount of changes decoded from the WAL (for all in-progress
+    transactions) exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting. At that point the
+    largest top-level transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.
+   </para>
+
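+   <para>
+    Conceptually, the selection scans the top-level transactions tracked by
+    the reorder buffer and picks the one currently using the most memory. A
+    simplified sketch (not the actual implementation) might look like this:
+<programlisting>
+static ReorderBufferTXN *
+pick_largest_txn(ReorderBuffer *rb)
+{
+    dlist_iter  iter;
+    ReorderBufferTXN *largest = NULL;
+
+    dlist_foreach(iter, &amp;rb->toplevel_by_lsn)
+    {
+        ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node,
+                                                iter.cur);
+
+        if (largest == NULL || txn->size > largest->size)
+            largest = txn;
+    }
+    return largest;
+}
+</programlisting>
+   </para>
+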
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index e3da7d3..ec40755 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/change/commit/abort
+	 * callbacks. The message and truncate callbacks are optional, similar
+	 * to regular output plugins. However, we enable streaming when at
+	 * least one of the methods is set, so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional, so
+	 * when they are missing we do not fail with an ERROR; the wrappers
+	 * simply do nothing. We must still set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there would crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -860,6 +908,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f..f24e246 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..0d0a94a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9a3f045..15bb5ed 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -356,6 +356,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -395,6 +441,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

v9-0006-Support-logical_decoding_work_mem-set-from-create.patchapplication/octet-stream; name=v9-0006-Support-logical_decoding_work_mem-set-from-create.patchDownload
From 0a7273e0d77f4d90d28836db1e4282e9078476da Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 11:51:04 +0530
Subject: [PATCH v9 06/12] Support logical_decoding_work_mem set from create
 subscription command

---
 doc/src/sgml/config.sgml                           | 21 +++++++++++
 doc/src/sgml/ref/create_subscription.sgml          | 12 ++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/commands/subscriptioncmds.c            | 44 +++++++++++++++++++---
 .../libpqwalreceiver/libpqwalreceiver.c            |  3 ++
 src/backend/replication/logical/worker.c           |  1 +
 src/backend/replication/pgoutput/pgoutput.c        | 30 ++++++++++++++-
 src/include/catalog/pg_subscription.h              |  3 ++
 src/include/replication/walreceiver.h              |  1 +
 9 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f8..a5d1675 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-work-mem" xreflabel="logical_decoding_work_mem">
+      <term><varname>logical_decoding_work_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_work_mem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to be used by logical decoding,
+        before some of the decoded changes are written to local disk.
+        This limits the amount of memory used by logical streaming replication
+        connections. It defaults to 64 megabytes (<literal>64MB</literal>).
+        Since each replication connection only uses a single buffer of this size,
+        and an installation normally doesn't have many such connections
+        concurrently (as limited by <varname>max_wal_senders</varname>), it's
+        safe to set this value significantly higher than <varname>work_mem</varname>,
+        reducing the amount of decoded changes written to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..91790b0 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index f77a83b..5cd1daa 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 119a9ce..d10f330 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, int *logical_wm,
+						   bool *logical_wm_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (logical_wm)
+		*logical_wm_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0 && logical_wm)
+		{
+			if (*logical_wm_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*logical_wm_given = true;
+			*logical_wm = defGetInt32(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	int			logical_wm;
+	bool		logical_wm_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &logical_wm, &logical_wm_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (logical_wm_given)
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(logical_wm);
+	else
+		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -668,10 +689,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				int			logical_wm;
+				bool		logical_wm_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &logical_wm, &logical_wm_given);
 
 				if (slotname_given)
 				{
@@ -696,6 +720,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (logical_wm_given)
+				{
+					values[Anum_pg_subscription_subworkmem - 1] =
+						Int32GetDatum(logical_wm);
+					replaces[Anum_pg_subscription_subworkmem - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -707,7 +738,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -745,7 +777,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -782,7 +814,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..896ddab 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		appendStringInfo(&cmd, ", work_mem '%d'",
+						 options->proto.logical.work_mem);
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 7a5471f..48b960c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1745,6 +1745,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.work_mem = MySubscription->workmem;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7525082..536722b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -18,6 +18,7 @@
 #include "replication/logicalproto.h"
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
+#include "utils/guc.h"
 #include "utils/int8.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
@@ -87,11 +88,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, int *logical_decoding_work_mem)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		work_mem_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +139,29 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "work_mem") == 0)
+		{
+			int64	parsed;
+
+			if (work_mem_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			work_mem_given = true;
+
+			if (!scanint8(strVal(defel->arg), true, &parsed))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid work_mem")));
+
+			if (parsed > PG_INT32_MAX || parsed < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("work_mem \"%s\" out of range",
+								strVal(defel->arg))));
+
+			*logical_decoding_work_mem = (int)parsed;
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -171,7 +196,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&logical_decoding_work_mem);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..3394379 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	int32		subworkmem;		/* Memory to use to decode changes. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	int			workmem;		/* Memory to decode changes. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6..4c7acfb 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -169,6 +169,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			int			work_mem;	/* Memory limit to use for decoding */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
-- 
1.8.3.1

v9-0007-Add-support-for-streaming-to-built-in-replication.patchapplication/octet-stream; name=v9-0007-Add-support-for-streaming-to-built-in-replication.patchDownload
From 220a1a49543d23ac2d9fd7e038aa23f2e7adfd6d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 13 Jan 2020 14:24:39 +0530
Subject: [PATCH v9 07/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transaction by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so there
is nowhere to send the data anyway.
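
As a sketch (using the message type bytes defined in proto.c below),
a transaction streamed in two chunks would appear on the wire roughly
like this:

    'S' stream start (first segment), changes, 'E' stream stop
    'S' stream start, more changes, 'E' stream stop
    'c' stream commit (or 'A' stream abort of a subtransaction)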
---
 doc/src/sgml/ref/alter_subscription.sgml           |    5 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   60 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    8 +-
 src/backend/replication/logical/launcher.c         |    2 +
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  157 ++-
 src/backend/replication/logical/worker.c           | 1031 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  310 +++++-
 src/backend/replication/slotfuncs.c                |    7 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2074 insertions(+), 42 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 6dfb2e4..e1fb907 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 91790b0..d9abf5e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -218,6 +218,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 5cd1daa..1dc486c 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->workmem = subform->subworkmem;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index d10f330..a4f960c 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
 						   bool *refresh, int *logical_wm,
-						   bool *logical_wm_given)
+						   bool *logical_wm_given, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -92,6 +93,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*refresh = true;
 	if (logical_wm)
 		*logical_wm_given = false;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -186,6 +189,26 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 
 			*logical_wm_given = true;
 			*logical_wm = defGetInt32(defel);
+
+			/*
+			 * Check that the value is not smaller than 64kB (which is
+			 * the minimum value for logical_decoding_work_mem).
+			 */
+			if (*logical_wm < 64)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("%d is outside the valid range for parameter \"work_mem\" (64 .. 2147483647)",
+								*logical_wm)));
+		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
 		}
 		else
 			ereport(ERROR,
@@ -332,6 +355,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	int			logical_wm;
 	bool		logical_wm_given;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -348,7 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL, &logical_wm, &logical_wm_given);
+							   NULL, &logical_wm, &logical_wm_given,
+							   &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -430,7 +456,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 		values[Anum_pg_subscription_subworkmem - 1] =
 			Int32GetDatum(logical_wm);
 	else
-		nulls[Anum_pg_subscription_subworkmem - 1] = true;
+		values[Anum_pg_subscription_subworkmem - 1] =
+			Int32GetDatum(-1);
+
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
 
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
@@ -691,11 +725,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				int			logical_wm;
 				bool		logical_wm_given;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
 										   NULL, &synchronous_commit, NULL,
-										   &logical_wm, &logical_wm_given);
+										   &logical_wm, &logical_wm_given,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -727,6 +764,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subworkmem - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -739,7 +783,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
 										   NULL, NULL, NULL, NULL, NULL,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -777,7 +821,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh, NULL, NULL);
+										   NULL, &refresh, NULL, NULL,
+										   NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -814,7 +859,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7169509..eb00242 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4102,6 +4102,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 896ddab..1b8303c 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,8 +408,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
-		appendStringInfo(&cmd, ", work_mem '%d'",
-						 options->proto.logical.work_mem);
+		if (options->proto.logical.work_mem != -1)
+			appendStringInfo(&cmd, ", work_mem '%d'",
+							 options->proto.logical.work_mem);
+
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
 
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e..e80d00c 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>
 
 #include "postgres.h"
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ec40755..6e38eac 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1146,7 +1146,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1191,7 +1191,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..93780b2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,13 +257,18 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
+	pq_sendbyte(out, 'D');		/* action DELETE */
+
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
-	pq_sendbyte(out, 'D');		/* action DELETE */
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
 
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,119 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgint(in, 4) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're stopping a stream, so it must be valid) */
+	pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+	TransactionId xid;
+
+	xid = pq_getmsgint(in, 4);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (we're committing a streamed transaction, so it must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction IDs (we're aborting a streamed (sub)transaction, so they must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
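
A note for reviewers, not part of the patch: the new messages are framed like
the existing ones - an action byte followed by the fields written by the
functions above, with integers in network byte order. As an example, a
standalone reader of the STREAM START message (assuming a raw byte buffer
instead of a StringInfo) could look like this:

    #include <stdint.h>

    /* pq_sendint32 writes big-endian, so decode accordingly */
    static uint32_t
    read_be32(const unsigned char *p)
    {
        return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
               ((uint32_t) p[2] << 8) | (uint32_t) p[3];
    }

    /* buf points right after the 'S' action byte */
    static void
    decode_stream_start(const unsigned char *buf,
                        uint32_t *xid, int *first_segment)
    {
        *xid = read_be32(buf);                      /* toplevel XID */
        *first_segment = (read_be32(buf + 4) == 1); /* 1 on first chunk */
    }

The other control messages follow the same pattern: STREAM END is 'E' plus
the XID, STREAM ABORT is 'A' plus XID and subxact XID, and STREAM COMMIT is
'c' plus XID, a flags byte and the three 64-bit commit fields.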
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 48b960c..5c20c0e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead their changes are
+ * written to files and applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires dealing with aborts of both the toplevel transaction and its
+ * subtransactions. This is achieved by tracking the file offset of each
+ * subtransaction's first change, which allows truncating the file with
+ * serialized changes on a subtransaction abort.
+ *
+ * The files are placed in the temporary-file directory of the default
+ * tablespace, and the filenames include both the XID of the toplevel
+ * transaction and the OID of the subscription. This is necessary so that
+ * different workers processing a remote transaction with the same XID
+ * don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -60,6 +82,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -67,6 +90,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -106,12 +130,59 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -163,6 +234,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the message to a spool file for the proper toplevel
+ * transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
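
Illustration, not part of the patch: inside a streaming block every such
message arrives as action byte + subxact XID + payload, and after
handle_streamed_transaction() consumes the XID only the action and payload
are spooled, prefixed with a length. A hypothetical helper mirroring
stream_write_change() makes the record geometry explicit:

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Bytes appended to the spool file for one message, given the payload
     * size that remains after the subxact XID was consumed. The len field
     * itself is not counted in len; the action byte is.
     */
    static size_t
    spooled_record_size(size_t payload_len)
    {
        return sizeof(int32_t)      /* len field */
             + sizeof(char)         /* action byte */
             + payload_len;         /* message body, without the XID */
    }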
@@ -529,6 +636,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify the apply handlers that we're processing a streamed transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, restore the subxact info serialized
+	 * at the previous stream-stop (the file with changes was already opened
+	 * above, in append mode).
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive an abort
+		 * for a toplevel transaction we haven't seen any changes for?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting one of the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found PG_USED_FOR_ASSERTS_ONLY = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* FIXME optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/* We should not receive aborts for unknown subtransactions. */
+		Assert(found);
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply handlers invoked via apply_dispatch are aware
+	 * we're in a remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
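
A worked example of the abort handling above, with hypothetical XIDs and
offsets; the search mirrors apply_handle_stream_abort:

    #include <stdint.h>
    #include <sys/types.h>

    typedef struct SubXactInfoDemo
    {
        uint32_t    xid;
        off_t       offset;
    } SubXactInfoDemo;

    /*
     * Find the aborted subxact (scanning from the tail, like the patch),
     * drop it and all entries recorded after it, and return the offset at
     * which the caller should truncate the .changes file, or -1 if the
     * subxact is unknown.
     */
    static off_t
    abort_truncate_offset(SubXactInfoDemo *subxacts, uint32_t *nsubxacts,
                          uint32_t subxid)
    {
        uint32_t    i;

        for (i = *nsubxacts; i > 0; i--)
        {
            if (subxacts[i - 1].xid == subxid)
            {
                *nsubxacts = i - 1;
                return subxacts[i - 1].offset;
            }
        }
        return -1;
    }

With subxacts = {1001 @ 0, 1002 @ 16384, 1003 @ 65536}, an abort of subxact
1002 truncates the file to 16384 bytes and leaves nsubxacts = 1; anything
spooled after that offset can only belong to 1002 or to subxacts recorded
later (here 1003), so it is safe to discard.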
@@ -541,6 +960,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -556,6 +978,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -591,6 +1016,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -695,6 +1123,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -830,6 +1261,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -929,6 +1363,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1020,6 +1457,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1117,6 +1570,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids - 1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i]);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1132,6 +1601,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1580,6 +2052,564 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store offset of it's first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we free the memory allocated for the subxact info. There might be
+	 * one exceptional transaction with many subxacts, and we don't want to
+	 * keep that memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as in the previous call,
+	 * so just ignore it (its first change was already recorded).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the subxact array. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Clean up the XID from the array - find the XID and remove it by
+	 * moving the last element into its place. The array is bound to be
+	 * fairly small (maximum number of in-progress xacts, so
+	 * max_connections + max_prepared_transactions), so simply loop
+	 * through it to find the index of the XID.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a crash,
+	 * so it's entirely possible not to find the XID in the array here. In
+	 * that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect a few
+	 * of them in progress (max_connections + max_prepared_xacts), so a
+	 * linear search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. between
+ * stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with a length (not including
+ * the length field itself), an action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so would not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -1746,6 +2776,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.work_mem = MySubscription->workmem;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
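
Illustration, not part of the patch: the logical layout of the two
per-transaction files created by the worker. Both live in the default
tablespace's temporary-file directory, and integers are in native byte
order, as written by write():

    #include <stdint.h>

    /* logical-<subid>-<xid>.changes: a sequence of these records */
    typedef struct SpooledChangeDemo
    {
        int32_t     len;        /* action + payload size; excludes len */
        char        action;     /* 'I', 'U', 'D', 'T', 'R', 'Y', ... */
        /* len - 1 payload bytes follow, without the subxact XID */
    } SpooledChangeDemo;

    /* logical-<subid>-<xid>.subxacts: rewritten as a whole at stream-stop */
    typedef struct SubXactFileDemo
    {
        uint32_t    checksum;   /* CRC32C over nsubxacts and the array */
        uint32_t    nsubxacts;
        /* nsubxacts (xid, offset) pairs follow */
    } SubXactFileDemo;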
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 536722b..ebe0423 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -45,17 +45,45 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order the transactions are sent in. So streamed transactions are
+ * tracked separately, by remembering the toplevel XIDs that have already
+ * received the schema (the streamed_txns list below).
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
+	List	   *streamed_txns;	/* streamed toplevel transactions that
+								 * already received this schema */
 	bool		replicate_valid;
 	PublicationActions pubactions;
 } RelationSyncEntry;
@@ -64,11 +92,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -84,16 +118,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, int *logical_decoding_work_mem)
+						List **publication_names, int *logical_decoding_work_mem,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		work_mem_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -162,6 +206,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*logical_decoding_work_mem = (int)parsed;
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -174,6 +235,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -197,7 +259,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&logical_decoding_work_mem);
+								&logical_decoding_work_mem,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -217,6 +280,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficiently recent protocol
+		 * version, and only when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -284,9 +368,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for this change. We don't
+	 * care whether it's a toplevel transaction or not (we have already sent
+	 * the toplevel XID when starting the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because their changes are applied only at commit time
+	 * (so regular transactions won't see their effects until then), and
+	 * possibly in an order we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to resend the schema after each catalog change,
+		 * and such changes may occur after streaming has already started,
+		 * so we have to track new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -312,19 +435,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			set_schema_sent_in_streamed_txn(relentry, topxid);
+		else
+			relentry->schema_sent = true;
 	}
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -333,6 +463,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -361,14 +495,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -378,7 +512,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -387,7 +521,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -413,6 +547,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -433,13 +571,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -513,6 +652,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify the downstream to discard the streamed transaction (along with
+ * all its subtransactions, if it's a toplevel abort).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify the downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -549,6 +773,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions, so a simple
+ * linear search of the list is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  */
 static RelationSyncEntry *
@@ -623,6 +875,36 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -657,7 +939,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
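
A sketch for authors of other output plugins - the callback names here are
invented and the bodies elided, so this is an outline rather than working
code: streaming is opt-in, and a plugin that leaves the stream_* callbacks
unset presumably keeps ctx->streaming disabled, falling back to spilling
large transactions to disk:

    void
    _PG_output_plugin_init(OutputPluginCallbacks *cb)
    {
        cb->startup_cb = my_startup;
        cb->begin_cb = my_begin;
        cb->change_cb = my_change;
        cb->truncate_cb = my_truncate;
        cb->commit_cb = my_commit;
        cb->shutdown_cb = my_shutdown;

        /* streaming support, mirroring the pgoutput hunk above */
        cb->stream_start_cb = my_stream_start;
        cb->stream_stop_cb = my_stream_stop;
        cb->stream_abort_cb = my_stream_abort;
        cb->stream_commit_cb = my_stream_commit;
        cb->stream_change_cb = my_change;      /* may reuse the DML handler */
        cb->stream_truncate_cb = my_truncate;
    }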
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2c9d5de..30db2c2 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,13 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index abb533b..6ee7fa2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -968,6 +968,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3394379..18f416f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -50,6 +50,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	int32		subworkmem;		/* Memory to use to decode changes. */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -76,6 +78,7 @@ typedef struct Subscription
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	int			workmem;		/* Memory to decode changes. */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
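
As a map for reviewers, this is how the new flag travels end-to-end through
the hunks in this patch (the DDL that actually sets substream presumably
lives in another part of the series):

    /*
     * pg_subscription.substream                  (catalog column, above)
     *   -> Subscription.stream                   (syscache struct, above)
     *     -> options.proto.logical.streaming     (worker.c / walreceiver.h)
     *       -> "streaming" option of START_REPLICATION   (pgoutput)
     *         -> parse_output_parameters() -> ctx->streaming
     */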
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index aecb601..146d7c4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -945,7 +945,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 2cc2dc4..ade4188 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out, TransactionId xid);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
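
The version bump deserves a note: streaming is gated on both the protocol
version and the streaming option. The behaviour enforced in pgoutput_startup
boils down to this matrix (an illustration, not part of the patch):

    /*
     * proto_version   streaming option   result
     * -------------   ----------------   -------------------------------
     *       1         off / absent       works, never streams
     *       1         on                 ERROR: needs proto_version >= 2
     *       2         off / absent       works, never streams
     *       2         on                 streaming enabled, provided the
     *                                    plugin/context supports it
     */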
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4c7acfb..54054a4 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -170,6 +170,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			int			work_mem;	/* Memory limit to use for decoding */
+			bool		streaming;	/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction containing DDL, subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: v9-0008-Track-statistics-for-streaming.patch (application/octet-stream)
From b898ea48662e3af274c248c8f9943086ec749f3e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Jan 2020 09:45:27 +0530
Subject: [PATCH v9 08/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8839699..2c7089e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2004,6 +2004,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of in-progress transactions streamed to subscriber after
+      memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+      Streaming only works with toplevel transactions (subtransactions can't
+      be streamed independently), so the counter does not get incremented for
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of times in-progress transactions were streamed to subscriber.
+      Transactions may get streamed repeatedly, and this counter gets incremented
+      on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to
+      subscriber, in bytes.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
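
(To make the new columns concrete: a monitoring client reads them like
the existing spill_* fields. A minimal libpq sketch; the connection
string is an assumption and error handling is elided.)

#include <stdio.h>
#include "libpq-fe.h"

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;
	int			i;

	/* read the streaming counters added by this patch */
	res = PQexec(conn,
				 "SELECT application_name, stream_txns, stream_count, "
				 "stream_bytes FROM pg_stat_replication");

	for (i = 0; i < PQntuples(res); i++)
		printf("%s: txns=%s count=%s bytes=%s\n",
			   PQgetvalue(res, i, 0), PQgetvalue(res, i, 1),
			   PQgetvalue(res, i, 2), PQgetvalue(res, i, 3));

	PQclear(res);
	PQfinish(conn);
	return 0;
}
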
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c9e6060..d6f07d6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -786,7 +786,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f68b2e4..a3c4509 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3267,6 +3271,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Count each transaction only once, on its first streaming. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 6ee7fa2..21c4da0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1292,7 +1292,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1313,7 +1313,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or were
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2356,6 +2357,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3194,7 +3198,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3251,6 +3255,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3274,6 +3281,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3360,6 +3370,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* statistics for streaming of over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3608,11 +3623,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2228256..969fa9e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5193,9 +5193,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index adb8f9d..5e1337e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -519,15 +519,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 366828f..3888b0c 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2ab2115..eed0088 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1983,9 +1983,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1

Attachment: v9-0010-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch (application/octet-stream)
From 263065276ed552bcc782dd637df42f87e9adaee3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH v9 10/12] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index a3c4509..2004d6a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1327,6 +1327,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
+		/* make sure subtxn has final_lsn */
+		if (subtxn->final_lsn == InvalidXLogRecPtr)
+			subtxn->final_lsn = txn->final_lsn;
+
 		/*
 		 * Subtransactions are always associated to the toplevel TXN, even if
 		 * they originally were happening inside another subtxn, so we won't
-- 
1.8.3.1

Attachment: v9-0009-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From 29ee9abb78efc7e6b9b391d89dfc77794743f178 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v9 09/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71..086d0c7 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e..ad3ed13 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

Attachment: v9-0011-Add-TAP-test-for-streaming-vs.-DDL.patch (application/octet-stream)
From b16b084d53cc3c5c15eed454806433a3211c3872 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v9 11/12] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: v9-0012-Bugfix-handling-of-incomplete-toast-tuple.patch (application/octet-stream)
From c70b5d93f5018e4c0920feefb8615a37a83a147e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 30 Jan 2020 14:21:04 +0530
Subject: [PATCH v9 12/12] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c                |   3 +
 src/backend/replication/logical/decode.c        |  17 ++-
 src/backend/replication/logical/reorderbuffer.c | 145 +++++++++++++-----------
 src/include/access/heapam_xlog.h                |   1 +
 src/include/replication/reorderbuffer.h         |  17 ++-
 5 files changed, 110 insertions(+), 73 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 24d0d7a..faaaf67 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2018,6 +2018,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 13a11ac..4ee528f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -734,7 +734,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -801,7 +803,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -858,7 +861,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -894,7 +898,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -999,7 +1003,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1037,7 +1041,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2004d6a..524a66e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -650,7 +650,7 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
@@ -664,6 +664,28 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert, set the corresponding bit.  Otherwise, if
+	 * the toast insert bit is already set and this is an insert/update, clear
+	 * the bit (the toast chunks for that tuple are now complete).
+	 */
+	if (toast_insert)
+		txn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) &&
+			 ((change->action == REORDER_BUFFER_CHANGE_INSERT) ||
+			 (change->action == REORDER_BUFFER_CHANGE_UPDATE)))
+		txn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * If this is a speculative insert, set the corresponding bit.  Otherwise,
+	 * if the speculative insert bit is set and this is a spec confirm record,
+	 * clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		txn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+		txn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
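
(The rbtxn_has_toast_insert()/rbtxn_has_spec_insert() helpers used above
are added to reorderbuffer.h elsewhere in this patch, per the diffstat.
Plausible definitions, with placeholder bit values, look like this:)

#define RBTXN_HAS_TOAST_INSERT	0x0010	/* placeholder bit value */
#define RBTXN_HAS_SPEC_INSERT	0x0020	/* placeholder bit value */

#define rbtxn_has_toast_insert(txn) \
	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)

#define rbtxn_has_spec_insert(txn) \
	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
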
 
@@ -696,7 +718,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1870,8 +1892,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
+						ReorderBufferToastAppendChunk(rb, txn, relation,
+													  change);
 					}
 
 			change_done:
@@ -2457,7 +2479,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2506,7 +2528,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2529,6 +2551,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2544,7 +2567,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 	/* if subxact, and streaming supported, use the toplevel instead */
 	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+		toptxn = txn->toptxn;
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2552,12 +2575,16 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+		if (toptxn)
+			toptxn->size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+		if (toptxn)
+			toptxn->size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2623,7 +2650,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
 	change->data.inval.msg = msg;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2810,15 +2837,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and doesn't have incomplete
+		 * data, remember it.
+		 */
+		if (((!largest) || (txn->size > largest->size)) &&
+			((txn->size > 0) && !rbtxn_has_toast_insert(txn) &&
+			 !rbtxn_has_spec_insert(txn)))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2836,66 +2864,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/* Loop until we are below the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we found a non-empty toplevel transaction to stream */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
 		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
-
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
-
-		ReorderBufferSerializeTXN(rb, txn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5e1337e..a2646c5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -174,6 +174,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -193,6 +195,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* Does the transaction have a toast insert without the main table insert? */
+#define rbtxn_has_toast_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)
+
+/*
+ * Does the transaction have a speculative insert without the corresponding
+ * speculative confirm record?
+ */
+#define rbtxn_has_spec_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -547,7 +560,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

#218Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#211)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jan 31, 2020 at 8:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Also, if we need to copy the snapshot here, then do we need to copy it
again in ReorderBufferProcessTXN (in the below code and in the catch
block of the same function)?

I think so, because as part of the
"REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change we might point
directly at the snapshot, and that will get truncated when we truncate
all the changes of the ReorderBufferTXN. So I think we can check
whether snapshot_now->copied is true: if so we can avoid copying,
otherwise we copy?

Yeah, that makes sense, but I think then we also need to ensure that
ReorderBufferStreamTXN frees the snapshot only when it is copied. It
seems to me it should always be copied at the place where we are
trying to free it, so probably we should have an Assert there.
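Something like this, perhaps (only a sketch, using the existing
ReorderBufferCopySnap/ReorderBufferFreeSnap helpers from
reorderbuffer.c):

/* copy the snapshot only if it is not already our own copy */
if (!snapshot_now->copied)
	snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
										 txn, command_id);

/* ... and when cleaning up after streaming ... */
Assert(snapshot_now->copied);	/* we should only ever free our copy */
ReorderBufferFreeSnap(rb, snapshot_now);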

One more thing:
ReorderBufferProcessTXN()
{
..
+ if (streaming)
+ {
+ /*
+ * While streaming an in-progress transaction there is a
+ * possibility that the (sub)transaction might get aborted
+ * concurrently.  In such a case, if the (sub)transaction has a
+ * catalog update then we might decode a tuple using the wrong
+ * catalog version.  So to detect a concurrent abort we set
+ * CheckXidAlive to the xid of the (sub)transaction to which this
+ * change belongs.  During a catalog scan we can then check the
+ * status of that xid, and if it is aborted we report a specific
+ * error which we can ignore.  We might have already streamed some
+ * of the changes for the aborted (sub)transaction, but that is
+ * fine, because when we decode the abort we will stream an abort
+ * message to truncate the changes on the subscriber.
+ */
+ CheckXidAlive = change->txn->xid;
+ }
..
}

I think it is better to move the above code into an inline function
(something like SetXidAlive). It will make the code in function
ReorderBufferProcessTXN look cleaner and easier to understand.
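Something as simple as this should do (a sketch):

static inline void
SetXidAlive(TransactionId xid)
{
	/*
	 * Remember the xid whose changes we are about to apply, so that
	 * catalog scans can detect a concurrent abort of this (sub)xact.
	 */
	CheckXidAlive = xid;
}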

Fixed in the latest version sent upthread.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#219Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#216)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Feb 5, 2020 at 9:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Feb 4, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

One more thing we can do is to identify whether the tuple belongs to
toast relation while decoding it. However, I think to do that we need
to have access to relcache at that time and that might add some
overhead as we need to do that for each tuple. Can we investigate
what it will take to do that and if it is better than setting a bit
during WAL logging.

I have done some more analysis on this and it appears that there are a
few problems in doing this. Basically, once we get the confirmed
flush location, we advance the replication_slot_catalog_xmin so that
vacuum can garbage-collect the old tuples. So the problem is that
while we are collecting the changes in the ReorderBuffer, our catalog
version might have been removed, and we might not find any relation
entry for that relfilenode id (because it was dropped or altered
later).

Hmm, this means this can also occur while streaming the changes. The
main reason, as I understand it, is that before decoding the commit we
don't know whether these changes have already been sent to the
subscriber (based on confirmed_flush_location/start_decoding_at).

Right.

I think it is better to skip streaming such transactions, as we can't
make the right decision about them; and since this generally happens
only for the first few transactions after a crash, it shouldn't matter
much if we serialize such transactions instead of streaming them.
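E.g. something like this in the "can we stream this transaction?"
check (a sketch; SnapBuildXactNeedsSkip already tells us whether an
LSN is before start_decoding_at):

/* can't safely stream changes that may already have been sent */
if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, txn->first_lsn))
	return false;		/* fall back to serializing to disk */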

I think the idea makes sense to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#220Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#218)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Fixed in the latest version sent upthread.

Okay, thanks. I haven't looked at the latest version of the patch
series, as I was reviewing the previous version, and I think all of
these comments apply to parts that have not been modified. Here are my
comments:

I think we don't need to maintain
v8-0007-Support-logical_decoding_work_mem-set-from-create as per
discussion in one of the above emails [1], as its usage is not clear.

v8-0008-Add-support-for-streaming-to-built-in-replication
1.
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.

As per the discussion above [1], I don't think we need work_mem here.
You might want to remove the other usage from the patch as well.

2.
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool
*connect, bool *enabled_given,
     bool *slot_name_given, char **slot_name,
     bool *copy_data, char **synchronous_commit,
     bool *refresh, int *logical_wm,
-    bool *logical_wm_given)
+    bool *logical_wm_given, bool *streaming,
+    bool *streaming_given)

It is not clear to me why we need two parameters 'streaming' and
'streaming_given' in this API. Can't we handle it similarly to the
'refresh' parameter?

3.
diff --git a/src/backend/replication/logical/launcher.c
b/src/backend/replication/logical/launcher.c
index aec885e..e80d00c 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>

#include "postgres.h"

I see only the above change in launcher.c. Why do we need to include
these if there is no other change (at least not in this patch)?

4.
stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
  /* Push callback + info on the error context stack */
  state.ctx = ctx;
  state.callback_name = "stream_start";
- /* state.report_location = apply_lsn; */
+ state.report_location = InvalidXLogRecPtr;
  errcallback.callback = output_plugin_error_callback;
  errcallback.arg = (void *) &state;
  errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn)
  /* Push callback + info on the error context stack */
  state.ctx = ctx;
  state.callback_name = "stream_stop";
- /* state.report_location = apply_lsn; */
+ state.report_location = InvalidXLogRecPtr;
  errcallback.callback = output_plugin_error_callback;
  errcallback.arg = (void *) &state;
  errcallback.previous = error_context_stack;

Don't we want to set txn->final_lsn as the report location, as we do
at a few other places?

5.
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+ Relation rel, HeapTuple oldtuple)
 {
+ pq_sendbyte(out, 'D'); /* action DELETE */
+
  Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
     rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
     rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);

- pq_sendbyte(out, 'D'); /* action DELETE */

Why does this patch need to change the above code?

6.
+void
+logicalrep_write_stream_start(StringInfo out,
+   TransactionId xid, bool first_segment)
+{
+ pq_sendbyte(out, 'S'); /* action STREAM START */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+
+ /* 1 if this is the first streaming segment for this xid */
+ pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+ TransactionId xid;
+
+ Assert(first_segment);
+
+ xid = pq_getmsgint(in, 4);
+ *first_segment = (pq_getmsgint(in, 4) == 1);
+
+ return xid;
+}

In these functions for sending bool, pq_sendint32 is used. Can't we
use pq_sendbyte similar to what we do in boolsend?

7.
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+ pq_sendbyte(out, 'E'); /* action STREAM END */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+}

In comments, 'starting to stream' is mentioned whereas this function
is to stop it.

8.
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+ pq_sendbyte(out, 'E'); /* action STREAM END */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+ TransactionId xid;
+
+ xid = pq_getmsgint(in, 4);
+
+ return xid;
+}

Is there a reason to send xid on stopping stream? I don't see any use
of function logicalrep_read_stream_stop.

9.
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
{
..
+ pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
..
+ pgstat_report_wait_end();
..
}

I see the calls to pgstat_report_wait_start/pgstat_report_wait_end in
this function, so not sure if the above comment makes sense.

10.
+ * The files are placed in /tmp by default, and the filenames include both
+ * the XID of the toplevel transaction and OID of the subscription.

Are we keeping files in /tmp or in pg's temp tablespace dir?  Seeing
the below code, it doesn't seem that we place them in /tmp.  If I am
correct, then can you update the comment?

+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+ char tempdirpath[MAXPGPATH];
+
+ TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
11.
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ *
..
+ */
+static void
+stream_write_change(char action, StringInfo s)

The part of the comment which says "with length (not including the
length) .." is not clear to me. What does "not including the length"
mean?

12.
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)

I think we can implement this TODO. It is clear that when this function
is called from apply_handle_stream_commit, the file must exist. We can
similarly analyze the other callers of this API.

13.
+apply_handle_stream_abort(StringInfo s)
{
..
+ /* FIXME optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
..

I am not sure how important this optimization is, so instead of FIXME,
it is better to keep it as an XXX comment. In the future, if we hit
any performance issue due to this, we can revisit our decision.

[1]: /messages/by-id/CAA4eK1LH7xzF+-qHRv9EDXQTFYjPUYZw5B7FSK9QLEg7F603OQ@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#221Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#217)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Feb 5, 2020 at 9:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I am not able to understand the change in
v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup. Do you have
any explanation for the same?

It appears that in ReorderBufferCommitChild we are always setting the
final_lsn of the subxacts, so it should not be invalid. For testing, I
changed this to an assert and checked, but it never hit. So maybe we
can remove this change.
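I.e. something like this in ReorderBufferCleanupTXN, in place of the
fixup (a sketch of what I tested):

/* by this point every subxact should have a valid final_lsn */
Assert(subtxn->final_lsn != InvalidXLogRecPtr);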

Tomas, do you remember anything about this change? We are talking
about below change:

From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH v8 11/13] BUGFIX: set final_lsn for subxacts before cleanup

---
src/backend/replication/logical/reorderbuffer.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c
b/src/backend/replication/logical/reorderbuffer.c
index fe4e57c..beb6cd2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1327,6 +1327,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);

+ /* make sure subtxn has final_lsn */
+ if (subtxn->final_lsn == InvalidXLogRecPtr)
+ subtxn->final_lsn = txn->final_lsn;
+

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#222Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#220)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Fixed in the latest version sent upthread.

Okay, thanks. I haven't looked at the latest version of the patch
series, as I was reviewing the previous version, and I think all of
these comments apply to parts that have not been modified. Here are my
comments:

I think we don't need to maintain
v8-0007-Support-logical_decoding_work_mem-set-from-create as per
discussion in one of the above emails [1] as its usage is not clear.

v8-0008-Add-support-for-streaming-to-built-in-replication
1.
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.

As per the discussion above [1], I don't think we need work_mem here.
You might want to remove the other usage from the patch as well.

After putting more thought into this, it appears that there could be
some use cases for setting work_mem from the subscription. Assume a
case where data is coming from two different origins, and based on the
origin ids different slots might collect different types of changes.
So isn't it good to have different work_mem for different slots? I am
not saying that the current way of implementing it is the best one,
but we can improve it. First, we need to decide whether we have a use
case for this or not. Please let me know your thoughts on the same.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#223Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#222)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Feb 7, 2020 at 4:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Fixed in the latest version sent upthread.

Okay, thanks. I haven't looked at the latest version of patch series
as I was reviewing the previous version and I think all of these
comments are in the patch which is not modified. Here are my
comments:

I think we don't need to maintain
v8-0007-Support-logical_decoding_work_mem-set-from-create as per
discussion in one of the above emails [1] as its usage is not clear.

v8-0008-Add-support-for-streaming-to-built-in-replication
1.
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.

As per the discussion above [1], I don't think we need work_mem here.
You might want to remove the other usage from the patch as well.

After putting more thought into this, it appears that there could be
some use cases for setting work_mem from the subscription. Assume a
case where data is coming from two different origins, and based on the
origin ids different slots might collect different types of changes.
So isn't it good to have different work_mem for different slots? I am
not saying that the current way of implementing it is the best one,
but we can improve it. First, we need to decide whether we have a use
case for this or not.

That is the whole point. I don't see a very clear usage for this, and
nobody has explained clearly how it will be useful. I am not denying
that what you are describing may have a use, but as you said, we might
need to invent an entirely new way even if we have such a use. I think
it is better to avoid requirements which are not essential for this
patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#224Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#223)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Feb 10, 2020 at 1:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 7, 2020 at 4:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Fixed in the latest version sent upthread.

Okay, thanks. I haven't looked at the latest version of the patch
series, as I was reviewing the previous version, and I think all of
these comments apply to parts that have not been modified. Here are my
comments:

I think we don't need to maintain
v8-0007-Support-logical_decoding_work_mem-set-from-create as per
discussion in one of the above emails [1] as its usage is not clear.

v8-0008-Add-support-for-streaming-to-built-in-replication
1.
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.

As per the discussion above [1], I don't think we need work_mem here.
You might want to remove the other usage from the patch as well.

After putting more thought into this, it appears that there could be
some use cases for setting work_mem from the subscription. Assume a
case where data is coming from two different origins, and based on the
origin ids different slots might collect different types of changes.
So isn't it good to have different work_mem for different slots? I am
not saying that the current way of implementing it is the best one,
but we can improve it. First, we need to decide whether we have a use
case for this or not.

That is the whole point. I don't see a very clear usage for this, and
nobody has explained clearly how it will be useful. I am not denying
that what you are describing may have a use, but as you said, we might
need to invent an entirely new way even if we have such a use. I think
it is better to avoid requirements which are not essential for this
patch.

Ok, I will include this change in the next patch set.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#225Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#220)
10 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I think we don't need to maintain
v8-0007-Support-logical_decoding_work_mem-set-from-create as per
discussion in one of the above emails [1] as its usage is not clear.

Done

v8-0008-Add-support-for-streaming-to-built-in-replication
1.
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.

As per the discussion above [1], I don't think we need work_mem here.
You might want to remove the other usage from the patch as well.

Done

2.
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool
*connect, bool *enabled_given,
bool *slot_name_given, char **slot_name,
bool *copy_data, char **synchronous_commit,
bool *refresh, int *logical_wm,
-    bool *logical_wm_given)
+    bool *logical_wm_given, bool *streaming,
+    bool *streaming_given)

It is not clear to me why we need two parameters 'streaming' and
'streaming_given' in this API. Can't we handle it similarly to the
'refresh' parameter?

We need to update the streaming option in the system table, so if we
don't remember whether the user supplied a value, how will we know
whether to update this column or not? Or are you suggesting that we
should always mark it as updated? IMHO that is not a good idea.
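For illustration, the option parsing follows the existing
logical_wm/logical_wm_given pattern in parse_subscription_options
(a sketch, not the exact hunk):

else if (strcmp(defel->defname, "streaming") == 0 && streaming)
{
	if (*streaming_given)
		ereport(ERROR,
				(errcode(ERRCODE_SYNTAX_ERROR),
				 errmsg("conflicting or redundant options")));

	*streaming_given = true;
	*streaming = defGetBoolean(defel);
}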

3.
diff --git a/src/backend/replication/logical/launcher.c
b/src/backend/replication/logical/launcher.c
index aec885e..e80d00c 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
*
*-------------------------------------------------------------------------
*/
+#include <sys/types.h>
+#include <unistd.h>

#include "postgres.h"

I see only the above change in launcher.c. Why do we need to include
these if there is no other change (at least not in this patch)?

Removed

4.
stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
/* Push callback + info on the error context stack */
state.ctx = ctx;
state.callback_name = "stream_start";
- /* state.report_location = apply_lsn; */
+ state.report_location = InvalidXLogRecPtr;
errcallback.callback = output_plugin_error_callback;
errcallback.arg = (void *) &state;
errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn)
/* Push callback + info on the error context stack */
state.ctx = ctx;
state.callback_name = "stream_stop";
- /* state.report_location = apply_lsn; */
+ state.report_location = InvalidXLogRecPtr;
errcallback.callback = output_plugin_error_callback;
errcallback.arg = (void *) &state;
errcallback.previous = error_context_stack;

Don't we want to set txn->final_lsn as the report location, as we do
at a few other places?

Fixed
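The wrappers now report the transaction's final LSN, roughly (a
sketch; the exact hunk is in the updated patch set):

	state.report_location = txn->final_lsn;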

5.
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+ Relation rel, HeapTuple oldtuple)
{
+ pq_sendbyte(out, 'D'); /* action DELETE */
+
Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);

- pq_sendbyte(out, 'D'); /* action DELETE */

Why does this patch need to change the above code?

Fixed

6.
+void
+logicalrep_write_stream_start(StringInfo out,
+   TransactionId xid, bool first_segment)
+{
+ pq_sendbyte(out, 'S'); /* action STREAM START */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+
+ /* 1 if this is the first streaming segment for this xid */
+ pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+ TransactionId xid;
+
+ Assert(first_segment);
+
+ xid = pq_getmsgint(in, 4);
+ *first_segment = (pq_getmsgint(in, 4) == 1);
+
+ return xid;
+}

In these functions for sending bool, pq_sendint32 is used. Can't we
use pq_sendbyte similar to what we do in boolsend?

Done
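The flag is now sent as a single byte, along these lines (sketch):

	/* 1 if this is the first streaming segment for this xid */
	pq_sendbyte(out, first_segment ? 1 : 0);

	/* ... and on the read side ... */
	*first_segment = (pq_getmsgbyte(in) == 1);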

7.
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+ pq_sendbyte(out, 'E'); /* action STREAM END */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+}

In comments, 'starting to stream' is mentioned whereas this function
is to stop it.

Fixed

8.
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+ pq_sendbyte(out, 'E'); /* action STREAM END */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+ TransactionId xid;
+
+ xid = pq_getmsgint(in, 4);
+
+ return xid;
+}

Is there a reason to send xid on stopping stream? I don't see any use
of function logicalrep_read_stream_stop.

Removed

9.
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
{
..
+ pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
..
+ pgstat_report_wait_end();
..
}

I see the calls to pgstat_report_wait_start/pgstat_report_wait_end in
this function, so not sure if the above comment makes sense.

Fixed

10.
+ * The files are placed in /tmp by default, and the filenames include both
+ * the XID of the toplevel transaction and OID of the subscription.

Are we keeping files in /tmp or in pg's temp tablespace dir?  Seeing
the below code, it doesn't seem that we place them in /tmp.  If I am
correct, then can you update the comment?

+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+ char tempdirpath[MAXPGPATH];
+
+ TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);

Done
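The comment now matches the code -- the files go under pg's temp
directory, named after the subscription OID and toplevel XID, roughly
like this (a sketch; the exact filename format is an assumption):

static void
subxact_filename(char *path, Oid subid, TransactionId xid)
{
	char		tempdirpath[MAXPGPATH];

	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
			 tempdirpath, subid, xid);
}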

11.
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ *
..
+ */
+static void
+stream_write_change(char action, StringInfo s)

The part of the comment which says "with length (not including the
length) .." is not clear to me. What does "not including the length"
mean?

Basically, it says that the 4 bytes used for storing the length of the
total data do not include those 4 bytes themselves.
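So the per-change layout on disk is roughly (a sketch, assuming an
open BufFile 'stream_fd' for the changes file):

	int		len;

	/* action byte + payload, excluding these 4 bytes themselves */
	len = s->len + sizeof(char);
	BufFileWrite(stream_fd, &len, sizeof(len));
	BufFileWrite(stream_fd, &action, sizeof(char));
	BufFileWrite(stream_fd, s->data, s->len);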

12.
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)

I think we can implement this TODO. It is clear that when this function
is called from apply_handle_stream_commit, the file must exist. We can
similarly analyze the other callers of this API.

Done
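Callers now say whether a missing file is an error; the cleanup ends
up looking roughly like this (sketch only):

static void
stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
{
	char		path[MAXPGPATH];

	subxact_filename(path, subid, xid);

	if (unlink(path) < 0 && (errno != ENOENT || !missing_ok))
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not remove file \"%s\": %m", path)));
}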

13.
+apply_handle_stream_abort(StringInfo s)
{
..
+ /* FIXME optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
..

I am not sure how important this optimization is, so instead of FIXME,
it is better to keep it as an XXX comment. In the future, if we hit
any performance issue due to this, we can revisit our decision.

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v10-0002-Issue-individual-invalidations-with-wal_level-lo.patch
From 4c2e923c86919b6e681bde0ec25a876402da2c19 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH v10 02/10] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in
memory and writes them only once, at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c          | 50 +++++++++++++++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 23 ++++++++
 src/backend/replication/logical/reorderbuffer.c | 55 +++++++++++++++---
 src/backend/utils/cache/inval.c                 | 75 +++++++++++++++++++++++++
 src/include/access/xact.h                       | 18 +++++-
 src/include/replication/reorderbuffer.h         | 14 +++++
 7 files changed, 234 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index fbc5942..6191060 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,11 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +401,14 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs,
+								xlrec->dbId, xlrec->tsId,
+								xlrec->relcacheInitFileInval);
+	}
 }
 
 const char *
@@ -423,7 +436,44 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval)
+{
+	int			i;
+
+	if (relcacheInitFileInval)
+		appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
+						 dbId, tsId);
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index da32a4f..c9a64bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5997,6 +5997,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index a99fcaf..13a11ac 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,29 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/* XXX for now we're issuing invalidations one by one */
+				Assert(invals->nmsgs == 1);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->dbId, invals->tsId,
+											 invals->relcacheInitFileInval,
+											 invals->msgs[0]);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index aeebbf2..3d6cbcf 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -473,6 +473,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -1822,17 +1823,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2212,6 +2218,38 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn,
+							 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+							 SharedInvalidationMessage msg)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	Assert(xid != InvalidTransactionId);
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.dbId = dbId;
+	change->data.inval.tsId = tsId;
+	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
+	change->data.inval.msg = msg;
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2658,6 +2696,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -2765,6 +2804,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -3050,6 +3090,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..e0d04b9 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, also write individual invalidations into WAL, to
+ *	support decoding of in-progress transactions.  Until now it was enough to
+ *	log invalidations only at commit, because we only decoded the transaction
+ *	at commit time.  Only the catalog cache and relcache invalidations need
+ *	to be logged; there cannot be any active MVCC scan in logical decoding,
+ *	so the snapshot invalidations need not be logged.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,9 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -489,6 +499,18 @@ RegisterCatcacheInvalidation(int cacheId,
 {
 	AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   cacheId, hashValue, dbId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cc.id = (int8) cacheId;
+		msg.cc.dbId = dbId;
+		msg.cc.hashValue = hashValue;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -501,6 +523,18 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 {
 	AddCatalogInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								  dbId, catId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cat.id = SHAREDINVALCATALOG_ID;
+		msg.cat.dbId = dbId;
+		msg.cat.catId = catId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -511,6 +545,8 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 static void
 RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 {
+	bool		RelcacheInitFileInval = false;
+
 	AddRelcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
 
@@ -529,7 +565,22 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 	 * as well.  Also zap when we are invalidating whole relcache.
 	 */
 	if (relId == InvalidOid || RelationIdIsInInitFile(relId))
+	{
 		transInvalInfo->RelcacheInitFileInval = true;
+		RelcacheInitFileInval = true;
+	}
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.rc.id = SHAREDINVALRELCACHE_ID;
+		msg.rc.dbId = dbId;
+		msg.rc.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, RelcacheInitFileInval);
+	}
 }
 
 /*
@@ -1501,3 +1552,27 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval)
+{
+	xl_xact_invalidations xlrec;
+
+	/* prepare record */
+	memset(&xlrec, 0, sizeof(xlrec));
+	xlrec.dbId = MyDatabaseId;
+	xlrec.tsId = MyDatabaseTableSpace;
+	xlrec.relcacheInitFileInval = relcacheInitFileInval;
+	xlrec.nmsgs = nmsgs;
+
+	/* perform insertion */
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+	XLogRegisterData((char *) msgs,
+					 nmsgs * sizeof(SharedInvalidationMessage));
+	XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b38..6f2a583 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,22 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ *
+ * XXX Currently nmsgs=1 but that might change in the future.
+ */
+typedef struct xl_xact_invalidations
+{
+	Oid			dbId;			/* MyDatabaseId */
+	Oid			tsId;			/* MyDatabaseTableSpace */
+	bool		relcacheInitFileInval;	/* invalidate relcache init file */
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..9a3f045 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,16 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			Oid			dbId;	/* MyDatabaseId */
+			Oid			tsId;	/* MyDatabaseTableSpace */
+			bool		relcacheInitFileInval;	/* invalidate relcache init
+												 * file */
+			SharedInvalidationMessage msg;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +470,9 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  Oid dbId, Oid tsId, bool relcacheInitFileInval,
+								  SharedInvalidationMessage msg);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1

v10-0003-Extend-the-output-plugin-API-with-stream-methods.patch
From c73b48b0d2f023be736b53fdce13f73ceb10c4be Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v10 03/10] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes for large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..64f651f 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
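
With these callbacks in place, the streamed test_decoding output for a
large transaction looks roughly like this (XID made up, one line per
callback, matching the format strings above):

opening a streamed block for transaction TXN 508
streaming change for TXN 508
streaming change for TXN 508
...
closing a streamed block for transaction TXN 508
opening a streamed block for transaction TXN 508
streaming change for TXN 508
...
closing a streamed block for transaction TXN 508
committing streamed transaction TXN 508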
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db9686..ace21ec 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to the spill-to-disk behavior, streaming is triggered when the
+    total amount of changes decoded from the WAL (for all in-progress
+    transactions) exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting. At that point, the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
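
To illustrate the selection step described above, a sketch of picking
the largest toplevel transaction might look like the following. This is
illustrative only - it assumes the per-transaction "size" counter added
by the logical_work_mem part of this series, and the function name is
made up:

#include "replication/reorderbuffer.h"

static ReorderBufferTXN *
LargestTopTXN(ReorderBuffer *rb)
{
	dlist_iter	iter;
	Size		largest_size = 0;
	ReorderBufferTXN *largest = NULL;

	/* only toplevel transactions live on this list */
	dlist_foreach(iter, &rb->toplevel_by_lsn)
	{
		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node,
												iter.cur);

		if (txn->size > largest_size)
		{
			largest = txn;
			largest_size = txn->size;
		}
	}

	return largest;
}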
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index e3da7d3..ec40755 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the change/commit/abort/start/stop
+	 * callbacks. The message and truncate callbacks are optional, similarly
+	 * to regular output plugins. However, we consider streaming enabled as
+	 * soon as at least one of the callbacks is defined, so that missing
+	 * (required) callbacks can be easily identified.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -860,6 +908,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f..f24e246 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..0d0a94a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to the remote node from an
+ * in-progress transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to the remote node from an
+ * in-progress transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if the transaction is streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when done streaming a block of changes from an in-progress
+ * transaction to the remote node (may be called repeatedly, if the
+ * transaction is streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9a3f045..15bb5ed 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -356,6 +356,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -395,6 +441,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
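
The receiving side of this protocol then has a simple obligation:
buffer each streamed chunk per toplevel XID, and only act once the
final stream commit/abort arrives. Schematically (a receiver-side
sketch, not part of the patch - the message kinds merely mirror the
callbacks above, and all types and buffer_* helpers are made up):

#include <stdint.h>

typedef enum StreamMessageKind
{
	MSG_STREAM_START,			/* plugin called stream_start_cb */
	MSG_STREAM_CHANGE,			/* stream_change_cb */
	MSG_STREAM_STOP,			/* stream_stop_cb */
	MSG_STREAM_COMMIT,			/* stream_commit_cb */
	MSG_STREAM_ABORT			/* stream_abort_cb */
} StreamMessageKind;

typedef struct StreamMessage
{
	StreamMessageKind kind;
	uint32_t	xid;			/* toplevel XID, sent with every message */
	const char *data;			/* payload, for change messages */
} StreamMessage;

/* hypothetical helpers, e.g. backed by per-XID spool files */
static void buffer_open(uint32_t xid);
static void buffer_append(uint32_t xid, const char *data);
static void buffer_close(uint32_t xid);
static void buffer_replay_and_discard(uint32_t xid);
static void buffer_discard(uint32_t xid);

static void
handle_stream_message(const StreamMessage *msg)
{
	switch (msg->kind)
	{
		case MSG_STREAM_START:
			buffer_open(msg->xid);
			break;
		case MSG_STREAM_CHANGE:
			/* nothing is applied yet, the change is merely accumulated */
			buffer_append(msg->xid, msg->data);
			break;
		case MSG_STREAM_STOP:
			/* end of one chunk; more chunks or commit/abort may follow */
			buffer_close(msg->xid);
			break;
		case MSG_STREAM_COMMIT:
			/* the transaction is now known to have committed, replay it */
			buffer_replay_and_discard(msg->xid);
			break;
		case MSG_STREAM_ABORT:
			/* the transaction aborted, drop the buffered changes */
			buffer_discard(msg->xid);
			break;
	}
}

For simplicity this ignores subtransaction aborts, which would discard
only the part of the buffer belonging to the aborted subxact.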

v10-0005-Implement-streaming-mode-in-ReorderBuffer.patchapplication/octet-stream; name=v10-0005-Implement-streaming-mode-in-ReorderBuffer.patchDownload
From 6ed52d0d4ab1c17c0f423c478f670c0efa3a98fe Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 10 Jan 2020 18:03:27 -0300
Subject: [PATCH v10 05/10] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
in ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds a ReorderBufferTXN pointer in two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
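
For instance, a consumer holding a buffered change can locate both the
(sub)transaction and the toplevel transaction it belongs to like this
(a sketch; it relies on the txn and toptxn fields added by this patch,
and the function name is made up):

#include "replication/reorderbuffer.h"

static ReorderBufferTXN *
ToplevelOfChange(ReorderBufferChange *change)
{
	ReorderBufferTXN *txn = change->txn;	/* the (sub)xact of the change */

	/* toptxn is NULL for toplevel transactions, set for subxacts */
	return (txn->toptxn != NULL) ? txn->toptxn : txn;
}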
---
 src/backend/access/heap/heapam_visibility.c     |  38 +-
 src/backend/replication/logical/reorderbuffer.c | 710 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  36 ++
 3 files changed, 693 insertions(+), 91 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..160b167 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3ca960c..f68b2e4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -769,6 +782,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -864,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -987,7 +1035,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1023,6 +1071,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1037,6 +1088,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1320,6 +1374,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1345,8 +1408,93 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may see catalog
+ * tuples with a CID we have not decoded yet. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1354,9 +1502,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1495,63 +1640,75 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such case if the
+ * (sub)transaction has catalog update then we might decode the tuple using
+ * wrong catalog version.  So for detecting the concurrent abort we set
+ * CheckXidAlive to the current (sub)transaction's xid for which this change
+ * belongs to.  And, during catalog scan we can check the status of the xid and
+ * if it is aborted we will report an specific error which we can ignore.  We
+ * might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the abort we will
+ * stream abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether the xid aborted; that happens during catalog access.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
-	}
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true, the data is sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 
-	/* build data to be able to lookup the CommandIds of catalog tuples */
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1567,15 +1724,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1583,6 +1745,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1592,8 +1767,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1659,7 +1832,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1680,8 +1861,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1699,7 +1878,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1757,7 +1936,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1766,10 +1953,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1800,9 +1993,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +2015,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
 					}
 
 					break;
@@ -1860,14 +2054,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes; call the stream_stop callback for a
+		 * streamed transaction, the commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last LSN of the stream as the final lsn before calling
+			 * stream stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+
+			/* Avoid copying if it's already copied. */
+			if (snapshot_now->copied)
+				txn->snapshot_now = snapshot_now;
+			else
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1885,14 +2111,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,18 +2145,122 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+
+				/* Avoid copying if it's already copied. */
+				if (snapshot_now->copied)
+					txn->snapshot_now = snapshot_now;
+				else
+					txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+															  txn, command_id);
+				/*
+				 * Set the last LSN of the stream as the final lsn before
+				 * calling stream stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+
+				FlushErrorState();
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
 /*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then send the stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2360,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2502,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - one in the reorder buffer, and one in the
+ * transaction containing the change. The reorder buffer counter allows us
+ * to quickly decide if we reached the memory limit, while the transaction
+ * counter allows us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we update the toplevel transaction counters
+ * instead - we can't stream subtransactions individually anyway, and we
+ * only pick toplevel transactions for eviction, so only those counters
+ * matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2520,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2532,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if this is a subxact and streaming is supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2582,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2300,6 +2672,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2404,6 +2783,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming, so their size is always 0),
+ * but here we can simply iterate over the limited number of toplevel
+ * transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2423,15 +2834,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2734,6 +3176,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
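+/*
+ * Check whether the output plugin supports streaming, i.e. whether the
+ * logical decoding context (stashed in rb->private_data) has streaming
+ * enabled.
+ */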
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes to stream
+ * (it may have been streamed just before the commit, in which case the
+ * commit would attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that has not been called yet, as the
+	 * transaction is still in progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have an invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there's no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because we might have
+		 * gained some new subtransactions after the last streaming run, and
+		 * we need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 15bb5ed..adb8f9d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -192,6 +193,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions, in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -227,6 +246,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Have we sent any changes for this transaction to the output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -257,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
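
To make the new eviction logic above easier to review at a glance, here is
a condensed sketch (not part of the patch) of what the decision in
ReorderBufferCheckMemoryLimit amounts to, assuming the usual kB-based
comparison against the logical_decoding_work_mem GUC:

	/* sketch: eviction decision when the memory limit is reached */
	if (rb->size < logical_decoding_work_mem * 1024L)
		return;					/* still below the limit, nothing to do */

	if (ReorderBufferCanStream(rb))
	{
		/* stream the largest toplevel transaction to the downstream */
		ReorderBufferTXN *txn = ReorderBufferLargestTopTXN(rb);

		ReorderBufferStreamTXN(rb, txn);
	}
	else
	{
		/* spill the largest (sub)transaction to disk */
		ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

		ReorderBufferSerializeTXN(rb, txn);
	}

Either way the evicted transaction ends up with txn->size == 0, which is
what the Asserts at the end of that function verify.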

Attachment: v10-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch (application/octet-stream)
From 7ef2ab25181de245718bc12e3ae565aa8c62f3fa Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:32:34 +0530
Subject: [PATCH v10 04/10] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from the system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of this sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 41 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 40 ++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  8 ++---
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++++--
 src/include/utils/snapmgr.h                     |  4 ++-
 6 files changed, 115 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index ace21ec..319349a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index db6fad7..24d0d7a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1304,6 +1304,15 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1423,6 +1432,14 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1537,6 +1554,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_hot_search_buffer call during logical decoding");
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1686,6 +1711,14 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_get_latest_tid call during logical decoding");
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5483,6 +5516,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_finish_speculative call during logical decoding");
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05..5b0ef72 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,19 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  We can't directly use TransactionIdDidAbort, because after
+	 * a crash such a transaction might not have been marked as aborted.  See
+	 * the detailed comments in snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +527,19 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  We can't directly use TransactionIdDidAbort, because after
+	 * a crash such a transaction might not have been marked as aborted.  See
+	 * the detailed comments in snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +666,19 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  We can't directly use TransactionIdDidAbort, because after
+	 * a crash such a transaction might not have been marked as aborted.  See
+	 * the detailed comments in snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3d6cbcf..3ca960c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -692,7 +692,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1551,7 +1551,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1802,7 +1802,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +1822,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c5..93a0c04 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This lets us re-check the XID's status while accessing the
+ * system catalogs.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the transaction is not committed yet. We don't
+	 * check whether the xid aborted; that happens during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13c..12f737b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1
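
The concurrent-abort detection in 0004 boils down to one recurring check;
extracted from the systable_* hunks above, the pattern is:

	/*
	 * While decoding an uncommitted transaction, CheckXidAlive holds its
	 * XID. If that XID is no longer in progress and did not commit, it
	 * must have aborted concurrently, so bail out and let the decoding
	 * logic handle it gracefully.
	 */
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));

The PG_CATCH block in the earlier reorderbuffer changes then catches exactly
this sqlerrcode, truncates whatever was already streamed, and stops the
stream instead of re-throwing.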

Attachment: v10-0008-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From 34a5848db37e831048142b0657ca7aea26af60fa Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v10 08/10] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71..086d0c7 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e..ad3ed13 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

Attachment: v10-0009-Add-TAP-test-for-streaming-vs.-DDL.patch (application/octet-stream)
From 995f6e4f47b697a85290a1ef9dead428e67c2e46 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v10 09/10] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1
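
One detail of the protocol extension in the next patch (0006) worth calling
out: when (and only when) a transaction is being streamed, each per-change
message is prefixed with the transaction's XID, so that the apply worker can
associate each change with the right in-progress transaction. From the
proto.c hunks below, the writer side looks roughly like this:

	pq_sendbyte(out, 'I');			/* action INSERT */

	/* transaction ID - only present when streaming */
	if (TransactionIdIsValid(xid))
		pq_sendint32(out, xid);

	/* use Oid as relation identifier */
	pq_sendint32(out, RelationGetRelid(rel));

UPDATE and the other message types get the same xid-prefix treatment.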

Attachment: v10-0006-Add-support-for-streaming-to-built-in-replicatio.patch (application/octet-stream)
From 941068d2af88d83f3bbb3ed9e0e81b32f9758e40 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 10 Feb 2020 11:00:48 +0530
Subject: [PATCH v10 06/10] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, so it can identify in-progress
transactions, and allow adding additional bits of information (e.g.
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying it on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we would have
nowhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    4 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   45 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    3 +
 src/backend/replication/logical/launcher.c         |    1 -
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  140 ++-
 src/backend/replication/logical/worker.c           | 1028 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  309 +++++-
 src/backend/replication/slotfuncs.c                |    7 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2034 insertions(+), 39 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 6dfb2e4..47b414c 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..3349cc4 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index f77a83b..7d7f721 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 119a9ce..54ca2d3 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -668,10 +690,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -696,6 +721,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -707,7 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -745,7 +778,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -782,7 +815,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7169509..eb00242 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4102,6 +4102,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfoString(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e..8156a42 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ec40755..b5d854f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1146,7 +1146,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1191,7 +1191,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = txn->final_lsn;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..5242ac0 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
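+/*
+ * Read the content of a STREAM START message, and return the XID of the
+ * streamed toplevel transaction.
+ */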
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
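+/*
+ * Write STREAM STOP to the output stream.
+ */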
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
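+/*
+ * Write STREAM COMMIT to the output stream.
+ */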
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID of the streamed transaction (must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
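+/*
+ * Read the content of a STREAM COMMIT message.
+ */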
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
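+/*
+ * Write STREAM ABORT to the output stream.
+ */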
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* toplevel XID and subtransaction XID (both must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
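+/*
+ * Read the content of a STREAM ABORT message (the toplevel XID and the
+ * XID of the aborted subtransaction).
+ */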
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 7a5471f..61388bd 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions has
+ * to deal with aborts of both the toplevel transaction and subtransactions.
+ * This is achieved by tracking offsets for subtransactions, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.
+ * This is necessary so that different workers processing a remote transaction
+ * with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -60,6 +82,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -67,6 +90,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -106,12 +130,59 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -163,6 +234,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the change to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -529,6 +636,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify the apply handlers that we're processing a streamed transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the serialized subxact info.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're most
+		 * likely aborting one of the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found PG_USED_FOR_ASSERTS_ONLY = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/* We should not receive aborts for unknown subtransactions. */
+		Assert(found);
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -541,6 +960,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -556,6 +978,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -591,6 +1016,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -695,6 +1123,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -830,6 +1261,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -929,6 +1363,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1020,6 +1457,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1117,6 +1570,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleaning up files for %d streamed transactions", nxids);
+
+	for (i = nxids - 1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1132,6 +1601,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1580,6 +2052,561 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
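+ *
+ * On-disk layout: a uint32 CRC32C checksum of the remaining data, the
+ * uint32 number of subxacts, and then an array of SubXactInfo entries
+ * (the subxact XID and the offset of its first change in the changes file).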
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we do free the memory allocated for the subxact info. There might
+	 * be one exceptional transaction with many subxacts, and we don't want
+	 * to keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as in the previous call,
+	 * so we can simply ignore it (it's already tracked).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
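+	/* The current end of the changes file is where the new subxact starts. */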
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry from the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts),
+	 * so a linear search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
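+ *
+ * On-disk record layout:
+ *   int32  len     length of the action byte plus message body
+ *   char   action  message type ('I', 'U', 'D', ...)
+ *   ...    body    message contents, without the subxact XID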
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -1745,6 +2772,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7525082..11e249e 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -44,17 +44,45 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent. So streamed transactions are
+ * handled separately, by tracking the streamed toplevel transactions the
+ * schema was already sent in (the streamed_txns list).
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
+	List	   *streamed_txns;	/* streamed toplevel transactions with
+								 * this schema */
 	bool		replicate_valid;
 	PublicationActions pubactions;
 } RelationSyncEntry;
@@ -63,11 +91,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -83,15 +117,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +180,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -149,6 +209,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -171,7 +232,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -191,6 +253,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient protocol version, and when the
+		 * output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -258,9 +341,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's the top-level transaction or a subtransaction (we have
+	 * already sent the toplevel XID at the start of the current streaming
+	 * block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied only at commit time (or not
+	 * at all, if aborted), and in an order that we don't know at this point
+	 * (regular transactions won't see their effects until then).
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send schema after each catalog change and it may
+		 * occur when streaming already started, so we have to track new catalog
+		 * changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -286,19 +408,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			set_schema_sent_in_streamed_txn(relentry, topxid);
+		else
+			relentry->schema_sent = true;
 	}
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -307,6 +436,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -335,14 +468,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -352,7 +485,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -361,7 +494,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -387,6 +520,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -407,13 +544,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -487,6 +625,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
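+/*
+ * Notify downstream that we're starting to stream a block of changes
+ * for this toplevel transaction.
+ */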
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
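+/*
+ * Notify downstream that we've finished streaming the current block of
+ * changes.
+ */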
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -523,6 +746,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the relation schema was already sent within the given
+ * streamed toplevel transaction. We expect a relatively small number of
+ * streamed transactions, so a simple list search is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
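+/*
+ * Remember that the schema was sent within the given streamed toplevel
+ * transaction.
+ */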
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  */
 static RelationSyncEntry *
@@ -597,6 +848,36 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -631,7 +912,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2c9d5de..30db2c2 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,13 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index abb533b..6ee7fa2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -968,6 +968,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index aecb601..146d7c4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -945,7 +945,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 2cc2dc4..277e44c 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6..0ebd140 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -169,6 +169,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
is($result, qq(1000|500), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: v10-0007-Track-statistics-for-streaming.patch (application/octet-stream)
From 444ef04d5d2c4620d098de0926da40405893c94d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Jan 2020 09:45:27 +0530
Subject: [PATCH v10 07/10] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8839699..2c7089e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2004,6 +2004,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber after
+      the memory used by logical decoding exceeds
+      <literal>logical_decoding_work_mem</literal>. Streaming only works with
+      toplevel transactions (subtransactions can't be streamed independently),
+      so the counter does not get incremented for subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the
+      subscriber. Transactions may get streamed repeatedly, and this counter
+      gets incremented on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c9e6060..d6f07d6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -786,7 +786,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f68b2e4..a3c4509 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3267,6 +3271,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Count the transaction only once, even if it is streamed repeatedly. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 6ee7fa2..21c4da0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1292,7 +1292,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1313,7 +1313,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or were
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2356,6 +2357,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3194,7 +3198,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3251,6 +3255,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3274,6 +3281,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3360,6 +3370,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3608,11 +3623,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2228256..969fa9e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5193,9 +5193,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index adb8f9d..5e1337e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -519,15 +519,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 366828f..3888b0c 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2ab2115..eed0088 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1983,9 +1983,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1

Attachment: v10-0010-Bugfix-handling-of-incomplete-toast-tuple.patch (application/octet-stream)
From 80bdec86bb5c2b9183e082e51558a7a81430a75c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 30 Jan 2020 14:21:04 +0530
Subject: [PATCH v10 10/10] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c                |   3 +
 src/backend/replication/logical/decode.c        |  17 ++-
 src/backend/replication/logical/reorderbuffer.c | 145 +++++++++++++-----------
 src/include/access/heapam_xlog.h                |   1 +
 src/include/replication/reorderbuffer.h         |  17 ++-
 5 files changed, 110 insertions(+), 73 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 24d0d7a..faaaf67 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2018,6 +2018,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 13a11ac..4ee528f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -734,7 +734,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -801,7 +803,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -858,7 +861,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -894,7 +898,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -999,7 +1003,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1037,7 +1041,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index a3c4509..0cadc28 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -650,7 +650,7 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
@@ -664,6 +664,28 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert, set the corresponding bit.  Otherwise, if
+	 * the toast insert bit is set and this is an insert/update, clear the
+	 * bit.
+	 */
+	if (toast_insert)
+		txn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) &&
+			 ((change->action == REORDER_BUFFER_CHANGE_INSERT) ||
+			 (change->action == REORDER_BUFFER_CHANGE_UPDATE)))
+		txn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * If this is a speculative insert, set the corresponding bit.  Otherwise,
+	 * if the speculative insert bit is set and this is the spec-confirm
+	 * record, clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		txn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+		txn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
@@ -696,7 +718,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1866,8 +1888,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
+						ReorderBufferToastAppendChunk(rb, txn, relation,
+													  change);
 					}
 
 			change_done:
@@ -2453,7 +2475,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2502,7 +2524,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2525,6 +2547,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2540,7 +2563,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 	/* if subxact, and streaming supported, use the toplevel instead */
 	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+		toptxn = txn->toptxn;
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2548,12 +2571,16 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+		if (toptxn)
+			toptxn->size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+		if (toptxn)
+			toptxn->size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2619,7 +2646,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
 	change->data.inval.msg = msg;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2806,15 +2833,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and doesn't have incomplete
+		 * data, remember it.
+		 */
+		if (((!largest) || (txn->size > largest->size)) &&
+			((txn->size > 0) && !rbtxn_has_toast_insert(txn) &&
+			 !rbtxn_has_spec_insert(txn)))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2832,66 +2860,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/* Loop until we are below the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we found a non-empty toplevel transaction to stream */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
 		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
-
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
-
-		ReorderBufferSerializeTXN(rb, txn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5e1337e..a2646c5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -174,6 +174,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -193,6 +195,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert, without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)
+
+/*
+ * This transaction's changes have a speculative insert, without the
+ * speculative confirm record.
+ */
+#define rbtxn_has_spec_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -547,7 +560,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

Attachment: v10-0001-Immediately-WAL-log-assignments.patch (application/octet-stream)
From 9e9840335d75f6fae6236dda243b5a3ba925463b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 09:53:08 +0530
Subject: [PATCH v10 01/10] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is still
required to avoid overflowing the snapshot on a hot standby.
---
 src/backend/access/transam/xact.c        | 45 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 +++++++++++++--------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e3c60f2..da32a4f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -5096,6 +5097,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -5998,3 +6000,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2fa0a7f..b11b0c2 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 32f0225..51b6485 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1186,6 +1186,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1224,6 +1225,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5e1dc8a..a99fcaf 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 record->toplevel_xid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033f..e23892a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196..756f6df 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -150,6 +150,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -285,6 +287,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1
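
A side note on the XLogSetRecordFlags hunk above: the function previously
overwrote curinsert_flags with a plain assignment, so the XLOG_INCLUDE_XID
flag set during record assembly could clobber an earlier request such as
XLOG_INCLUDE_ORIGIN. Here is a minimal standalone sketch (not backend code;
only the flag values are taken from xlog.h and this patch) of why the
record flags have to be OR-ed together:

#include <stdint.h>
#include <stdio.h>

#define XLOG_INCLUDE_ORIGIN   0x01	/* include the replication origin */
#define XLOG_MARK_UNIMPORTANT 0x02	/* record not important for durability */
#define XLOG_INCLUDE_XID      0x04	/* include XID of toplevel xact */

int
main(void)
{
	uint8_t		flags = 0;

	flags |= XLOG_INCLUDE_ORIGIN;	/* requested by the caller */
	flags |= XLOG_INCLUDE_XID;		/* added later during record assembly */

	/* with "flags = XLOG_INCLUDE_XID" the origin flag would be lost */
	printf("origin flag still set: %d\n",
		   (flags & XLOG_INCLUDE_ORIGIN) != 0);
	return 0;
}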

#226Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#225)
10 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Feb 11, 2020 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

The patch set was no longer applying on HEAD, so I have rebased it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v11-0001-Immediately-WAL-log-assignments.patch (application/octet-stream)
From 50809a61c8ae64fa844954da4d5f0dca6a7a4f85 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 09:53:08 +0530
Subject: [PATCH v11 01/10] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction each subxact belongs to, in order to decode all the
changes. Until now that information might be delayed until commit,
due to the subxid caching (PGPROC_MAX_CACHED_SUBXIDS), preventing
features that require incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead). However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT record, as it is still
required to avoid subxid overflow in the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 45 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 +++++++++++++--------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e3c60f2..da32a4f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -5096,6 +5097,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -5998,3 +6000,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and the assignment must not have been written to WAL yet */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2fa0a7f..b11b0c2 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 32f0225..51b6485 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1186,6 +1186,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1224,6 +1225,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5e1dc8a..a99fcaf 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 record->toplevel_xid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033f..e23892a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196..756f6df 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -150,6 +150,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -285,6 +287,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1
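
To illustrate the record format change in 0001: the toplevel XID travels
in the record header as a one-byte block ID (XLR_BLOCK_ID_TOPLEVEL_XID)
followed by the raw TransactionId, mirroring how the replication origin
is stored. The following standalone sketch round-trips that encoding; the
typedef and helper names are simplified stand-ins, not the backend
functions, which do this inside XLogRecordAssemble and DecodeXLogRecord:

#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <inttypes.h>

typedef uint32_t TransactionId;		/* stand-in for the backend typedef */

#define XLR_BLOCK_ID_TOPLEVEL_XID	252	/* as defined in xlogrecord.h */

/* Append the toplevel-XID header block, as XLogRecordAssemble does. */
static char *
encode_toplevel_xid(char *scratch, TransactionId xid)
{
	*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
	memcpy(scratch, &xid, sizeof(TransactionId));
	return scratch + sizeof(TransactionId);
}

/* Parse it back, as DecodeXLogRecord does for this block ID. */
static const char *
decode_toplevel_xid(const char *ptr, TransactionId *xid)
{
	if ((unsigned char) *ptr++ != XLR_BLOCK_ID_TOPLEVEL_XID)
		return NULL;
	memcpy(xid, ptr, sizeof(TransactionId));
	return ptr + sizeof(TransactionId);
}

int
main(void)
{
	char		buf[8];
	TransactionId xid = 0;

	encode_toplevel_xid(buf, 12345);
	decode_toplevel_xid(buf, &xid);
	printf("decoded toplevel xid: %" PRIu32 "\n", xid);	/* prints 12345 */
	return 0;
}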

v11-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch (application/octet-stream)
From bf22b8b25df23514a1f33144ac1fc0d4cc2880d4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:32:34 +0530
Subject: [PATCH v11 04/10] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such an error code,
the decoding logic aborts the ongoing decoding and returns
gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 41 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 40 ++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  8 ++---
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++++--
 src/include/utils/snapmgr.h                     |  4 ++-
 6 files changed, 115 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 3a95fb2..3a54b35 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index db6fad7..24d0d7a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1304,6 +1304,15 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1423,6 +1432,14 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1537,6 +1554,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_hot_search_buffer call during logical decoding");
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1686,6 +1711,14 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_get_latest_tid call during logical decoding");
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5483,6 +5516,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_hot_search call during logical decoding");
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05..5b0ef72 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -477,6 +478,19 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  We can't use TransactionIdDidAbort directly, because after
+	 * a crash such a transaction might not be marked as aborted.  See detailed
+	 * comments at snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -513,6 +527,19 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  We can't use TransactionIdDidAbort directly, because after
+	 * a crash such a transaction might not be marked as aborted.  See detailed
+	 * comments at snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -639,6 +666,19 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  We can't use TransactionIdDidAbort directly, because after
+	 * a crash such a transaction might not be marked as aborted.  See detailed
+	 * comments at snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3d6cbcf..3ca960c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -692,7 +692,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1551,7 +1551,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1802,7 +1802,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +1822,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c5..93a0c04 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This is to re-check the XID status during catalog access.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * setup CheckXidAlive if it's not committed yet. We don't check
+	 * if the xid aborted. That will happen during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13c..12f737b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1
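
The core of 0004 is the recheck the systable_* APIs perform while a
possibly-aborted transaction is being decoded. A minimal standalone sketch
of that control flow follows; the two transaction-status lookups are
stubbed out here (in the backend they are the procarray/clog calls of the
same name), so this illustrates the logic only:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId		((TransactionId) 0)
#define TransactionIdIsValid(xid)	((xid) != InvalidTransactionId)

/* set by SetupHistoricSnapshot while decoding an uncommitted xact */
static TransactionId CheckXidAlive = InvalidTransactionId;

/* Stubs standing in for the real procarray/clog lookups. */
static bool
TransactionIdIsInProgress(TransactionId xid)
{
	(void) xid;
	return false;
}

static bool
TransactionIdDidCommit(TransactionId xid)
{
	(void) xid;
	return false;
}

/*
 * The recheck done after each catalog scan step: if the decoded
 * transaction is neither in progress nor committed, it must have
 * aborted (possibly without a clog update, e.g. after a crash), so
 * the caller raises ERRCODE_TRANSACTION_ROLLBACK and decoding of
 * this transaction stops gracefully.
 */
static bool
decoded_xact_still_alive(void)
{
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		return false;
	return true;
}

int
main(void)
{
	CheckXidAlive = 100;		/* pretend we are decoding xid 100 */
	printf("alive: %d\n", decoded_xact_still_alive());	/* 0 => treated as aborted */
	return 0;
}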

v11-0002-Issue-individual-invalidations-with-wal_level-lo.patch (application/octet-stream)
From 47db44bc1ee6aac744eeef66537ca2023ad74c87 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH v11 02/10] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in
memory and writes them out only once, at commit time, which
reduces the performance impact by amortizing the overhead and
deduplicating the invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c          | 50 +++++++++++++++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 23 ++++++++
 src/backend/replication/logical/reorderbuffer.c | 55 +++++++++++++++---
 src/backend/utils/cache/inval.c                 | 75 +++++++++++++++++++++++++
 src/include/access/xact.h                       | 18 +++++-
 src/include/replication/reorderbuffer.h         | 14 +++++
 7 files changed, 234 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index fbc5942..6191060 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,11 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +401,14 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs,
+								xlrec->dbId, xlrec->tsId,
+								xlrec->relcacheInitFileInval);
+	}
 }
 
 const char *
@@ -423,7 +436,44 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs,
+						Oid dbId, Oid tsId,
+						bool relcacheInitFileInval)
+{
+	int			i;
+
+	if (relcacheInitFileInval)
+		appendStringInfo(buf, "; relcache init file inval dbid %u tsid %u",
+						 dbId, tsId);
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index da32a4f..c9a64bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5997,6 +5997,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index a99fcaf..13a11ac 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,29 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/* XXX for now we're issuing invalidations one by one */
+				Assert(invals->nmsgs == 1);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->dbId, invals->tsId,
+											 invals->relcacheInitFileInval,
+											 invals->msgs[0]);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index aeebbf2..3d6cbcf 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -473,6 +473,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
@@ -1822,17 +1823,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation message locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					LocalExecuteInvalidationMessage(&change->data.inval.msg);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2212,6 +2218,38 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn,
+							 Oid dbId, Oid tsId, bool relcacheInitFileInval,
+							 SharedInvalidationMessage msg)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	Assert(xid != InvalidTransactionId);
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.dbId = dbId;
+	change->data.inval.tsId = tsId;
+	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
+	change->data.inval.msg = msg;
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2658,6 +2696,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -2765,6 +2804,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			/* ReorderBufferChange contains everything important */
 			break;
 	}
@@ -3050,6 +3090,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
 			break;
 	}
 
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..e0d04b9 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write individual invalidations into WAL to support
+ *	the decoding of in-progress transactions.  Until now it was enough to
+ *	log invalidations only at commit time, because we only decoded the
+ *	transaction once it had committed.  We only need to log catalog cache and
+ *	relcache invalidations; there cannot be any active MVCC scan in logical
+ *	decoding, so we don't need to log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,9 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -489,6 +499,18 @@ RegisterCatcacheInvalidation(int cacheId,
 {
 	AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   cacheId, hashValue, dbId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cc.id = (int8) cacheId;
+		msg.cc.dbId = dbId;
+		msg.cc.hashValue = hashValue;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -501,6 +523,18 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 {
 	AddCatalogInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								  dbId, catId);
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.cat.id = SHAREDINVALCATALOG_ID;
+		msg.cat.dbId = dbId;
+		msg.cat.catId = catId;
+
+		LogLogicalInvalidations(1, &msg, false);
+	}
 }
 
 /*
@@ -511,6 +545,8 @@ RegisterCatalogInvalidation(Oid dbId, Oid catId)
 static void
 RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 {
+	bool		RelcacheInitFileInval = false;
+
 	AddRelcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
 								   dbId, relId);
 
@@ -529,7 +565,22 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
 	 * as well.  Also zap when we are invalidating whole relcache.
 	 */
 	if (relId == InvalidOid || RelationIdIsInInitFile(relId))
+	{
 		transInvalInfo->RelcacheInitFileInval = true;
+		RelcacheInitFileInval = true;
+	}
+
+	/* Issue an invalidation WAL record (when wal_level=logical) */
+	if (XLogLogicalInfoActive())
+	{
+		SharedInvalidationMessage msg;
+
+		msg.rc.id = SHAREDINVALRELCACHE_ID;
+		msg.rc.dbId = dbId;
+		msg.rc.relId = relId;
+
+		LogLogicalInvalidations(1, &msg, RelcacheInitFileInval);
+	}
 }
 
 /*
@@ -1501,3 +1552,27 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(int nmsgs, SharedInvalidationMessage *msgs,
+						bool relcacheInitFileInval)
+{
+	xl_xact_invalidations xlrec;
+
+	/* prepare record */
+	memset(&xlrec, 0, sizeof(xlrec));
+	xlrec.dbId = MyDatabaseId;
+	xlrec.tsId = MyDatabaseTableSpace;
+	xlrec.relcacheInitFileInval = relcacheInitFileInval;
+	xlrec.nmsgs = nmsgs;
+
+	/* perform insertion */
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+	XLogRegisterData((char *) msgs,
+					 nmsgs * sizeof(SharedInvalidationMessage));
+	XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b38..6f2a583 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,22 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ *
+ * XXX Currently nmsgs=1 but that might change in the future.
+ */
+typedef struct xl_xact_invalidations
+{
+	Oid			dbId;			/* MyDatabaseId */
+	Oid			tsId;			/* MyDatabaseTableSpace */
+	bool		relcacheInitFileInval;	/* invalidate relcache init file */
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..9a3f045 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,16 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			Oid			dbId;	/* MyDatabaseId */
+			Oid			tsId;	/* MyDatabaseTableSpace */
+			bool		relcacheInitFileInval;	/* invalidate relcache init
+												 * file */
+			SharedInvalidationMessage msg;	/* invalidation message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +470,9 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  Oid dbId, Oid tsId, bool relcacheInitFileInval,
+								  SharedInvalidationMessage msg);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1
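
For the new XLOG_XACT_INVALIDATIONS record, the payload is a fixed header
plus a flexible array of invalidation messages, so the record size is
MinSizeOfXactInvalidations plus nmsgs entries. A standalone sketch of that
sizing follows; SharedInvalidationMessage is reduced to a plain struct
here (in the backend it is a union defined in sinval.h):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

typedef uint32_t Oid;

/* simplified stand-in for the sinval.h union */
typedef struct SharedInvalidationMessage
{
	int8_t		id;
	Oid			dbId;
	uint32_t	hashValue;
} SharedInvalidationMessage;

typedef struct xl_xact_invalidations
{
	Oid			dbId;			/* MyDatabaseId */
	Oid			tsId;			/* MyDatabaseTableSpace */
	bool		relcacheInitFileInval;	/* invalidate relcache init file */
	int			nmsgs;			/* number of shared inval msgs */
	SharedInvalidationMessage msgs[];	/* FLEXIBLE_ARRAY_MEMBER */
} xl_xact_invalidations;

#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)

int
main(void)
{
	int			nmsgs = 1;		/* the patch currently logs one message at a time */
	size_t		total = MinSizeOfXactInvalidations +
						nmsgs * sizeof(SharedInvalidationMessage);

	printf("fixed header: %zu bytes, record payload: %zu bytes\n",
		   (size_t) MinSizeOfXactInvalidations, total);
	return 0;
}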

v11-0003-Extend-the-output-plugin-API-with-stream-methods.patch (application/octet-stream)
From 1680c9678f93474063293a7051d8400afba97662 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v11 03/10] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..64f651f 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bce6d37..3a95fb2 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and
+    <function>stream_truncate_cb</function>).
+   </para>
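+
+   <para>
+    An output plugin indicates streaming support by filling in the
+    corresponding callbacks in <function>_PG_output_plugin_init</function>.
+    A minimal sketch may look like this (the <literal>my_*</literal>
+    implementations are assumed to exist elsewhere in the plugin):
+<programlisting>
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+    /* regular (non-streaming) callbacks */
+    cb->begin_cb = my_begin;
+    cb->change_cb = my_change;
+    cb->commit_cb = my_commit;
+
+    /* required streaming callbacks */
+    cb->stream_start_cb = my_stream_start;
+    cb->stream_stop_cb = my_stream_stop;
+    cb->stream_change_cb = my_stream_change;
+    cb->stream_commit_cb = my_stream_commit;
+    cb->stream_abort_cb = my_stream_abort;
+
+    /* optional: stream_message_cb and stream_truncate_cb may stay NULL */
+}
+</programlisting>
+   </para>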
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
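+
+   <para>
+    For example, if a streamed transaction is rolled back, the sequence may
+    end with an abort instead (an illustrative sketch):
+<programlisting>
+stream_start_cb(...);   &lt;-- start of a block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of the block of changes
+
+stream_abort_cb(...);   &lt;-- abort of the streamed transaction
+</programlisting>
+   </para>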
+
+   <para>
+    Similar to the spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and streamed.
+   </para>
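+
+   <para>
+    For example, to allow roughly 64MB of decoded changes to accumulate
+    before streaming (or spilling) kicks in, one might use an illustrative
+    setting like:
+<programlisting>
+SET logical_decoding_work_mem = '64MB';
+</programlisting>
+   </para>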
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index e3da7d3..ec40755 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the start/stop/change/commit/abort
+	 * callbacks. The message and truncate callbacks are optional, similar
+	 * to regular output plugins. However, we enable streaming when at least
+	 * one of the methods is provided, so that we can easily detect missing
+	 * (required) methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional, so
+	 * we do not fail with an ERROR when they are missing; the wrappers
+	 * simply do nothing in that case. We must still set the ReorderBuffer
+	 * callbacks to something, otherwise calls from there would crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -860,6 +908,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f..f24e246 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..0d0a94a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9a3f045..15bb5ed 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -356,6 +356,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -395,6 +441,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

Attachment: v11-0005-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From 74072fe1077efa3bb7e689cb23bba285b6a254d3 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 10 Jan 2020 18:03:27 -0300
Subject: [PATCH v11 05/10] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds a ReorderBufferTXN pointer in two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
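
For illustration, a plugin might record which xid each streamed change
came from, and discard only the matching changes when stream_abort_cb
fires. A hypothetical sketch (my_forget_subxact_changes and
my_forget_transaction stand in for plugin-side bookkeeping and are not
part of this patch):

    static void
    my_stream_abort(struct LogicalDecodingContext *ctx,
                    ReorderBufferTXN *txn, XLogRecPtr abort_lsn)
    {
        /*
         * txn->toptxn is only set for subxacts, so we can tell a subxact
         * abort from a toplevel abort and discard only the affected
         * changes. (Both helpers below are hypothetical.)
         */
        if (txn->toptxn != NULL)
            my_forget_subxact_changes(txn->xid);
        else
            my_forget_transaction(txn->xid);
    }
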
---
 src/backend/access/heap/heapam_visibility.c     |  38 +-
 src/backend/replication/logical/reorderbuffer.c | 710 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  36 ++
 3 files changed, 693 insertions(+), 91 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..160b167 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3ca960c..f68b2e4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -769,6 +782,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -864,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -987,7 +1035,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1023,6 +1071,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1037,6 +1088,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1320,6 +1374,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1345,8 +1408,93 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1354,9 +1502,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1495,63 +1640,75 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has catalog updates, we might decode a tuple using the
+ * wrong catalog version.  So to detect a concurrent abort we set
+ * CheckXidAlive to the xid of the current (sub)transaction to which this
+ * change belongs.  During catalog scans we can then check the status of that
+ * xid, and if it is aborted we report a specific error which we can ignore.
+ * We might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine, because when we decode the abort we
+ * will stream an abort message to truncate the changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether the xid aborted; that will happen during catalog access.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
-	}
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true, the data will be sent using the
+ * stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 
-	/* build data to be able to lookup the CommandIds of catalog tuples */
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1567,15 +1724,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1583,6 +1745,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1592,8 +1767,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1659,7 +1832,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1680,8 +1861,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1699,7 +1878,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1757,7 +1936,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1766,10 +1953,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1800,9 +1993,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1822,7 +2015,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
 					}
 
 					break;
@@ -1860,14 +2054,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last last of the stream as the final lsn before calling
+			 * stream stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+
+			/* Avoid copying if it's already copied. */
+			if (snapshot_now->copied)
+				txn->snapshot_now = snapshot_now;
+			else
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1885,14 +2111,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,18 +2145,122 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+
+				/* Avoid copying if it's already copied. */
+				if (snapshot_now->copied)
+					txn->snapshot_now = snapshot_now;
+				else
+					txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+															  txn, command_id);
+				/*
+				 * Set the last LSN of the stream as the final lsn before
+				 * calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+
+				FlushErrorState();
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
 /*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2360,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2502,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2520,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2532,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2582,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2300,6 +2672,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2404,6 +2783,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming, so their size is always 0).
+ * But we can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2423,15 +2834,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2734,6 +3176,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check whether the transaction has any changes to
+ * stream (it might have been streamed just before the commit, and the
+ * commit may then attempt to stream it again with nothing to send)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't
+		 * beat the LSN condition in the previous branch (so there's no need
+		 * to walk through the subxacts again). In fact, we must not do that,
+		 * as we may be using a snapshot taken half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 15bb5ed..adb8f9d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -192,6 +193,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -227,6 +246,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Have we sent any changes for this transaction to the output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction of this subxact (NULL for toplevel transactions).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -257,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
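
A note to make the eviction logic in the patch above easier to review: with
streaming enabled, the whole decision collapses to picking the largest
toplevel transaction (subxact sizes are rolled up into the top); otherwise
we fall back to the old spill-to-disk path. Here is a condensed sketch with
the memory-limit test written out explicitly -- the exact GUC comparison
lives in 0001 and is not quoted in this mail, so treat that line as an
assumption about its form, not the actual code:

    static void
    ReorderBufferCheckMemoryLimit_sketch(ReorderBuffer *rb)
    {
        ReorderBufferTXN *txn;

        /* assumed form of the limit check, implemented in 0001 */
        if (rb->size < logical_decoding_work_mem * 1024L)
            return;

        if (ReorderBufferCanStream(rb))
        {
            /* subxact sizes are accounted to the toplevel txn */
            txn = ReorderBufferLargestTopTXN(rb);
            ReorderBufferStreamTXN(rb, txn);
        }
        else
        {
            txn = ReorderBufferLargestTXN(rb);
            ReorderBufferSerializeTXN(rb, txn);
        }

        /* either way, the evicted transaction should use no memory */
        Assert(txn->size == 0);
    }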

Attachment: v11-0007-Track-statistics-for-streaming.patch (application/octet-stream)
From 044d924ca93ca8adf88ccc388eb5b27299f10d60 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 11 Feb 2020 12:10:43 +0530
Subject: [PATCH v11 07/10] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)
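
Reviewer note on the stats plumbing below: both sides follow the existing
spill_* pattern, i.e. the shared counters are only touched while holding the
walsender spinlock. A minimal sketch of the two sides (this assumes the
SpinLockAcquire earlier in UpdateSpillStats(), which the hunk below does not
show):

    /* writer (walsender), publishing the reorderbuffer counters */
    SpinLockAcquire(&MyWalSnd->mutex);
    MyWalSnd->streamTxns = rb->streamTxns;
    MyWalSnd->streamCount = rb->streamCount;
    MyWalSnd->streamBytes = rb->streamBytes;
    SpinLockRelease(&MyWalSnd->mutex);

    /* reader (pg_stat_get_wal_senders), copying to locals first */
    SpinLockAcquire(&walsnd->mutex);
    streamTxns = walsnd->streamTxns;
    streamCount = walsnd->streamCount;
    streamBytes = walsnd->streamBytes;
    SpinLockRelease(&walsnd->mutex);

so the result tuple is built without the lock held.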

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9129f79..ecf1c57 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2016,6 +2016,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber
+      after the memory used by logical decoding exceeds
+      <literal>logical_decoding_work_mem</literal>.  Streaming only works for
+      toplevel transactions (subtransactions cannot be streamed
+      independently), so the counter is not incremented for subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the
+      subscriber.  A transaction may be streamed repeatedly, and this counter
+      is incremented on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the
+      subscriber.</entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f681aaf..2ede8f3 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -787,7 +787,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f68b2e4..a3c4509 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3267,6 +3271,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count the transaction if it has already been streamed before. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 6ee7fa2..21c4da0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1292,7 +1292,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1313,7 +1313,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or streamed to
+	 * subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2356,6 +2357,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3194,7 +3198,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3251,6 +3255,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3274,6 +3281,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3360,6 +3370,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3608,11 +3623,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 226c904..7f95079 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5193,9 +5193,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index adb8f9d..5e1337e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -519,15 +519,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 366828f..3888b0c 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 634f825..f9c30e8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1984,9 +1984,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1
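
Before the next patch: 0006 spools streamed changes into per-transaction
files on the subscriber, using a trivial length-prefixed record format. The
reader side is visible in apply_handle_stream_commit() below (read an int
length, read that many bytes, hand the buffer to apply_dispatch()); the
matching writer is stream_write_change(), whose body is not quoted in this
mail, so the following is an inferred sketch rather than the actual code:

    /* sketch: append one change record to the spool file */
    static void
    stream_write_change_sketch(char action, StringInfo s)
    {
        int     len;

        /* action byte plus the remaining message payload */
        len = (s->len - s->cursor) + 1;

        /* length prefix, then the data; error checks omitted here */
        write(stream_fd, &len, sizeof(len));
        write(stream_fd, &action, sizeof(action));
        write(stream_fd, &s->data[s->cursor], s->len - s->cursor);
    }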

Attachment: v11-0006-Add-support-for-streaming-to-built-in-replicatio.patch (application/octet-stream)
From 1a15a960aa9eeb9f47173e1897866864394045bc Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 10 Feb 2020 11:00:48 +0530
Subject: [PATCH v11 06/10] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information
(e.g. the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    4 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   45 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    3 +
 src/backend/replication/logical/launcher.c         |    1 -
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  140 ++-
 src/backend/replication/logical/worker.c           | 1028 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  309 +++++-
 src/backend/replication/slotfuncs.c                |    7 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2034 insertions(+), 39 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
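
For orientation while reading the proto.c changes below, this is the wire
framing the patch introduces (summarized from the logicalrep_write_stream_*
functions in this diff; the dispatch snippet at the end is condensed -- the
real code routes through apply_handle_stream_start()):

    /*
     * New stream control messages:
     *
     *   'S'  int32 xid, int8 first_segment        STREAM START
     *   'E'                                       STREAM END
     *   'c'  int32 xid, int8 flags,
     *        int64 commit_lsn, int64 end_lsn,
     *        int64 commit_time                    STREAM COMMIT
     *   'A'  int32 xid, int32 subxid              STREAM ABORT
     *
     * Existing messages ('I', 'U', 'D', 'T', 'R', 'Y') gain a leading
     * int32 XID of the (sub)transaction, present only while streaming.
     */

    /* e.g. the receiving side dispatches on the action byte: */
    case 'S':
        stream_xid = logicalrep_read_stream_start(s, &first_segment);
        break;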

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8bead..95b7c24 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..3349cc4 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index f77a83b..7d7f721 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 119a9ce..54ca2d3 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -668,10 +690,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -696,6 +721,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -707,7 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -745,7 +778,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -782,7 +815,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7169509..eb00242 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4102,6 +4102,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e..8156a42 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ec40755..b5d854f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1146,7 +1146,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1191,7 +1191,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = txn->final_lsn;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..5242ac0 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (the transaction being committed, so must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* both the toplevel XID and the subxact XID must be valid */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 7a5471f..61388bd 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires dealing with aborts of both the toplevel transaction and its
+ * subtransactions. This is achieved by tracking offsets of subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -60,6 +82,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -67,6 +90,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -106,12 +130,59 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -163,6 +234,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -529,6 +636,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the serialized subxact info.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
+	 * just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found PG_USED_FOR_ASSERTS_ONLY = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/* We should not receive aborts for unknown subtransactions. */
+		Assert(found);
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the handlers invoked from apply_dispatch are aware we're
+	 * in a remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -541,6 +960,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -556,6 +978,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -591,6 +1016,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -695,6 +1123,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -830,6 +1261,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -929,6 +1363,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1020,6 +1457,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1117,6 +1570,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1132,6 +1601,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1580,6 +2052,561 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we do free the memory allocated for the subxact info. There might
+	 * be one exceptional transaction with many subxacts, and we don't want to
+	 * keep the memory allocated forever (the next streamed transaction will
+	 * reallocate it if needed).
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so we can simply skip it (this change necessarily comes
+	 * later in the file, so the recorded offset still points to the first
+	 * change of the subxact).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
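
To make the bookkeeping easier to follow, here is the assumed lifecycle of
the subxact array for one streamed chunk, plus an illustrative (not actual)
helper showing how the recorded offsets make subtransaction aborts cheap:
discarding a subxact amounts to truncating the changes file at its first
change.

    /* assumed lifecycle, based on the functions above:
     *
     *   stream_start  -> subxact_info_read()    restore offsets from disk
     *   each change   -> subxact_info_add()     record subxact's first change
     *   stream_stop   -> subxact_info_write()   persist the array, free it
     */
    static void
    discard_subxact_changes(SubXactInfo *subxact)
    {
        /* throw away everything the subxact wrote to the changes file */
        if (ftruncate(stream_fd, subxact->offset) != 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not truncate streamed changes file: %m")));
    }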
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
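
So, for a subscription with OID 16394 streaming transaction 1234, the worker
would (assuming the default tablespace's temporary directory) spill into
files like:

    base/pgsql_tmp/logical-16394-1234.changes
    base/pgsql_tmp/logical-16394-1234.subxacts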
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so we handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error even if the files cannot be
+ * removed.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Remove the XID by moving the last entry from the array into its
+	 * place. We don't keep the streamed transactions sorted or anything -
+	 * we only expect a few of them in progress (max_connections +
+	 * max_prepared_xacts) so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * handling the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: length (not including the
+ * length field itself), action code (identifying the message type), and
+ * the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
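
A minimal sketch of how the apply side can read these records back (the
actual replay routine lives elsewhere in the patch; this only illustrates
the len/action/payload framing written above, and the function name is
illustrative):

    static bool
    read_one_change(int fd, char *action, StringInfo buf)
    {
        int         len;
        ssize_t     nread;

        nread = read(fd, &len, sizeof(len));
        if (nread == 0)
            return false;       /* clean EOF - no more changes */

        if (nread != sizeof(len) ||
            read(fd, action, sizeof(char)) != sizeof(char))
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not read streamed change: %m")));

        /* the stored length includes the action byte */
        len -= sizeof(char);

        resetStringInfo(buf);
        enlargeStringInfo(buf, len);
        if (read(fd, buf->data, len) != len)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not read streamed change: %m")));
        buf->len = len;
        buf->data[len] = '\0';

        return true;
    }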
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -1745,6 +2772,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7525082..11e249e 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -44,17 +44,45 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order in which the transactions are sent. So streamed transactions
+ * are handled separately, by tracking the XIDs for which the schema was
+ * already sent in the streamed_txns list.
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
+	List	   *streamed_txns;	/* streamed toplevel transactions with
+								 * this schema */
 	bool		replicate_valid;
 	PublicationActions pubactions;
 } RelationSyncEntry;
@@ -63,11 +91,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -83,15 +117,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +180,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
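
With this option in place, a subscriber opts into streaming through the
option list of START_REPLICATION, e.g. (slot and publication names are
illustrative):

    START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
        (proto_version '2', publication_names '"tap_pub"', streaming 'on')
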
@@ -149,6 +209,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -171,7 +232,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -191,6 +253,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version,
+		 * and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -258,9 +341,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't
+	 * care whether it's the top-level transaction or a subtransaction (we
+	 * have already sent the top-level XID at the start of the current
+	 * streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied only later (or not at all,
+	 * if aborted), and in a commit order that we don't know at this point;
+	 * regular transactions won't see their effects until then either.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change,
+		 * and such a change may occur after streaming has already started,
+		 * so we have to track new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -286,19 +408,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			set_schema_sent_in_streamed_txn(relentry, topxid);
+		else
+			relentry->schema_sent = true;
 	}
 }
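
To see why the per-transaction tracking matters, consider two large
transactions A and B streamed in interleaved chunks (a hypothetical
timeline):

    stream chunk of A: schema for relation t sent, recorded for A only
    stream chunk of B: first change for t - schema must be sent again for B
    commit B, then commit A

With only the shared schema_sent flag, B would have skipped the schema even
though the downstream applies B before A commits (or A may abort entirely),
leaving B without the relation metadata.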
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -307,6 +436,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -335,14 +468,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -352,7 +485,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -361,7 +494,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -387,6 +520,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -407,13 +544,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -487,6 +625,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
+ * Notify downstream about the start of a new chunk of changes streamed for
+ * a (possibly still in-progress) transaction.
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * Notify downstream about the end of the current chunk of streamed changes.
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
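
Taken together, a streamed transaction reaches the subscriber as a sequence
like the following (a sketch using the callback names, not the wire format):

    stream_start(xid, first_segment = true)
        ... schema and change messages, tagged with their subxact XIDs ...
    stream_stop()
    stream_start(xid, first_segment = false)
        ... more changes ...
    stream_stop()
    stream_commit(xid)          or          stream_abort(xid, subxid)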
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -523,6 +746,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema was already sent for the given streamed
+ * (toplevel) transaction.
+ *
+ * We expect a relatively small number of streamed transactions, so the
+ * linear search through the list is fine.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Remember that the schema was sent for the given streamed (toplevel)
+ * transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  */
 static RelationSyncEntry *
@@ -597,6 +848,36 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -631,7 +912,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2c9d5de..30db2c2 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,13 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index abb533b..6ee7fa2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -968,6 +968,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index aecb601..146d7c4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -945,7 +945,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 2cc2dc4..277e44c 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
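
The negotiation is one-sided: the subscriber simply requests the newer
protocol version. Requesting streaming under the old protocol fails up
front, per the check in pgoutput_startup above, e.g.:

    (proto_version '1', streaming 'on')
    ERROR:  requested proto_version=1 does not support streaming, need 2 or higher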
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6..0ebd140 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -169,6 +169,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check columns added by DDL contain the replicated values');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back subtransactions with DDL are handled correctly');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: v11-0008-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From 2b5c0bcf4d6e9bb4896b5530595047de26167793 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v11 08/10] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71..086d0c7 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e..ad3ed13 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v11-0009-Add-TAP-test-for-streaming-vs.-DDL.patch (application/octet-stream)
From 7ce58a6151eb1a3afe1a1510f3547d8c1cb8a27f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v11 09/10] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1
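
As a cross-check of the expected output above: the rows 4001-5000 are
discarded by ROLLBACK TO SAVEPOINT s10, so the subscriber ends up with
the 2 preexisting rows plus 998 + 6 x 1000 replicated inserts = 7000
rows. DDL is not replicated, so the subscriber's extra columns are
filled from local defaults (NULL here): c is non-NULL for every batch
from 1001 onwards (6000 rows), d only for the batches inserted while
the publisher's table had d (2001-3000, 3001-4000, 5001-6000 and
7001-8000, i.e. 4000 rows), and e for 3001-4000, 5001-6000, 6001-7000
and 7001-8000 (again 4000 rows), hence 7000|7000|7000|6000|4000|4000.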

v11-0010-Bugfix-handling-of-incomplete-toast-tuple.patch (application/octet-stream)
From 170dfd601dcb64d9969436cd6eba35d518280716 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 30 Jan 2020 14:21:04 +0530
Subject: [PATCH v11 10/10] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c                |   3 +
 src/backend/replication/logical/decode.c        |  17 ++-
 src/backend/replication/logical/reorderbuffer.c | 145 +++++++++++++-----------
 src/include/access/heapam_xlog.h                |   1 +
 src/include/replication/reorderbuffer.h         |  17 ++-
 5 files changed, 110 insertions(+), 73 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 24d0d7a..faaaf67 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2018,6 +2018,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 13a11ac..4ee528f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -734,7 +734,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -801,7 +803,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -858,7 +861,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -894,7 +898,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -999,7 +1003,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1037,7 +1041,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index a3c4509..0cadc28 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -650,7 +650,7 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
@@ -664,6 +664,28 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert, set the corresponding bit.  Otherwise, if
+	 * the toast-insert bit is set and this is an insert/update, clear the
+	 * bit.
+	 */
+	if (toast_insert)
+		txn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) &&
+			 ((change->action == REORDER_BUFFER_CHANGE_INSERT) ||
+			 (change->action == REORDER_BUFFER_CHANGE_UPDATE)))
+		txn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * If this is a speculative insert, set the corresponding bit.
+	 * Otherwise, if the speculative-insert bit is set and this is a spec
+	 * confirm record, clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		txn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+		txn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
@@ -696,7 +718,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1866,8 +1888,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
+						ReorderBufferToastAppendChunk(rb, txn, relation,
+													  change);
 					}
 
 			change_done:
@@ -2453,7 +2475,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2502,7 +2524,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2525,6 +2547,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2540,7 +2563,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 	/* if subxact, and streaming supported, use the toplevel instead */
 	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+		toptxn = txn->toptxn;
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2548,12 +2571,16 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+		if (toptxn)
+			toptxn->size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+		if (toptxn)
+			toptxn->size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2619,7 +2646,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	change->data.inval.relcacheInitFileInval = relcacheInitFileInval;
 	change->data.inval.msg = msg;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2806,15 +2833,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and does not contain
+		 * incomplete data, remember it.
+		 */
+		if (((!largest) || (txn->size > largest->size)) &&
+			((txn->size > 0) && !rbtxn_has_toast_insert(txn) &&
+			!rbtxn_has_spec_insert(txn)))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2832,66 +2860,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/* Loop until we are below the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* found a toplevel transaction with complete data to stream */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
 		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
-
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
-
-		ReorderBufferSerializeTXN(rb, txn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5e1337e..a2646c5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -174,6 +174,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -193,6 +195,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* Do this transaction's changes include a toast insert without the main-table insert? */
+#define rbtxn_has_toast_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)
+
+/*
+ * Do this transaction's changes include a speculative insert without the
+ * speculative confirm?
+ */
+#define rbtxn_has_spec_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -547,7 +560,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool incomplete_data);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1
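
To make the eviction rule in this patch concrete: a transaction whose
most recent change is a toast chunk (or an unconfirmed speculative
insert) must not be picked for streaming, or the downstream would see
only part of a tuple. Here is a minimal standalone C sketch of that
flag protocol (toy code, not from the patch; all names are invented):

#include <stdbool.h>
#include <stdio.h>

#define HAS_TOAST_INSERT 0x01
#define HAS_SPEC_INSERT  0x02

typedef enum
{
	CH_INSERT, CH_UPDATE, CH_SPEC_INSERT, CH_SPEC_CONFIRM
} ChangeType;

typedef struct
{
	unsigned int flags;
} Txn;

/* mirrors the flag updates in ReorderBufferQueueChange */
static void
queue_change(Txn *txn, ChangeType type, bool toast_insert)
{
	if (toast_insert)
		txn->flags |= HAS_TOAST_INSERT;
	else if ((txn->flags & HAS_TOAST_INSERT) &&
			 (type == CH_INSERT || type == CH_UPDATE))
		txn->flags &= ~HAS_TOAST_INSERT;	/* main-table change completes the chain */

	if (type == CH_SPEC_INSERT)
		txn->flags |= HAS_SPEC_INSERT;
	else if (type == CH_SPEC_CONFIRM)
		txn->flags &= ~HAS_SPEC_INSERT;
}

/* a transaction is a streaming candidate only with no incomplete data */
static bool
streamable(const Txn *txn)
{
	return txn->flags == 0;
}

int
main(void)
{
	Txn			txn = {0};

	queue_change(&txn, CH_INSERT, true);	/* toast chunk arrives first */
	printf("after toast chunk: streamable = %d\n", streamable(&txn));
	queue_change(&txn, CH_INSERT, false);	/* main-table insert completes it */
	printf("after main insert: streamable = %d\n", streamable(&txn));
	return 0;
}

ReorderBufferLargestTopTXN applies the same test when picking a
streaming candidate; if every large transaction currently has
incomplete data, the loop in ReorderBufferCheckMemoryLimit falls back
to serializing to disk instead.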

#227Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#226)
10 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Feb 13, 2020 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Feb 11, 2020 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

The patch set no longer applied on HEAD, so I have rebased it.

I have changed patch 0002 so that instead of writing a WAL record for
each invalidation, we now log them at each command end, as discussed
upthread [1].

We will soon evaluate the performance impact of this and post the results.

[1]: /messages/by-id/CAA4eK1LOa+2KqNX=m=1qMBDW+o50AuwjAOX6ZqL-rWGiH1F9MQ@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v12-0001-Immediately-WAL-log-assignments.patch (application/octet-stream)
From b092f8ab260a8616e093c5d075010a307eae41ee Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 09:53:08 +0530
Subject: [PATCH v12 01/10] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT record, as that is still
required to avoid overflowing the snapshot on a hot standby.
---
 src/backend/access/transam/xact.c        | 45 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 +++++++++++++--------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e3c60f2..da32a4f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -5096,6 +5097,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -5998,3 +6000,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and the assignment must not have been written to WAL yet */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2fa0a7f..b11b0c2 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -88,11 +88,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -194,6 +196,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -397,7 +403,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -743,6 +749,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 32f0225..51b6485 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1186,6 +1186,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1224,6 +1225,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5e1dc8a..a99fcaf 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 record->toplevel_xid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrIds) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033f..e23892a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -227,6 +227,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196..756f6df 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -150,6 +150,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -285,6 +287,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1
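
For readers skimming the diff, the essence of this patch reduced to a
toy: the first WAL record assembled inside a subtransaction
piggy-backs the toplevel XID, and a per-transaction-state flag makes
sure it is emitted only once. A standalone C sketch (not PostgreSQL
code; the names are invented for illustration):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;

typedef struct
{
	TransactionId xid;			/* this subtransaction's XID */
	TransactionId top_xid;		/* toplevel XID, 0 for a toplevel xact */
	bool		assigned;		/* toplevel XID already written to WAL? */
} SubXactState;

/* mirrors IsSubTransactionAssignmentPending: include the toplevel XID? */
static bool
assignment_pending(const SubXactState *s)
{
	return s->top_xid != 0 && !s->assigned;
}

/* mirrors MarkSubTransactionAssigned, called once the record is inserted */
static void
mark_assigned(SubXactState *s)
{
	s->assigned = true;
}

int
main(void)
{
	SubXactState sub = {1001, 1000, false};
	int			rec;

	for (rec = 0; rec < 3; rec++)
	{
		if (assignment_pending(&sub))
		{
			printf("subxact %u, record %d: piggy-back toplevel xid %u\n",
				   sub.xid, rec, sub.top_xid);
			mark_assigned(&sub);
		}
		else
			printf("subxact %u, record %d: no assignment needed\n",
				   sub.xid, rec);
	}
	return 0;
}

On the decoding side, XLogRecGetTopXid() returns this XID (or
InvalidTransactionId), which is all ReorderBufferAssignChild needs to
wire the subxact to its parent incrementally.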

v12-0002-Issue-individual-invalidations-with-wal_level-lo.patch (application/octet-stream)
From 0e124734e61a36407a5907b9bd2d514d5b3764b3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH v12 02/10] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulated all the invalidations in
memory and wrote them out only once, at commit time, which reduces
the performance impact by amortizing the overhead and
deduplicating the invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c          |  40 +++++++++
 src/backend/access/transam/xact.c               |   7 ++
 src/backend/replication/logical/decode.c        |  18 ++++
 src/backend/replication/logical/reorderbuffer.c | 110 +++++++++++++++++++++---
 src/backend/utils/cache/inval.c                 |  59 +++++++++++--
 src/include/access/xact.h                       |  13 ++-
 src/include/replication/reorderbuffer.h         |  11 +++
 7 files changed, 239 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index fbc5942..17c06f7 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index da32a4f..c9a64bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5997,6 +5997,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index a99fcaf..693ac01 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 481277a..03f6888 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1819,17 +1824,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1871,7 +1883,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1897,7 +1910,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2209,6 +2223,39 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without valid XID? */
+	if (xid == InvalidTransactionId)
+		return;
+
+	Assert(xid != InvalidTransactionId);
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2239,12 +2286,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2596,6 +2643,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -2741,6 +2806,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3009,6 +3079,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context,
+										   inval_size);
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	oldsnap;
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..47be680 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end
+ *  to support decoding of in-progress transactions.  Until now it was
+ *  enough to log invalidations only at commit, because transactions were
+ *  only decoded at commit time.  We only need to log catalog cache and
+ *  relcache invalidations; there cannot be any active MVCC scan in
+ *  logical decoding, so snapshot invalidations need not be logged.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1068,8 +1077,8 @@ AtEOSubXact_Inval(bool isCommit)
 
 /*
  * CommandEndInvalidationMessages
- *		Process queued-up invalidation messages at end of one command
- *		in a transaction.
+ *		Process queued-up invalidation messages at end of one command
+ *		in a transaction.
  *
  * Here, we send no messages to the shared queue, since we don't know yet if
  * we will commit.  We do need to locally process the CurrentCmdInvalidMsgs
@@ -1078,8 +1087,8 @@ AtEOSubXact_Inval(bool isCommit)
  * of the prior-cmds list.
  *
  * Note:
- *		This should be called during CommandCounterIncrement(),
- *		after we have advanced the command ID.
+ *              This should be called during CommandCounterIncrement(),
+ *              after we have advanced the command ID.
  */
 void
 CommandEndInvalidationMessages(void)
@@ -1090,7 +1099,10 @@ CommandEndInvalidationMessages(void)
 	 * just quietly return if no state to work on.
 	 */
 	if (transInvalInfo == NULL)
-		return;
+		return;
+
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, sizeof(xlrec));
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
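
For illustration, here is roughly how the decoding side is expected to
consume these records (using the xl_xact_invalidations record defined in
the xact.h hunk below). The hookup into xact_decode() is part of a later
patch in this series, so treat this switch fragment as a sketch only; the
names r, buf and ctx follow the usual decode.c conventions:

    case XLOG_XACT_INVALIDATIONS:
        {
            TransactionId xid = XLogRecGetXid(r);
            xl_xact_invalidations *invals;

            invals = (xl_xact_invalidations *) XLogRecGetData(r);

            /* queue the messages, to be executed when replaying the xact */
            if (TransactionIdIsValid(xid))
                ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
                                              invals->nmsgs, invals->msgs);
            break;
        }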
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b38..b822c5e 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+} xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..af35287 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * messages */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void		ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+										 int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1

Attachment: v12-0005-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From ae386d1de96900d2b2277f2b13f45dcb937009ce Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 10 Jan 2020 18:03:27 -0300
Subject: [PATCH v12 05/10] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds a ReorderBufferTXN pointer in two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c     |  38 +-
 src/backend/replication/logical/reorderbuffer.c | 710 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  36 ++
 3 files changed, 693 insertions(+), 91 deletions(-)
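
Condensed, the eviction decision in ReorderBufferCheckMemoryLimit() after
this patch amounts to the sketch below (logical_decoding_work_mem is in kB,
like the other *_work_mem GUCs; see the reorderbuffer.c hunks that follow
for the real code):

    if (rb->size < logical_decoding_work_mem * 1024L)
        return;

    if (ReorderBufferCanStream(rb))
        /* stream the largest toplevel transaction to the output plugin */
        ReorderBufferStreamTXN(rb, ReorderBufferLargestTopTXN(rb));
    else
        /* no streaming support, spill the largest (sub)transaction to disk */
        ReorderBufferSerializeTXN(rb, ReorderBufferLargestTXN(rb));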

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..160b167 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ddfba27..f18a31a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -773,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -865,6 +910,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -988,7 +1036,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1024,6 +1072,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1089,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1321,6 +1375,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1346,8 +1409,93 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from it's containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1355,9 +1503,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1496,63 +1641,75 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that it
+ * (or one of its subtransactions) gets aborted concurrently.  If the
+ * (sub)transaction has made catalog changes, we might then decode tuples
+ * using the wrong catalog version.  To detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction the current change
+ * belongs to.  During catalog scans we check the status of that xid; if it
+ * has aborted, we report a specific error that we can ignore.  We might have
+ * already streamed some of the changes for the aborted (sub)transaction, but
+ * that is fine: when we decode the abort, we stream an abort message that
+ * truncates the changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Setup CheckXidAlive if the transaction is not yet committed. We don't
+	 * check whether the xid aborted; that happens during catalog access.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
-	}
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 
-	/* build data to be able to lookup the CommandIds of catalog tuples */
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1568,15 +1725,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1584,6 +1746,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1593,8 +1768,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1660,7 +1833,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1681,8 +1862,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1700,7 +1879,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1758,7 +1937,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1767,10 +1954,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1801,9 +1994,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1823,7 +2016,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
 					}
 
 					break;
@@ -1863,14 +2057,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last LSN of the stream as the final_lsn before calling
+			 * stream_stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * If the transaction is streaming, remember the command ID and
+		 * snapshot; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+
+			/* Avoid copying if it's already copied. */
+			if (snapshot_now->copied)
+				txn->snapshot_now = snapshot_now;
+			else
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1889,14 +2115,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1916,18 +2150,122 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+
+				/* Avoid copying if it's already copied. */
+				if (snapshot_now->copied)
+					txn->snapshot_now = snapshot_now;
+				else
+					txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+															  txn, command_id);
+				/*
+				 * Set the last LSN of the stream as the final_lsn before
+				 * calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+
+				FlushErrorState();
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
 /*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -1951,6 +2289,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2020,6 +2365,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2155,8 +2507,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2164,6 +2525,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2175,19 +2537,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2216,6 +2587,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2306,6 +2678,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2410,6 +2789,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming is enabled, so their size is
+ * always 0). But we can simply iterate over the limited number of toplevel
+ * transactions instead.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2429,15 +2840,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2757,6 +3199,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes left to stream
+ * (it may have been streamed just before the commit, which then attempts to
+ * stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because new
+		 * sub-transactions may have appeared after the last streaming run.
+		 * So we need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e102840..6d65986 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -225,6 +244,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Have we sent any changes for this transaction in output plugin?
+	 * Have we sent any changes for this transaction to the output plugin?
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -255,6 +284,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
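
A note on the concurrent-abort handling above: the catalog-access side of
the CheckXidAlive mechanism lives in a separate patch of this series, but
the check it performs during catalog scans is roughly the following sketch.
The error code matters, because ReorderBufferProcessTXN() swallows exactly
ERRCODE_TRANSACTION_ROLLBACK in its PG_CATCH block:

    /* sketch: executed while scanning catalogs during decoding */
    if (TransactionIdIsValid(CheckXidAlive) &&
        !TransactionIdIsInProgress(CheckXidAlive) &&
        !TransactionIdDidCommit(CheckXidAlive))
        ereport(ERROR,
                (errcode(ERRCODE_TRANSACTION_ROLLBACK),
                 errmsg("transaction aborted during system catalog scan")));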

Attachment: v12-0003-Extend-the-output-plugin-API-with-stream-methods.patch (application/octet-stream)
From de2a3f48e527fd65c1ff7092a691070f0ff783f9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v12 03/10] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..64f651f 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
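
With these callbacks wired up, test_decoding output for a streamed
transaction might look roughly like this (illustrative only; the XID and
the number of changes depend on the workload):

    opening a streamed block for transaction TXN 508
    streaming change for TXN 508
    streaming change for TXN 508
    ...
    closing a streamed block for transaction TXN 508
    committing streamed transaction TXN 508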
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bce6d37..3a95fb2 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and
+    <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be interleaved blocks for multiple streamed transactions,
+    some of the transactions may get aborted, and so on (one such sequence
+    is sketched below).
+   </para>
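+
+   <para>
+    For instance, the transaction a given block belongs to is identified by
+    the <structname>ReorderBufferTXN</structname> passed to the callbacks, so
+    a hypothetical interleaved sequence for two concurrently streamed
+    transactions might look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of a block of changes (transaction #1)
+  stream_change_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of the block (transaction #1)
+
+stream_start_cb(...);   &lt;-- start of a block of changes (transaction #2)
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of the block (transaction #2)
+
+stream_abort_cb(...);   &lt;-- abort of transaction #2
+
+stream_start_cb(...);   &lt;-- start of another block (transaction #1)
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of the block (transaction #1)
+
+stream_commit_cb(...);  &lt;-- commit of transaction #1
+</programlisting>
+   </para>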
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and streamed.
+   </para>
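+
+   <para>
+    A simplified sketch of that selection (the actual implementation may
+    differ in details) is to scan the toplevel transactions and pick the one
+    using the most memory:
+<programlisting>
+/* hypothetical sketch: pick the toplevel transaction using the most memory */
+static ReorderBufferTXN *
+pick_largest_txn(ReorderBuffer *rb)
+{
+    dlist_iter  iter;
+    ReorderBufferTXN *largest = NULL;
+
+    dlist_foreach(iter, &amp;rb->toplevel_by_lsn)
+    {
+        ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+        if (largest == NULL || txn->size > largest->size)
+            largest = txn;
+    }
+
+    return largest;
+}
+</programlisting>
+   </para>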
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index e3da7d3..ec40755 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require change/commit/abort callbacks. The
+	 * message and truncate callbacks are optional, similar to regular output
+	 * plugins. We however consider streaming supported when at least one of
+	 * the callbacks is defined, so that missing required callbacks can be
+	 * easily identified in the wrappers.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * Streaming callbacks.
+	 *
+	 * stream_message and stream_truncate are optional, so we do not fail
+	 * with ERROR when they are missing; the wrappers simply do nothing. We
+	 * must still set the ReorderBuffer callbacks to something, otherwise
+	 * the calls from there would crash (we don't want to move the checks
+	 * there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -860,6 +908,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f..f24e246 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..0d0a94a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to the remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to the remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if the transaction is streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from an in-progress
+ * transaction to a remote node (may be called repeatedly, if the
+ * transaction is streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287..e102840 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -393,6 +439,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

v12-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patchapplication/octet-stream; name=v12-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patchDownload
From fc8043cb461e77195a3d08279c7fd66b1309948a Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:32:34 +0530
Subject: [PATCH v12 04/10] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by having the system table scan APIs return
the ERRCODE_TRANSACTION_ROLLBACK sqlerrcode to the backend decoding a
specific uncommitted transaction. On receipt of such an sqlerrcode,
the decoding logic aborts the ongoing decoding and returns gracefully.
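
As a rough sketch (not the exact code in this patch), a caller on the
decoding side can catch this error code with the usual PG_TRY/PG_CATCH
machinery and treat it as a clean end of decoding for that transaction:

    MemoryContext oldcontext = CurrentMemoryContext;

    PG_TRY();
    {
        /* ... decode changes of the in-progress transaction ... */
    }
    PG_CATCH();
    {
        ErrorData  *errdata;

        /* switch back so CopyErrorData() is not run in ErrorContext */
        MemoryContextSwitchTo(oldcontext);
        errdata = CopyErrorData();

        if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
        {
            /* concurrent abort detected: discard the error, stop decoding */
            FlushErrorState();
            FreeErrorData(errdata);
        }
        else
            PG_RE_THROW();
    }
    PG_END_TRY();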
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 41 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 40 ++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  8 ++---
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++++--
 src/include/utils/snapmgr.h                     |  4 ++-
 6 files changed, 115 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 3a95fb2..3a54b35 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
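+
+    <para>
+     As a minimal sketch (assuming a hypothetical
+     <literal>my_catalog_oid</literal> relation OID), such catalog access
+     could look like this:
+<programlisting>
+Relation    rel = table_open(my_catalog_oid, AccessShareLock);
+SysScanDesc scan;
+HeapTuple   tup;
+
+/* NULL snapshot: use the currently active (historic) catalog snapshot */
+scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
+
+while (HeapTupleIsValid(tup = systable_getnext(scan)))
+{
+    /* ... examine the tuple ... */
+}
+
+systable_endscan(scan);
+table_close(rel, AccessShareLock);
+</programlisting>
+    </para>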
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5a32e62..0a4d86d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1304,6 +1304,15 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with a valid
+	 * CheckXidAlive for regular tables, so check for that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1423,6 +1432,14 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with a valid
+	 * CheckXidAlive for regular tables, so check for that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1537,6 +1554,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with a
+	 * valid CheckXidAlive for regular tables, so check for that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_hot_search_buffer call during logical decoding");
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1686,6 +1711,14 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with a valid
+	 * CheckXidAlive for regular tables, so check for that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_get_latest_tid call during logical decoding");
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5491,6 +5524,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with a
+	 * valid CheckXidAlive for regular tables, so check for that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_finish_speculative call during logical decoding");
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..413a21f 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -481,6 +482,19 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  We can't directly use TransactionIdDidAbort, because after
+	 * a crash such a transaction might not have been marked as aborted.  See
+	 * the detailed comments in snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -517,6 +531,19 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  We can't directly use TransactionIdDidAbort, because after
+	 * a crash such a transaction might not have been marked as aborted.  See
+	 * the detailed comments in snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -643,6 +670,19 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  We can't directly use TransactionIdDidAbort, because after
+	 * a crash such a transaction might not have been marked as aborted.  See
+	 * the detailed comments in snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 03f6888..ddfba27 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -696,7 +696,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1552,7 +1552,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1803,7 +1803,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1823,7 +1823,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c5..93a0c04 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This allows re-checking the XID's status while accessing
+ * the catalog.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether the xid aborted; that happens during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13c..12f737b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1

v12-0007-Track-statistics-for-streaming.patchapplication/octet-stream; name=v12-0007-Track-statistics-for-streaming.patchDownload
From 65365ef3ca81aa26cfd69da49224dfeb44bf605c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 11 Feb 2020 12:10:43 +0530
Subject: [PATCH v12 07/10] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 87586a7..cb294b5 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2024,6 +2024,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber after
+      the memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+      Streaming only works with toplevel transactions (subtransactions can't
+      be streamed independently), so the counter does not get incremented for
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the subscriber.
+      Transactions may get streamed repeatedly, and this counter gets incremented
+      on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f681aaf..2ede8f3 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -787,7 +787,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f18a31a..f1e7498 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3290,6 +3294,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count a transaction that was already streamed before. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 6ee7fa2..21c4da0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1292,7 +1292,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1313,7 +1313,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions spilled to disk or streamed to the
+	 * subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2356,6 +2357,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3194,7 +3198,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3251,6 +3255,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3274,6 +3281,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3360,6 +3370,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3608,11 +3623,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 07a86c7..72ff008 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5196,9 +5196,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6d65986..603f325 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -517,15 +517,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 366828f..3888b0c 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 634f825..f9c30e8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1984,9 +1984,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
1.8.3.1

v12-0006-Add-support-for-streaming-to-built-in-replicatio.patchapplication/octet-stream; name=v12-0006-Add-support-for-streaming-to-built-in-replicatio.patchDownload
From 625e8fad4d4a1f9e7eb177d1f2edd7a3cb86cdf0 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 10 Feb 2020 11:00:48 +0530
Subject: [PATCH v12 06/10] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information
(e.g. the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transaction by spilling the data to disk and then
replaying them on commit.

We must however explicitly disable streaming during replication slot
creation, even if the plugin supports it. We don't need to replicate
the changes accumulated during this phase, and moreover we don't have
a replication connection open, so we have nowhere to send the data
anyway.
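
As a hypothetical sketch of the protocol extension (the action byte and
field layout here are illustrative; the actual format is defined in
src/backend/replication/logical/proto.c), a "stream start" message could
be encoded with the existing pq_send* primitives like this:

    /* illustrative sketch: write a "stream start" protocol message */
    static void
    write_stream_start(StringInfo out, TransactionId xid, bool first_segment)
    {
        pq_sendbyte(out, 'S');                      /* action: stream start */
        pq_sendint32(out, xid);                     /* toplevel transaction XID */
        pq_sendbyte(out, first_segment ? 1 : 0);    /* first block of this xact? */
    }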
---
 doc/src/sgml/ref/alter_subscription.sgml           |    4 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   45 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    3 +
 src/backend/replication/logical/launcher.c         |    1 -
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  140 ++-
 src/backend/replication/logical/worker.c           | 1028 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  309 +++++-
 src/backend/replication/slotfuncs.c                |    7 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2034 insertions(+), 39 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8bead..95b7c24 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..3349cc4 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index f77a83b..7d7f721 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 119a9ce..54ca2d3 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -58,7 +58,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -89,6 +90,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+	{
+		*streaming_given = false;
+		*streaming = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -174,6 +177,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -317,6 +330,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -333,7 +348,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -411,6 +426,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -668,10 +690,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -696,6 +721,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -707,7 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -745,7 +778,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -782,7 +815,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 462b4d7..32d85cb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4105,6 +4105,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
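
For reference, with streaming enabled the walreceiver then issues a command
along these lines (slot and publication names are made up):

    START_REPLICATION SLOT "mysub" LOGICAL 0/0
        (proto_version '2', streaming 'on', publication_names '"mypub"')

The option is simply forwarded to the output plugin, which validates it
against the protocol version (see the pgoutput changes below).
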
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e..8156a42 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ec40755..b5d854f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1146,7 +1146,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1191,7 +1191,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = txn->final_lsn;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..5242ac0 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,125 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
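+/*
+ * Write STREAM START to the output stream.
+ */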
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
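+/*
+ * Read STREAM START from the stream, returning the toplevel XID.
+ */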
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
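+/*
+ * Write STREAM STOP to the output stream.
+ */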
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
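+/*
+ * Write STREAM COMMIT to the output stream.
+ */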
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID of the transaction being committed (must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
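+/*
+ * Read STREAM COMMIT from the stream, filling in the commit metadata.
+ */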
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
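+/*
+ * Write STREAM ABORT to the output stream (xid == subxid for a toplevel abort).
+ */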
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* toplevel and subtransaction IDs (both must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
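+/*
+ * Read STREAM ABORT from the stream.
+ */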
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
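
To make the new message formats easier to follow, here is a self-contained
sketch of the STREAM START encoding defined above: action byte 'S', 32-bit
XID in network byte order, one-byte first-segment flag. It intentionally
uses plain htonl/memcpy instead of the pqformat helpers so it compiles
outside the backend; it's an illustration, not part of the patch:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <arpa/inet.h>

    /* encode STREAM START: 'S', xid (big-endian), first_segment flag */
    static size_t
    encode_stream_start(uint8_t *buf, uint32_t xid, int first_segment)
    {
        uint32_t    nxid = htonl(xid);
        size_t      off = 0;

        buf[off++] = 'S';
        memcpy(buf + off, &nxid, sizeof(nxid));
        off += sizeof(nxid);
        buf[off++] = first_segment ? 1 : 0;

        return off;
    }

    int
    main(void)
    {
        uint8_t     buf[6];
        uint32_t    xid;

        encode_stream_start(buf, 1234, 1);

        /* decode: action byte, then the XID, then the flag */
        memcpy(&xid, buf + 1, sizeof(xid));
        printf("action=%c xid=%u first_segment=%d\n",
               buf[0], ntohl(xid), buf[5]);
        return 0;
    }
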
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ad4a732..cc89188 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions has
+ * to cope with aborts of both the toplevel transaction and subtransactions.
+ * This is achieved by tracking offsets for subtransactions, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of the subscription.
+ * This is necessary so that different workers processing a remote transaction
+ * with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -31,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -61,6 +83,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -68,6 +91,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -107,12 +131,59 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+}			SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of XIDs for which we have serialized changes spilled to disk.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
+static bool handle_streamed_transaction(const char action, StringInfo s);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of apply_handle_stream_commit */
+static void apply_dispatch(StringInfo s);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -164,6 +235,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -530,6 +637,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the existing subxact info.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel xact, so
+	 * just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting changes for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+			return;
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware that we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -542,6 +961,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -557,6 +979,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -592,6 +1017,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -696,6 +1124,9 @@ apply_handle_update(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -833,6 +1264,9 @@ apply_handle_delete(StringInfo s)
 	bool		found;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -932,6 +1366,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1023,6 +1460,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1120,6 +1573,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleaning up files for %d streamed transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1135,6 +1604,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1583,6 +2055,561 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 *
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so we can simply ignore it (to avoid a repeated search).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so treat ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (not counting the
+ * length field itself), the action code (identifying the message type) and
+ * the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -1748,6 +2775,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
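
To make the spool-file format concrete (see stream_write_change above: an
int length covering the action byte plus payload, followed by exactly that
many bytes), here is a minimal stand-alone reader for such a file. It uses
plain stdio instead of the transient-file and wait-event machinery, purely
as an illustration:

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
        FILE       *f;
        int         len;
        int         nchanges = 0;

        if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
        {
            fprintf(stderr, "usage: %s <changes-file>\n", argv[0]);
            return 1;
        }

        /* each record: int length, then 'length' bytes (action + payload) */
        while (fread(&len, sizeof(len), 1, f) == 1)
        {
            char       *buf = malloc(len);

            if (buf == NULL || fread(buf, 1, len, f) != (size_t) len)
            {
                fprintf(stderr, "truncated or oversized record\n");
                return 1;
            }
            printf("change %d: action '%c', %d payload bytes\n",
                   ++nchanges, buf[0], len - 1);
            free(buf);
        }

        fclose(f);
        return 0;
    }
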
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7525082..11e249e 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -44,17 +44,45 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
 
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 
-/* Entry in the map used to remember which relation schemas we sent. */
+/*
+ * Entry in the map used to remember which relation schemas we sent.
+ *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on downstream is however updated only at commit time,
+ * and with streamed transactions the commit order may be different from
+ * the order the transactions are sent in. So streamed trasactions are
+ * handled separately by using schema_sent flag in ReorderBufferTXN.
+ */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 	bool		schema_sent;	/* did we send the schema? */
+	List	   *streamed_txns;	/* streamed toplevel transactions with
+								 * this schema */
 	bool		replicate_valid;
 	PublicationActions pubactions;
 } RelationSyncEntry;
@@ -63,11 +91,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -83,15 +117,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -137,6 +180,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -149,6 +209,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -171,7 +232,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -191,6 +253,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with a sufficiently recent protocol version, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -258,9 +341,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (!relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID when starting the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be aborted and never applied at all,
+	 * and they may be committed in an order we don't know at this point
+	 * (while regular transactions won't see their effects until commit).
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change, which
+		 * may occur when streaming has already started, so we have to track
+		 * new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (!schema_sent)
 	{
 		TupleDesc	desc;
 		int			i;
@@ -286,19 +408,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 				continue;
 
 			OutputPluginPrepareWrite(ctx, false);
-			logicalrep_write_typ(ctx->out, att->atttypid);
+			logicalrep_write_typ(ctx->out, xid, att->atttypid);
 			OutputPluginWrite(ctx, false);
 		}
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_rel(ctx->out, relation);
+		logicalrep_write_rel(ctx->out, xid, relation);
 		OutputPluginWrite(ctx, false);
-		relentry->schema_sent = true;
+		relentry->xid = change->txn->xid;
+
+		if (in_streaming)
+			set_schema_sent_in_streamed_txn(relentry, topxid);
+		else
+			relentry->schema_sent = true;
 	}
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -307,6 +436,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -335,14 +468,14 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
 	{
 		case REORDER_BUFFER_CHANGE_INSERT:
 			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_insert(ctx->out, relation,
+			logicalrep_write_insert(ctx->out, xid, relation,
 									&change->data.tp.newtuple->tuple);
 			OutputPluginWrite(ctx, true);
 			break;
@@ -352,7 +485,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				&change->data.tp.oldtuple->tuple : NULL;
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple,
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
 										&change->data.tp.newtuple->tuple);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -361,7 +494,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			if (change->data.tp.oldtuple)
 			{
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation,
+				logicalrep_write_delete(ctx->out, xid, relation,
 										&change->data.tp.oldtuple->tuple);
 				OutputPluginWrite(ctx, true);
 			}
@@ -387,6 +520,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -407,13 +544,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -487,6 +625,98 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
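+/*
+ * Notify downstream we're starting a new block (chunk) of a streamed
+ * transaction.
+ */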
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
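+/*
+ * Notify downstream we've finished streaming the current block (chunk).
+ */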
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -523,6 +746,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  */
 static RelationSyncEntry *
@@ -597,6 +848,36 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -631,7 +912,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
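
As a worked example of this bookkeeping (XIDs made up): when streamed
transaction 1000 first touches a relation, maybe_send_schema() sends the
RELATION message and records 1000 in that entry's streamed_txns list. If
1000 commits, cleanup_rel_sync_cache() scans the cache, sets schema_sent
and drops 1000 from the lists; if it aborts, only the list entries are
dropped, so the schema will be sent again for the next transaction
touching the relation.
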
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2c9d5de..30db2c2 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -147,6 +147,12 @@ create_logical_replication_slot(char *name, char *plugin,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index abb533b..6ee7fa2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -968,6 +968,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3a65a51..ed2e43a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -945,7 +945,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 2cc2dc4..277e44c 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -85,25 +89,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6..0ebd140 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -169,6 +169,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: v12-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
From 210dd338ba977043e2588c7ac9315d2ed1b40242 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v12 08/10] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71..086d0c7 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e..ad3ed13 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

Attachment: v12-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
From 090eb648bbd5e5a0bbaa04dc61476009e8d3bec1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v12 09/10] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check data replicated correctly through DDL and rollbacks');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: v12-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
From 72bc3bb31c31ee94e6aca9577632155d6657e23f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 28 Feb 2020 11:07:46 +0530
Subject: [PATCH v12 10/10] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c                |   3 +
 src/backend/replication/logical/decode.c        |  17 ++-
 src/backend/replication/logical/reorderbuffer.c | 145 +++++++++++++-----------
 src/include/access/heapam_xlog.h                |   1 +
 src/include/replication/reorderbuffer.h         |  17 ++-
 5 files changed, 110 insertions(+), 73 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0a4d86d..b3ab8c6 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2018,6 +2018,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 693ac01..b7d507c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -729,7 +729,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -796,7 +798,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -853,7 +856,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -889,7 +893,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -994,7 +998,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1032,7 +1036,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f1e7498..b53417e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -654,7 +654,7 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
@@ -668,6 +668,28 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert, set the corresponding bit.  Otherwise, if
+	 * the toast-insert bit is already set and this is an insert/update (which
+	 * completes the change), clear the bit.
+	 */
+	if (toast_insert)
+		txn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) &&
+			 ((change->action == REORDER_BUFFER_CHANGE_INSERT) ||
+			 (change->action == REORDER_BUFFER_CHANGE_UPDATE)))
+		txn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * If this is a speculative insert, set the corresponding bit.  Otherwise,
+	 * if the speculative-insert bit is set and this is the spec-confirm
+	 * record, clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		txn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+		txn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
@@ -700,7 +722,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1867,8 +1889,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
+							ReorderBufferToastAppendChunk(rb, txn, relation,
+														  change);
 					}
 
 			change_done:
@@ -2458,7 +2480,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2507,7 +2529,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2530,6 +2552,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2545,7 +2568,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 	/* if subxact, and streaming supported, use the toplevel instead */
 	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+		toptxn = txn->toptxn;
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2553,12 +2576,16 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+		if (toptxn)
+			toptxn->size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+		if (toptxn)
+			toptxn->size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2625,7 +2652,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2812,15 +2839,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and doesn't have incomplete
+		 * data (toast chunks or speculative inserts), remember it.
+		 */
+		if (((!largest) || (txn->size > largest->size)) &&
+			((txn->size > 0) && !rbtxn_has_toast_insert(txn) &&
+			 !rbtxn_has_spec_insert(txn)))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2838,66 +2866,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/* Loop until we are back under the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* found a suitable toplevel transaction with some decoded changes */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
 		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
-
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
-
-		ReorderBufferSerializeTXN(rb, txn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 603f325..9a4f886 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* Do this transaction's changes include a toast insert without the main table insert? */
+#define rbtxn_has_toast_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)
+
+/*
+ * Do this transaction's changes include a speculative insert without the
+ * speculative confirm?
+ */
+#define rbtxn_has_spec_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -545,7 +558,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool incomplete_data);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

#228Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dilip Kumar (#227)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi,

I started looking at this patch series again, hoping to get it moving
for PG13. There's been a tremendous amount of work done since I last
worked on it, and a lot was discussed on this thread, so it'll take a
while to get familiar with the new code ...

The first thing I realized is that WAL-logging of assignments in v12 does
both the "old" logging (using dedicated message) and "new" with
toplevel-XID embedded in the first message. Yes, the patch was wrong,
because it eliminated all calls to ProcArrayApplyXidAssignment() and so
it was trivial to crash the replica due to KnownAssignedXids overflow.
But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
right fix.

I actually proposed doing this (having both ways to log assignments) so
that there's no regression risk with (wal_level < logical). But IIRC
Andres objected to it, arguing that we should not log the same piece
of information in two very different ways at the same time (IIRC it was
discussed on the FOSDEM dev meeting, so I don't have a link to share).
And I do agree with him ...

The question is, why couldn't the replica use the same assignment info
we already write for logical decoding? The main challenge is that now
the assignment can be sent in many different xlog messages, from a bunch
of resource managers (essentially, any xlog message with a xid can have
embedded XID of the toplevel xact). So the handling would either need to
happen in every rmgr, or we need to move it before we call the rmgr.

For example, we might do this e.g. in StartupXLOG() I think, per the
attached patch (FWIW this particular fix was written by Masahiko Sawada,
not me). This does the trick for me - I'm no longer able to reproduce
the KnownAssignedXids overflow.

The one difference is that we used to call ProcArrayApplyXidAssignment
for larger groups of XIDs, as sent in the assignment message. Now we
call it for each individual assignment. I don't know if this is an
issue, but I suppose we might introduce some sort of local caching
(accumulate the assignments into a local array, call the function only
when we have enough of them).
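
A minimal sketch of that caching, assuming a small static batch (the
names and the batch size here are hypothetical, not part of the
attached fix):

#define XID_ASSIGNMENT_BATCH_SIZE	64

static TransactionId batchTopXid = InvalidTransactionId;
static TransactionId batchSubXids[XID_ASSIGNMENT_BATCH_SIZE];
static int	nBatchSubXids = 0;

static void
CacheXidAssignment(TransactionId topXid, TransactionId subXid)
{
	/*
	 * Flush the batch through the existing ProcArrayApplyXidAssignment
	 * when the top-level xact changes or when the batch fills up.
	 */
	if (TransactionIdIsValid(batchTopXid) &&
		(batchTopXid != topXid ||
		 nBatchSubXids == XID_ASSIGNMENT_BATCH_SIZE))
	{
		ProcArrayApplyXidAssignment(batchTopXid, nBatchSubXids,
									batchSubXids);
		nBatchSubXids = 0;
	}

	batchTopXid = topXid;
	batchSubXids[nBatchSubXids++] = subXid;
}

That would restore the batched behavior without re-introducing the
assignment record itself.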

Aside from that, I think there's a minor bug in xact.c - the patch adds
a "assigned" field to TransactionStateData, but then it fails to add a
default value into TopTransactionStateData. We probably interpret NULL
as false, but then there's nothing for the pointer. I suspect it might
leave some random garbage there, leading to strange things later.

Another thing I noticed is LogicalDecodingProcessRecord() extracts the
toplevel XID using a macro

txid = XLogRecGetTopXid(record);

but then it just starts accessing the fields directly again in the
ReorderBufferAssignChild call. I think we should do this instead:

ReorderBufferAssignChild(ctx->reorder,
						 txid,
						 XLogRecGetXid(record),
						 buf.origptr);

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#229Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#228)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

D'oh! As usual I forgot to actually attach the patch I mentioned. So
here it is ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

xid-assignment-v12-fix.patch (text/plain; charset=us-ascii)
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7574,6 +7574,19 @@ StartupXLOG(void)
                    LWLockRelease(XidGenLock);
                }
+               /*
+                * Assign subtransaction xids to the top level xid if the
+                * record has that information. This is required at most
+                * once per subtransaction.
+                */
+               if (TransactionIdIsValid(xlogreader->toplevel_xid) &&
+                   standbyState >= STANDBY_INITIALIZED)
+               {
+                   Assert(XLogStandbyInfoActive());
+                   ProcArrayApplyXidAssignment(xlogreader->toplevel_xid,
+                                               1, &record->xl_xid);
+               }
+
                /*
                 * Before replaying this record, check if this record causes
                 * the current timeline to change. The record is already
#230Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#228)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

I started looking at this patch series again, hoping to get it moving
for PG13.

Nice.

There's been a tremendous amount of work done since I last
worked on it, and a lot was discussed on this thread, so it'll take a
while to get familiar with the new code ...

The first thing I realized is that WAL-logging of assignments in v12 does
both the "old" logging (using dedicated message) and "new" with
toplevel-XID embedded in the first message. Yes, the patch was wrong,
because it eliminated all calls to ProcArrayApplyXidAssignment() and so
it was trivial to crash the replica due to KnownAssignedXids overflow.
But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
right fix.

I actually proposed doing this (having both ways to log assignments) so
that there's no regression risk with (wal_level < logical). But IIRC
Andres objected to it, arguing that we should not log the same piece
of information in two very different ways at the same time (IIRC it was
discussed on the FOSDEM dev meeting, so I don't have a link to share).
And I do agree with him ...

The question is, why couldn't the replica use the same assignment info
we already write for logical decoding? The main challenge is that now
the assignment can be sent in many different xlog messages, from a bunch
of resource managers (essentially, any xlog message with a xid can have
embedded XID of the toplevel xact). So the handling would either need to
happen in every rmgr, or we need to move it before we call the rmgr.

For example, we might do this e.g. in StartupXLOG() I think, per the
attached patch (FWIW this particular fix was written by Masahiko Sawada,
not me). This does the trick for me - I'm no longer able to reproduce
the KnownAssignedXids overflow.

The one difference is that we used to call ProcArrayApplyXidAssignment
for larger groups of XIDs, as sent in the assignment message. Now we
call it for each individual assignment. I don't know if this is an
issue, but I suppose we might introduce some sort of local caching
(accumulate the assignments into a local array, call the function only
when we have enough of them).

Thanks for the pointers, I will think over these points.

Aside from that, I think there's a minor bug in xact.c - the patch adds
a "assigned" field to TransactionStateData, but then it fails to add a
default value into TopTransactionStateData. We probably interpret NULL
as false, but then there's nothing for the pointer. I suspect it might
leave some random garbage there, leading to strange things later.

Actually, we will never access that field for
TopTransactionStateData, right?
See the code below: we check IsSubTransaction() first, and only then
do we access the "assigned" field.

+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+}

Another thing I noticed is LogicalDecodingProcessRecord() extracts the
toplevel XID using a macro

txid = XLogRecGetTopXid(record);

but then it just starts accessing the fields directly again in the
ReorderBufferAssignChild call. I think we should do this instead:

ReorderBufferAssignChild(ctx->reorder,
						 txid,
						 XLogRecGetXid(record),
						 buf.origptr);

Make sense. I will change this in the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#231Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#228)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

I started looking at this patch series again, hoping to get it moving
for PG13.

It is good to keep moving this forward, but there are quite a few
problems with the design which need a broader discussion. Some of
what I recall are:
a. Handling of abort of concurrent transactions. There is some code
in the patch which might work, but there was not much discussion when
it was posted.
b. Handling of partial tuples (while streaming, we came to know that
toast tuple is not complete or speculative insert is incomplete). For
this also, we have proposed a few solutions which need further
discussion. One of those is implemented in the patch series.
c. We might also need some handling for replication origins.
d. Try to minimize the performance overhead of WAL logging for
invalidations. We discussed different solutions for this and
implemented one of those.
e. How to skip already streamed transactions.

There might be a few more which I can't recall now. Apart from this,
I haven't done any detailed review of subscriber-side implementation
where we write streamed transactions to file. All of this will need
much more discussion and review before we can say it is ready to
commit, so I thought it might be better to pick it up for PG14 and
focus on other things that have a better chance for PG13 especially
because all the problems were not solved/discussed before last CF.
However, it is a good idea to keep moving this and have a discussion
on some of these issues.

There's been a tremendous amount of work done since I last
worked on it, and a lot was discussed on this thread, so it'll take a
while to get familiar with the new code ...

The first thing I realized is that WAL-logging of assignments in v12 does
both the "old" logging (using dedicated message) and "new" with
toplevel-XID embedded in the first message. Yes, the patch was wrong,
because it eliminated all calls to ProcArrayApplyXidAssignment() and so
it was trivial to crash the replica due to KnownAssignedXids overflow.
But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
right fix.

I actually proposed doing this (having both ways to log assignments) so
that there's no regression risk with (wal_level < logical). But IIRC
Andres objected to it, arguing that we should not log the same piece
of information in two very different ways at the same time (IIRC it was
discussed on the FOSDEM dev meeting, so I don't have a link to share).
And I do agree with him ...

So, aren't we worried about the overhead of the additional WAL and
its performance impact on transactions? We might want to check the
pgbench read-write test to see if that will add any significant
overhead.

The question is, why couldn't the replica use the same assignment info
we already write for logical decoding?

I haven't thought about it in detail, but we can think on those lines
if the performance overhead is in the acceptable range.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#232Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#231)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Mar 4, 2020 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The first thing I realized is that WAL-logging of assignments in v12 does
both the "old" logging (using dedicated message) and "new" with
toplevel-XID embedded in the first message. Yes, the patch was wrong,
because it eliminated all calls to ProcArrayApplyXidAssignment() and so
it was trivial to crash the replica due to KnownAssignedXids overflow.
But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
right fix.

I actually proposed doing this (having both ways to log assignments) so
that there's no regression risk with (wal_level < logical). But IIRC
Andres objected to it, arguing that we should not log the same piece
of information in two very different ways at the same time (IIRC it was
discussed on the FOSDEM dev meeting, so I don't have a link to share).
And I do agree with him ...

So, aren't we worried about the overhead of the additional WAL and
its performance impact on transactions? We might want to check the
pgbench read-write test to see if that will add any significant
overhead.

I have briefly looked at the original patch and it seems the
additional overhead is only when subtransactions are involved, so
ideally, it shouldn't impact default pgbench, but there is no harm in
checking. It might be that we need to build a custom script with
subtransactions involved to measure the impact, but I think it is
worth checking.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#233Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#232)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Mar 4, 2020 at 2:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 4, 2020 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The first thing I realized is that WAL-logging of assignments in v12 does
both the "old" logging (using dedicated message) and "new" with
toplevel-XID embedded in the first message. Yes, the patch was wrong,
because it eliminated all calls to ProcArrayApplyXidAssignment() and so
it was trivial to crash the replica due to KnownAssignedXids overflow.
But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
right fix.

I actually proposed doing this (having both ways to log assignments) so
that there's no regression risk with (wal_level < logical). But IIRC
Andres objected to it, arguing that we should not log the same piece
of information in two very different ways at the same time (IIRC it was
discussed on the FOSDEM dev meeting, so I don't have a link to share).
And I do agree with him ...

So, aren't we worried about the overhead of the additional WAL and
its performance impact on transactions? We might want to check the
pgbench read-write test to see if that will add any significant
overhead.

I have briefly looked at the original patch and it seems the
additional overhead is only when subtransactions are involved, so
ideally, it shouldn't impact default pgbench, but there is no harm in
checking. It might be that we need to build a custom script with
subtransactions involved to measure the impact, but I think it is
worth checking.

I agree. I will test the same and post the results.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#234Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#231)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Mar 04, 2020 at 10:28:32AM +0530, Amit Kapila wrote:

On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

I started looking at this patch series again, hoping to get it moving
for PG13.

It is good to keep moving this forward, but there are quite a few
problems with the design which need a broader discussion. Some of
what I recall are:
a. Handling of abort of concurrent transactions. There is some code
in the patch which might work, but there was not much discussion when
it was posted.
b. Handling of partial tuples (while streaming, we came to know that
toast tuple is not complete or speculative insert is incomplete). For
this also, we have proposed a few solutions which need further
discussion. One of those is implemented in the patch series.
c. We might also need some handling for replication origins.
d. Try to minimize the performance overhead of WAL logging for
invalidations. We discussed different solutions for this and
implemented one of those.
e. How to skip already streamed transactions.

There might be a few more which I can't recall now. Apart from this,
I haven't done any detailed review of subscriber-side implementation
where we write streamed transactions to file. All of this will need
much more discussion and review before we can say it is ready to
commit, so I thought it might be better to pick it up for PG14 and
focus on other things that have a better chance for PG13 especially
because all the problems were not solved/discussed before last CF.
However, it is a good idea to keep moving this and have a discussion
on some of these issues.

Sure, there's a lot to discuss. And it's possible (likely) it's not
feasible to get this into PG13. But I think it's still worth discussing
it, instead of just punting it into the next CF right away.

There's been a tremendous amount of work done since I last
worked on it, and a lot was discussed on this thread, so it'll take a
while to get familiar with the new code ...

The first thing I realized is that WAL-logging of assignments in v12 does
both the "old" logging (using dedicated message) and "new" with
toplevel-XID embedded in the first message. Yes, the patch was wrong,
because it eliminated all calls to ProcArrayApplyXidAssignment() and so
it was trivial to crash the replica due to KnownAssignedXids overflow.
But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
right fix.

I actually proposed doing this (having both ways to log assignments) so
that there's no regression risk with (wal_level < logical). But IIRC
Andres objected to it, arguing that we should not log the same piece
of information in two very different ways at the same time (IIRC it was
discussed on the FOSDEM dev meeting, so I don't have a link to share).
And I do agree with him ...

So, aren't we worried about the overhead of the additional WAL and
its performance impact on transactions? We might want to check the
pgbench read-write test to see if that will add any significant
overhead.

Well, sure. I agree we need to see how this affects performance, and
I'll do some benchmarks (I think I did that when submitting the patch,
but I don't recall the numbers / details).

Isn't it a bit strange to log stuff twice, though, if we worry about
performance? Surely that's more expensive than logging it just once. Of
course, it might be useful if most systems need just the "old" way.

I know it's going to be a bit hand-wavy, but I think embedding the
assignments into existing WAL messages is about the cheapest way to log
this. I would not expect this to be measurably more expensive than what
we have now, but I might be wrong.

The question is, why couldn't the replica use the same assignment info
we already write for logical decoding?

I haven't thought about it in detail, but we can think on those lines
if the performance overhead is in the acceptable range.

OK, let me do some measurements ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#235Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dilip Kumar (#230)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Mar 04, 2020 at 09:13:49AM +0530, Dilip Kumar wrote:

On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

I started looking at this patch series again, hoping to get it moving
for PG13.

Nice.

There's been a tremendous amount of work done since I last
worked on it, and a lot was discussed on this thread, so it'll take a
while to get familiar with the new code ...

The first thing I realized is that WAL-logging of assignments in v12 does
both the "old" logging (using dedicated message) and "new" with
toplevel-XID embedded in the first message. Yes, the patch was wrong,
because it eliminated all calls to ProcArrayApplyXidAssignment() and so
it was trivial to crash the replica due to KnownAssignedXids overflow.
But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
right fix.

I actually proposed doing this (having both ways to log assignments) so
that there's no regression risk with (wal_level < logical). But IIRC
Andres objected to it, arguing that we should not log the same piece
of information in two very different ways at the same time (IIRC it was
discussed on the FOSDEM dev meeting, so I don't have a link to share).
And I do agree with him ...

The question is, why couldn't the replica use the same assignment info
we already write for logical decoding? The main challenge is that now
the assignment can be sent in many different xlog messages, from a bunch
of resource managers (essentially, any xlog message with a xid can have
embedded XID of the toplevel xact). So the handling would either need to
happen in every rmgr, or we need to move it before we call the rmgr.

For example, we might do this e.g. in StartupXLOG() I think, per the
attached patch (FWIW this particular fix was written by Masahiko Sawada,
not me). This does the trick for me - I'm no longer able to reproduce
the KnownAssignedXids overflow.

The one difference is that we used to call ProcArrayApplyXidAssignment
for larger groups of XIDs, as sent in the assignment message. Now we
call it for each individual assignment. I don't know if this is an
issue, but I suppose we might introduce some sort of local caching
(accumulate the assignments into a local array, call the function only
when we have enough of them).

Thanks for the pointers, I will think over these points.

Aside from that, I think there's a minor bug in xact.c - the patch adds
a "assigned" field to TransactionStateData, but then it fails to add a
default value into TopTransactionStateData. We probably interpret NULL
as false, but then there's nothing for the pointer. I suspect it might
leave some random garbage there, leading to strange things later.

Actually, we will never access that field for
TopTransactionStateData, right?
See the code below: we check IsSubTransaction() first, and only then
do we access the "assigned" field.

+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+}

The problem is not with the "assigned" field, really. AFAICS we probably
initialize it to false because we interpret NULL as false. My concern
was that we essentially leave the last pointer uninitialized. That
seems like a bug, though I'm not sure whether it breaks anything in practice.

Another thing I noticed is LogicalDecodingProcessRecord() extracts the
toplevel XID using a macro

txid = XLogRecGetTopXid(record);

but then it just starts accessing the fields directly again in the
ReorderBufferAssignChild call. I think we should do this instead:

ReorderBufferAssignChild(ctx->reorder,
						 txid,
						 XLogRecGetXid(record),
						 buf.origptr);

Make sense. I will change this in the patch.

+1, thanks

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#236Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#234)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Mar 5, 2020 at 11:20 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Mar 04, 2020 at 10:28:32AM +0530, Amit Kapila wrote:

Sure, there's a lot to discuss. And it's possible (likely) it's not
feasible to get this into PG13. But I think it's still worth discussing
it, instead of just punting it into the next CF right away.

That makes sense to me.

There's been a tremendous amount of work done since I last
worked on it, and a lot was discussed on this thread, so it'll take a
while to get familiar with the new code ...

The first thing I realized is that WAL-logging of assignments in v12 does
both the "old" logging (using dedicated message) and "new" with
toplevel-XID embedded in the first message. Yes, the patch was wrong,
because it eliminated all calls to ProcArrayApplyXidAssignment() and so
it was trivial to crash the replica due to KnownAssignedXids overflow.
But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
right fix.

I actually proposed doing this (having both ways to log assignments) so
that there's no regression risk with (wal_level < logical). But IIRC
Andres objected to it, arguing that we should not log the same piece
of information in two very different ways at the same time (IIRC it was
discussed on the FOSDEM dev meeting, so I don't have a link to share).
And I do agree with him ...

So, aren't we worried about the overhead of the additional WAL and
its performance impact on transactions? We might want to check the
pgbench read-write test to see if that will add any significant
overhead.

Well, sure. I agree we need to see how this affects performance, and
I'll do some benchmarks (I think I did that when submitting the patch,
but I don't recall the numbers / details).

Isn't it a bit strange to log stuff twice, though, if we worry about
performance? Surely that's more expensive than logging it just once. Of
course, it might be useful if most systems need just the "old" way.

I know it's going to be a bit hand-wavy, but I think embedding the
assignments into existing WAL messages is about the cheapest way to log
this. I would not expect this to be measurably more expensive than what
we have now, but I might be wrong.

I agree that this shouldn't be very expensive, but it is better to be
sure in that regard.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#237Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#230)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Mar 4, 2020 at 9:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The first thing I realized is that WAL-logging of assignments in v12 does
both the "old" logging (using dedicated message) and "new" with
toplevel-XID embedded in the first message. Yes, the patch was wrong,
because it eliminated all calls to ProcArrayApplyXidAssignment() and so
it was trivial to crash the replica due to KnownAssignedXids overflow.
But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
right fix.

I actually proposed doing this (having both ways to log assignments) so
that there's no regression risk with (wal_level < logical). But IIRC
Andres objected to it, arguing that we should not log the same piece
of information in two very different ways at the same time (IIRC it was
discussed on the FOSDEM dev meeting, so I don't have a link to share).
And I do agree with him ...

The question is, why couldn't the replica use the same assignment info
we already write for logical decoding? The main challenge is that now
the assignment can be sent in many different xlog messages, from a bunch
of resource managers (essentially, any xlog message with a xid can have
embedded XID of the toplevel xact). So the handling would either need to
happen in every rmgr, or we need to move it before we call the rmgr.

For example, we might do this e.g. in StartupXLOG() I think, per the
attached patch (FWIW this particular fix was written by Masahiko Sawada,
not me). This does the trick for me - I'm no longer able to reproduce
the KnownAssignedXids overflow.

The one difference is that we used to call ProcArrayApplyXidAssignment
for larger groups of XIDs, as sent in the assignment message. Now we
call it for each individual assignment. I don't know if this is an
issue, but I suppose we might introduce some sort of local caching
(accumulate the assignments into a local array, call the function only
when we have enough of them).

Thanks for the pointers, I will think over these points.

I have looked at the solution proposed and I would like to share my
findings. I think calling ProcArrayApplyXidAssignment for each
subtransaction is not a good idea for a couple of reasons:
(a) It will just defeat the purpose of maintaining the KnownAssignedXids
array which is to avoid looking at pg_subtrans in
TransactionIdIsInProgress() on standby. Basically, if we remove it
for each subXid, it will consider the KnownAssignedXids to be
overflowed and check pg_subtrans frequently.
(b) Calling ProcArrayApplyXidAssignment() for each subtransaction can
be costly from the perspective of concurrency because it acquires
ProcArrayLock in Exclusive mode, so concurrently running transactions
might start blocking on this lock. Also, I see that
SubTransSetParent() makes the page dirty, so it might lead to more
writes if we spread out setting that by calling it separately for each
sub-transaction.

Apart from this, I don't see how the proposed fix is correct because
as far as I can see it tries to remove the Xid before we even record
it via RecordKnownAssignedTransactionIds(). It seems that, with the patch,
RecordKnownAssignedTransactionIds() will be called after
ProcArrayApplyXidAssignment(); how could that be correct?

Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#238Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#237)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 4, 2020 at 9:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The first thing I realized is that WAL-logging of assignments in v12 does
both the "old" logging (using dedicated message) and "new" with
toplevel-XID embedded in the first message. Yes, the patch was wrong,
because it eliminated all calls to ProcArrayApplyXidAssignment() and so
it was trivial to crash the replica due to KnownAssignedXids overflow.
But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
right fix.

I actually proposed doing this (having both ways to log assignments) so
that there's no regression risk with (wal_level < logical). But IIRC
Andres objected to it, arguing that we should not log the same piece
of information in two very different ways at the same time (IIRC it was
discussed on the FOSDEM dev meeting, so I don't have a link to share).
And I do agree with him ...

The question is, why couldn't the replica use the same assignment info
we already write for logical decoding? The main challenge is that now
the assignment can be sent in many different xlog messages, from a bunch
of resource managers (essentially, any xlog message with a xid can have
embedded XID of the toplevel xact). So the handling would either need to
happen in every rmgr, or we need to move it before we call the rmgr.

For example, we might do this e.g. in StartupXLOG() I think, per the
attached patch (FWIW this particular fix was written by Masahiko Sawada,
not me). This does the trick for me - I'm no longer able to reproduce
the KnownAssignedXids overflow.

The one difference is that we used to call ProcArrayApplyXidAssignment
for larger groups of XIDs, as sent in the assignment message. Now we
call it for each individual assignment. I don't know if this is an
issue, but I suppose we might introduce some sort of local caching
(accumulate the assignments into a local array, call the function only
when we have enough of them).

Thanks for the pointers, I will think over these points.

I have looked at the solution proposed and I would like to share my
findings. I think calling ProcArrayApplyXidAssignment for each
subtransaction is not a good idea for a couple of reasons:
(a) It will just defeat the purpose of maintaining the KnownAssignedXids
array which is to avoid looking at pg_subtrans in
TransactionIdIsInProgress() on standby. Basically, if we remove it
for each subXid, it will consider the KnownAssignedXids to be
overflowed and check pg_subtrans frequently.

Right, I also think this is a problem with this solution. I think we
may try to avoid this by caching this information. But, then we will
have to maintain this in some dimensional array which stores
sub-transaction ids per top transaction or we can maintain a list of
sub-transaction for each transaction. I haven't thought about how
much complexity this solution will add.

(b) Calling ProcArrayApplyXidAssignment() for each subtransaction can
be costly from the perspective of concurrency because it acquires
ProcArrayLock in Exclusive mode, so concurrently running transactions
might start blocking on this lock.

Right

Also, I see that SubTransSetParent() makes the page dirty, so it might lead to more
writes if we spread out setting that by calling it separately for each
sub-transaction.

Right.

Apart from this, I don't see how the proposed fix is correct because
as far as I can see it tries to remove the Xid before we even record
it via RecordKnownAssignedTransactionIds(). It seems that, with the patch,
RecordKnownAssignedTransactionIds() will be called after
ProcArrayApplyXidAssignment(); how could that be correct?

Valid point.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#239Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#238)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have looked at the solution proposed and I would like to share my
findings. I think calling ProcArrayApplyXidAssignment for each
subtransaction is not a good idea for a couple of reasons:
(a) It will just defeat the purpose of maintaining the KnownAssignedXids
array which is to avoid looking at pg_subtrans in
TransactionIdIsInProgress() on standby. Basically, if we remove it
for each subXid, it will consider the KnownAssignedXids to be
overflowed and check pg_subtrans frequently.

Right, I also think this is a problem with this solution. I think we
may try to avoid this by caching this information. But, then we will
have to maintain this in some dimensional array which stores
sub-transaction ids per top transaction or we can maintain a list of
sub-transaction for each transaction. I haven't thought about how
much complexity this solution will add.

How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
flag in TransactionStateData and then log that as special information
whenever we write the next WAL record for a new subtransaction? Then
during recovery, we call ProcArrayApplyXidAssignment only when we
find that special flag set in a WAL record. One idea could be to
use a flag bit in XLogRecord.xl_info. If that is feasible then the
solution can work as it is now, without any overhead or change in the
way we maintain KnownAssignedXids.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#240Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#239)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote:

On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have looked at the solution proposed and I would like to share my
findings. I think calling ProcArrayApplyXidAssignment for each
subtransaction is not a good idea for a couple of reasons:
(a) It will just defeat the purpose of maintaining the KnownAssignedXids
array which is to avoid looking at pg_subtrans in
TransactionIdIsInProgress() on standby. Basically, if we remove it
for each subXid, it will consider the KnownAssignedXids to be
overflowed and check pg_subtrans frequently.

Right, I also think this is a problem with this solution. I think we
may try to avoid this by caching this information. But, then we will
have to maintain this in some dimensional array which stores
sub-transaction ids per top transaction or we can maintain a list of
sub-transaction for each transaction. I haven't thought about how
much complexity this solution will add.

How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
flag in TransactionStateData and then log that as special information
whenever we write the next WAL record for a new subtransaction? Then
during recovery, we call ProcArrayApplyXidAssignment only when we
find that special flag set in a WAL record. One idea could be to
use a flag bit in XLogRecord.xl_info. If that is feasible then the
solution can work as it is now, without any overhead or change in the
way we maintain KnownAssignedXids.

Ummm, how is that different from what the patch is doing now? I mean, we
only write the top-level XID for the first WAL record in each subxact,
right? Or what would be the difference with your approach?

Anyway, I think you're right that the ProcArrayApplyXidAssignment call was
done too early, but I think that can be fixed by moving it after
the RecordKnownAssignedTransactionIds call, no? Essentially, right
before rm_redo().

You're right that calling ProcArrayApplyXidAssignment() may be an issue,
because it exclusively acquires the ProcArrayLock. I've actually hinted
that might be an issue in my original message, suggesting we might add a
local cache of assigned XIDs (a small static array, doing essentially
the same thing we used to do on the upstream node). I haven't done that
in my WIP patch to keep it simple, but AFAICS it'd work.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#241Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#240)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote:

On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
flag in TransactionStateData and then log that as special information
whenever we write the next WAL record for a new subtransaction? Then
during recovery, we call ProcArrayApplyXidAssignment only when we
find that special flag set in a WAL record. One idea could be to
use a flag bit in XLogRecord.xl_info. If that is feasible then the
solution can work as it is now, without any overhead or change in the
way we maintain KnownAssignedXids.

Ummm, how is that different from what the patch is doing now? I mean, we
only write the top-level XID for the first WAL record in each subxact,
right? Or what would be the difference with your approach?

We have to do what the patch is currently doing and additionally, we
will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow
us to call ProcArrayApplyXidAssignment during WAL replay only after
PGPROC_MAX_CACHED_SUBXIDS number of subxacts. It will help us in
clearing the KnownAssignedXids at the same time as we do now, so no
additional performance overhead.
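
Roughly, the primary-side logic might look like this sketch (purely
illustrative: the xl_info bit and the counter field don't exist today,
'info' stands for the record's flags byte in the WAL-insert path, and
MarkSubTransactionAssigned() stands for whatever the patch uses to flip
the 'assigned' flag):

/* hypothetical spare bit in XLogRecord.xl_info */
#define XLR_APPLY_XID_ASSIGNMENT	0x08

if (IsSubTransactionAssignmentPending())
{
	MarkSubTransactionAssigned();	/* embed toplevel XID, as the patch does */

	/* 'nassignedsubxids' is an illustrative counter on the top-level xact */
	if (++TopTransactionStateData.nassignedsubxids == PGPROC_MAX_CACHED_SUBXIDS)
	{
		info |= XLR_APPLY_XID_ASSIGNMENT;	/* mark this WAL record */
		TopTransactionStateData.nassignedsubxids = 0;
	}
}

During replay, ProcArrayApplyXidAssignment would then be invoked only
for records carrying that bit.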

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#242Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#241)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote:

On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote:

On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
flag in TransactionStateData and then log that as special information
whenever we write the next WAL record for a new subtransaction? Then
during recovery, we call ProcArrayApplyXidAssignment only when we
find that special flag set in a WAL record. One idea could be to
use a flag bit in XLogRecord.xl_info. If that is feasible then the
solution can work as it is now, without any overhead or change in the
way we maintain KnownAssignedXids.

Ummm, how is that different from what the patch is doing now? I mean, we
only write the top-level XID for the first WAL record in each subxact,
right? Or what would be the difference with your approach?

We have to do what the patch is currently doing and additionally, we
will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow
us to call ProcArrayApplyXidAssignment during WAL replay only after
PGPROC_MAX_CACHED_SUBXIDS number of subxacts. It will help us in
clearing the KnownAssignedXids at the same time as we do now, so no
additional performance overhead.

Hmmm. So we'd still log assignment twice? Or would we keep just the
immediate assignments (embedded into xlog records), and cache the
subxids on the replica somehow?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#243Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#242)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Mar 29, 2020 at 9:01 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote:

On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Ummm, how is that different from what the patch is doing now? I mean, we
only write the top-level XID for the first WAL record in each subxact,
right? Or what would be the difference with your approach?

We have to do what the patch is currently doing and additionally, we
will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow
us to call ProcArrayApplyXidAssignment during WAL replay only after
PGPROC_MAX_CACHED_SUBXIDS number of subxacts. It will help us in
clearing the KnownAssignedXids at the same time as we do now, so no
additional performance overhead.

Hmmm. So we'd still log assignment twice? Or would we keep just the
immediate assignments (embedded into xlog records), and cache the
subxids on the replica somehow?

I think we need to cache the subxids on the replica somehow but I
don't have a very good idea for it. Basically, there are two ways to
do it: (a) change the KnownAssignedXids in some way so that we can
easily find this information without losing its current benefits.
I can't think of a good way to do that, and even if we come up
with something, it could easily be a lot of work; (b) cache the
subxids for a particular transaction in local memory along with
KnownAssignedXids. This is doable, but then we have two data structures
(one in shared memory and the other in local memory) managing the same
information in different ways.

Do you have any other ideas?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#244Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#243)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:

On Sun, Mar 29, 2020 at 9:01 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote:

On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Ummm, how is that different from what the patch is doing now? I mean, we
only write the top-level XID for the first WAL record in each subxact,
right? Or what would be the difference with your approach?

We have to do what the patch is currently doing and additionally, we
will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow
us to call ProcArrayApplyXidAssignment during WAL replay only after
PGPROC_MAX_CACHED_SUBXIDS number of subxacts. It will help us in
clearing the KnownAssignedXids at the same time as we do now, so no
additional performance overhead.

Hmmm. So we'd still log assignment twice? Or would we keep just the
immediate assignments (embedded into xlog records), and cache the
subxids on the replica somehow?

I think we need to cache the subxids on the replica somehow but I
don't have a very good idea for it. Basically, there are two ways to
do it: (a) change the KnownAssignedXids in some way so that we can
easily find this information without losing its current benefits.
I can't think of a good way to do that, and even if we come up
with something, it could easily be a lot of work; (b) cache the
subxids for a particular transaction in local memory along with
KnownAssignedXids. This is doable, but then we have two data structures
(one in shared memory and the other in local memory) managing the same
information in different ways.

Do you have any other ideas?

I don't follow. Why couldn't we have a simple cache on the standby? It
could be either a simple array or a hash table (with the top-level xid
as hash key)?

I think the single array would be sufficient, but the hash table would
allow keeping the apply logic more or less as it is today. See the
attached patch that adds such a cache - I do admit I haven't tested this,
but hopefully it's a sufficient illustration of the idea.

It does not handle cleanup of the cache, but I think that should not be
difficult - we simply need to remove entries for transactions that got
committed or rolled back. And do something about transactions without an
explicit commit/rollback record, but that can be done by also handling
XLOG_RUNNING_XACTS (by removing anything preceding oldestRunningXid).
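
For instance, the commit/abort part could be as simple as this sketch
(a hypothetical function on top of the structures from the patch
attached below):

static void
ProcArrayForgetXidAssignments(TransactionId topXid)
{
	if (xidAssignmentsHash == NULL)
		return;

	/*
	 * By the time the commit/abort record is replayed, the subxids were
	 * already expired from KnownAssignedXids, so any assignments still
	 * cached for this xact can simply be discarded.
	 */
	(void) hash_search(xidAssignmentsHash, (void *) &topXid,
					   HASH_REMOVE, NULL);
}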

I don't think this is particularly complicated or a lot of code, and I
don't see why it would require data structures in shared memory. Only
the walreceiver on standby needs to worry about this, no?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

xid-assignment-v13-fix.patch (text/plain; charset=us-ascii)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1951103b26..b85c046b41 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7300,6 +7300,19 @@ StartupXLOG(void)
 					TransactionIdIsValid(record->xl_xid))
 					RecordKnownAssignedTransactionIds(record->xl_xid);
 
+				/*
+				 * Assign subtransaction xids to the top-level xid if the
+				 * record has that information. This is required at most
+				 * once per subtransaction.
+				 */
+				if (TransactionIdIsValid(xlogreader->toplevel_xid) &&
+					standbyState >= STANDBY_INITIALIZED)
+				{
+					Assert(XLogStandbyInfoActive());
+					ProcArrayCacheXidAssignment(xlogreader->toplevel_xid,
+												record->xl_xid);
+				}
+
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index cfb88db4a4..efddd0e1e6 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -958,6 +958,53 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
 	LWLockRelease(ProcArrayLock);
 }
 
+static HTAB *xidAssignmentsHash = NULL;
+
+typedef struct XidAssignmentEntry
+{
+	/* the hash lookup key MUST BE FIRST */
+	TransactionId	topXid;
+
+	int				nsubxids;
+	TransactionId	subxids[PGPROC_MAX_CACHED_SUBXIDS];
+} XidAssignmentEntry;
+
+void
+ProcArrayCacheXidAssignment(TransactionId topXid, TransactionId subXid)
+{
+	XidAssignmentEntry *entry;
+	bool				found;
+
+	if (xidAssignmentsHash == NULL)
+	{
+		/* First time through: initialize the hash table */
+		HASHCTL		ctl;
+
+		MemSet(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(TransactionId);
+		ctl.entrysize = sizeof(XidAssignmentEntry);
+		xidAssignmentsHash = hash_create("XID assignment cache", 256,
+										 &ctl, HASH_ELEM | HASH_BLOBS);
+	}
+
+	/* Look for an existing entry */
+	entry = (XidAssignmentEntry *) hash_search(xidAssignmentsHash,
+											 (void *) &topXid,
+											 HASH_ENTER, &found);
+
+	if (!found)
+		entry->nsubxids = 0;
+
+	entry->subxids[entry->nsubxids++] = subXid;
+
+	/* after reaching the limit, apply the assignments for this top XID */
+	if (entry->nsubxids == PGPROC_MAX_CACHED_SUBXIDS)
+	{
+		ProcArrayApplyXidAssignment(topXid, entry->nsubxids, entry->subxids);
+		entry->nsubxids = 0;
+	}
+}
+
 /*
  * TransactionIdIsInProgress -- is given transaction running in some backend
  *
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a5c7d0c064..17d89bdd7e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -67,6 +67,7 @@ extern void ProcArrayInitRecovery(TransactionId initializedUptoXID);
 extern void ProcArrayApplyRecoveryInfo(RunningTransactions running);
 extern void ProcArrayApplyXidAssignment(TransactionId topxid,
 										int nsubxids, TransactionId *subxids);
+extern void ProcArrayCacheXidAssignment(TransactionId topxid, TransactionId subxid);
 
 extern void RecordKnownAssignedTransactionIds(TransactionId xid);
 extern void ExpireTreeKnownAssignedTransactionIds(TransactionId xid,
#245Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#244)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:

I think we need to cache the subxids on the replica somehow but I
don't have a very good idea for it. Basically, there are two ways to
do it: (a) change the KnownAssignedXids in some way so that we can
easily find this information without losing its current benefits.
I can't think of a good way to do that, and even if we come up
with something, it could easily be a lot of work; (b) cache the
subxids for a particular transaction in local memory along with
KnownAssignedXids. This is doable, but then we have two data structures
(one in shared memory and the other in local memory) managing the same
information in different ways.

Do you have any other ideas?

I don't follow. Why couldn't we have a simple cache on the standby? It
could be either a simple array or a hash table (with the top-level xid
as hash key)?

I think having something like we discussed or what you have in the
patch won't be sufficient to clean the KnownAssignedXids array. The
point is that we won't write WAL for the xid-subxid association for
unlogged relations in the "Immediately WAL-log assignments" patch;
however, KnownAssignedXids would have both kinds of Xids as we
autofill it with gaps (see RecordKnownAssignedTransactionIds). If my
understanding is correct, making this work might need major surgery
in the code, or we would have to maintain the KnownAssignedXids array
differently.

I don't think this is particularly complicated or a lot of code, and I
don't see why it would require data structures in shared memory. Only
the walreceiver on standby needs to worry about this, no?

Not a new data structure in shared memory, but we already have a
KnownTransactionId structure in shared memory. So, after having a
local cache, we will have xidAssignmentsHash and KnownTransactionId
maintaining the same information in different ways. And, we need to
ensure both are cleaned up properly. That is what I was pointing out
above about maintaining two structures. However, I think before
discussing more on this, we need to think about the above problem.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#246Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#245)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote:

On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:

I think we need to cache the subxids on the replica somehow but I
don't have a very good idea for it. Basically, there are two ways to
do it: (a) change the KnownAssignedXids in some way so that we can
easily find this information without losing its current benefits.
I can't think of a good way to do that, and even if we come up
with something, it could easily be a lot of work; (b) cache the
subxids for a particular transaction in local memory along with
KnownAssignedXids. This is doable, but then we have two data structures
(one in shared memory and the other in local memory) managing the same
information in different ways.

Do you have any other ideas?

I don't follow. Why couldn't we have a simple cache on the standby? It
could be either a simple array or a hash table (with the top-level xid
as hash key)?

I think having something like we discussed or what you have in the
patch won't be sufficient to clean the KnownAssignedXids array. The
point is that we won't write WAL for the xid-subxid association for
unlogged relations in the "Immediately WAL-log assignments" patch;
however, KnownAssignedXids would have both kinds of Xids as we
autofill it with gaps (see RecordKnownAssignedTransactionIds). If my
understanding is correct, making this work might need major surgery
in the code, or we would have to maintain the KnownAssignedXids array
differently.

Hmm, that's a good point. If I understand correctly, the issue is
that if we create a new subxact, write something into an unlogged table,
and then create another subxact, the XID of the first subxact will be "known
assigned" but we won't know it's a subxact or to which parent xact it
belongs (because there will be no WAL records that could encode it).

I wonder if there's a simple solution (e.g. when creating the second
subxact we might notice the xid-subxid assignment was not logged, and
write some "dummy" WAL record). But I admit it seems a bit ugly.

I don't think this is particularly complicated or a lot of code, and I
don't see why it would require data structures in shared memory. Only
the walreceiver on standby needs to worry about this, no?

Not a new data structure in shared memory, but we already have a
KnownTransactionId structure in shared memory. So, after having a
local cache, we will have xidAssignmentsHash and KnownTransactionId
maintaining the same information in different ways. And, we need to
ensure both are cleaned up properly. That is what I was pointing out
above about maintaining two structures. However, I think before
discussing more on this, we need to think about the above problem.

Sure.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#247Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#246)
10 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Apr 8, 2020 at 6:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote:

On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:

I think we need to cache the subxids on the replica somehow, but I
don't have a very good idea for it. Basically, there are two ways to
do it: (a) change KnownAssignedXids in some way so that we can
easily find this information without losing its current benefits.
I can't think of a good way to do that, and even if we come up
with something, it could easily be a lot of work. (b) Cache the
subxids for a particular transaction in local memory along with
KnownAssignedXids. This is doable, but then we have two data structures
(one in shared memory, the other in local memory) managing the same
information in different ways.

Do you have any other ideas?

I don't follow. Why couldn't we have a simple cache on the standby? It
could be either a simple array or a hash table (with the top-level xid
as the hash key).

I think having something like we discussed, or what you have in the
patch, won't be sufficient to clean the KnownAssignedXids array. The
point is that with the "Immediately WAL-log assignments" patch we won't
write a WAL record for the xid-subxid association for unlogged
relations; however, KnownAssignedXids would contain both kinds of Xids,
as we autofill it with gaps (see RecordKnownAssignedTransactionIds). If
my understanding is correct, making this work might need major surgery
in the code, or we would have to maintain the KnownAssignedXids array
differently.

Hmm, that's a good point. If I understand correctly, the issue is
that if we create a new subxact, write something into an unlogged table,
and then create another subxact, the XID of the first subxact will be
"known assigned", but we won't know that it's a subxact, or which parent
xact it belongs to (because there will be no WAL records encoding that).

I wonder if there's a simple solution (e.g. when creating the second
subxact we might notice the xid-subxid assignment of the first one was
not logged, and write some "dummy" WAL record). But I admit it seems a
bit ugly.

I don't think this is particularly complicated or a lot of code, and I
don't see why it would require data structures in shared memory. Only
the walreceiver on the standby needs to worry about this, no?

Not a new data structure in shared memory, but we already have the
KnownAssignedXids structure in shared memory. So, after adding a
local cache, we would have xidAssignmentsHash and KnownAssignedXids
maintaining the same information in different ways, and we need to
ensure both are cleaned up properly. That is what I was pointing out
above about maintaining two structures. However, I think before
discussing this further, we need to think about the above problem.

I have rebased the patch set on the latest head. I haven't yet changed
anything for the xid assignment issue, because that discussion has not
concluded yet.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v13-0009-Add-TAP-test-for-streaming-vs.-DDL.patchapplication/octet-stream; name=v13-0009-Add-TAP-test-for-streaming-vs.-DDL.patchDownload
From fbae7eda7cdf0ac9904be1541f196588555241ad Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v13 09/10] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of a large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v13-0008-Enable-streaming-for-all-subscription-TAP-tests.patchapplication/octet-stream; name=v13-0008-Enable-streaming-for-all-subscription-TAP-tests.patchDownload
From 9a8d3be5ee695b2e5fd572304ff3b4ceb40909ae Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v13 08/10] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71..086d0c7 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e..ad3ed13 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v13-0010-Bugfix-handling-of-incomplete-toast-tuple.patchapplication/octet-stream; name=v13-0010-Bugfix-handling-of-incomplete-toast-tuple.patchDownload
From 909afae31c404497dfab8adc5095f985319d3425 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 28 Feb 2020 11:07:46 +0530
Subject: [PATCH v13 10/10] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c                |   3 +
 src/backend/replication/logical/decode.c        |  17 ++-
 src/backend/replication/logical/reorderbuffer.c | 182 +++++++++++++++---------
 src/include/access/heapam_xlog.h                |   1 +
 src/include/replication/reorderbuffer.h         |  20 ++-
 5 files changed, 147 insertions(+), 76 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c1586ef..ffa2f0f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2017,6 +2017,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 4f958e9..e82387c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -729,7 +729,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -796,7 +798,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -853,7 +856,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -889,7 +893,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -989,7 +993,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1027,7 +1031,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index cfa36b4..fc1c0fb 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -654,11 +654,14 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
-	ReorderBufferTXN *txn;
+	ReorderBufferTXN *txn, *toptxn;
+	bool	can_stream = false;
 
-	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	toptxn = txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
 
 	change->lsn = lsn;
 	change->txn = txn;
@@ -668,9 +671,50 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert then set the corresponding bit.  Otherwise,
+	 * if the toast-insert bit is already set and this is an insert/update,
+	 * clear the bit (the tuple is now complete).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) &&
+			 ((change->action == REORDER_BUFFER_CHANGE_INSERT) ||
+			 (change->action == REORDER_BUFFER_CHANGE_UPDATE)))
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+		can_stream = true;
+	}
+
+	/*
+	 * If this is a speculative insert then set the corresponding bit.
+	 * Otherwise, if we have speculative insert bit set and this is spec confirm
+	 * record then clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+		can_stream = true;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/*
+	 * If streaming is enabled and we have serialized this transaction
+	 * because it had an incomplete tuple, then now that the tuple is
+	 * complete we can stream it.
+	 */
+	if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+		&& !rbtxn_has_toast_insert(txn) && !rbtxn_has_spec_insert(txn))
+	{
+		ReorderBufferStreamTXN(rb, toptxn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);		
+	}
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -700,7 +744,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1862,8 +1906,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
+							ReorderBufferToastAppendChunk(rb, txn, relation,
+														  change);
 					}
 
 			change_done:
@@ -2453,7 +2497,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2502,7 +2546,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2525,6 +2569,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2539,8 +2584,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2548,12 +2598,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2620,7 +2678,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2807,15 +2865,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and doesn't have incomplete
+		 * data, remember it.
+		 */
+		if (((!largest) || (txn->total_size > largest->total_size)) &&
+			((txn->size > 0) && !rbtxn_has_toast_insert(txn) &&
+			!rbtxn_has_spec_insert(txn)))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2833,66 +2892,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/* Loop until we reach under the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
-		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
-		 */
-		txn = ReorderBufferLargestTXN(rb);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferSerializeTXN(rb, txn);
+		/*
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
+		 */
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 603f325..bfc2141 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* Do this transaction's changes have a toast insert without the main-table insert? */
+#define rbtxn_has_toast_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)
+
+/*
+ * Do this transaction's changes have a speculative insert without the
+ * corresponding speculative confirm?
+ */
+#define rbtxn_has_spec_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -355,6 +368,9 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -545,7 +561,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

v13-0001-Immediately-WAL-log-assignments.patchapplication/octet-stream; name=v13-0001-Immediately-WAL-log-assignments.patchDownload
From 314a9919eff3f7399a2ba05f3d9ae192854e7d0d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v13 01/10] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is still
required to avoid snapshot overflow on a hot standby.
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 ++++++++++++++-------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6b1ae1f..c5842d3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -222,6 +223,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5111,6 +5113,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6013,3 +6016,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment has not yet been written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and the assignment must not have been WAL-logged yet */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 4259309..afae08d 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* if the toplevel XID was included, mark the subxact as assigned */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		curinsert_flags |= XLOG_INCLUDE_XID;
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976..7b5257f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1189,6 +1189,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1227,6 +1228,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..0946179 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel_xid is valid, we need to assign the subxact to the
+	 * toplevel transaction. We need to do this for all records, hence we
+	 * do it before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(r->toplevel_xid))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f60ed2d..6d439d0 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -229,6 +229,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196..756f6df 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -150,6 +150,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -285,6 +287,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

v13-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patchapplication/octet-stream; name=v13-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patchDownload
From 6ea334436c9caa80ea1534780c5d498723ebadbb Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v13 04/10] Gracefully handle concurrent aborts of uncommitted 
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such an sqlerrcode,
the decoding logic aborts the ongoing decoding and returns
gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 41 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 40 ++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  8 ++---
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++++--
 src/include/utils/snapmgr.h                     |  4 ++-
 6 files changed, 115 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 65244b1..b59a6c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>pg_current_xact_id()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c4a5aa6..c1586ef 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1303,6 +1303,15 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1422,6 +1431,14 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1536,6 +1553,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_hot_search_buffer call during logical decoding");
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1685,6 +1710,14 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_get_latest_tid call during logical decoding");
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5490,6 +5523,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_finish_speculative call during logical decoding");
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..413a21f 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -481,6 +482,19 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.  We can't directly use TransactionIdDidAbort as after crash
+	 * such transaction might not have been marked as aborted.  See detailed
+	 * comments at snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -517,6 +531,19 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * If CheckXidAlive is valid, check whether that transaction aborted
+	 * and, if so, error out.  We can't use TransactionIdDidAbort directly,
+	 * as after a crash such a transaction might not be marked as aborted.
+	 * See the detailed comments in snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -643,6 +670,19 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, check whether that transaction aborted
+	 * and, if so, error out.  We can't use TransactionIdDidAbort directly,
+	 * as after a crash such a transaction might not be marked as aborted.
+	 * See the detailed comments in snapmgr.c where the variable is declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 183a0e9..e299142 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -696,7 +696,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1547,7 +1547,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1798,7 +1798,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1818,7 +1818,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c5..93a0c04 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding, where such a transaction may get
+ * aborted while the decoding is still in progress.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in and is not yet committed, we track it in
+ * CheckXidAlive, so that its status can be re-checked during catalog access.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set CheckXidAlive if the transaction has not committed yet.  We don't
+	 * check here whether the xid aborted; that happens during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13c..12f737b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1

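To make the abort-detection pattern in 0001 concrete: every catalog access
made under a historic snapshot has to recheck CheckXidAlive. Below is a
minimal sketch of that recheck, factored into a helper. RecheckXidAlive is a
hypothetical name, not part of the patch, and the snippet builds only against
the backend headers:

#include "postgres.h"

#include "access/transam.h"
#include "storage/procarray.h"
#include "utils/snapmgr.h"

/*
 * Hypothetical helper capturing the check repeated in systable_getnext(),
 * systable_recheck_tuple() and systable_getnext_ordered(): while decoding,
 * CheckXidAlive holds the xid of the (possibly in-progress) transaction,
 * and each catalog access must verify it has not aborted concurrently.
 */
static inline void
RecheckXidAlive(void)
{
	/*
	 * TransactionIdDidAbort is not usable here: after a crash, an aborted
	 * transaction may never have been stamped as aborted in the clog, so
	 * "not in progress and not committed" is treated as aborted.
	 */
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}
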
Attachment: v13-0002-Issue-individual-invalidations-with-wal_level-lo.patch
From 97b37f9390442504dba085ce896ca5a0aad6b61b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 18 Nov 2019 16:26:33 +0530
Subject: [PATCH v13 02/10] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations was accumulating all the invalidations in
memory, and then only wrote them once at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?
---
 src/backend/access/rmgrdesc/xactdesc.c          |  40 +++++++++
 src/backend/access/transam/xact.c               |   7 ++
 src/backend/replication/logical/decode.c        |  18 ++++
 src/backend/replication/logical/reorderbuffer.c | 110 +++++++++++++++++++++---
 src/backend/utils/cache/inval.c                 |  59 +++++++++++--
 src/include/access/xact.h                       |  13 ++-
 src/include/replication/reorderbuffer.h         |  11 +++
 7 files changed, 239 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index fbc5942..17c06f7 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c5842d3..cf78ffc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6013,6 +6013,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0946179..4f958e9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				if (!TransactionIdIsValid(xid))
+					break;
+
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9..183a0e9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2204,6 +2218,39 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	/* XXX Should we even write invalidations without a valid XID? */
+	if (!TransactionIdIsValid(xid))
+		return;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2234,12 +2281,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2591,6 +2638,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -2736,6 +2801,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3004,6 +3074,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the invalidation messages */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	oldsnap;
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..47be680 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, we write invalidations into WAL at each command
+ *	end to support decoding of in-progress transactions.  Until now it was
+ *	enough to log invalidations only at commit, because we only decoded the
+ *	transaction at commit time.  Only catalog cache and relcache invalidations
+ *	need to be logged; there cannot be any active MVCC scan in logical
+ *	decoding, so snapshot invalidations are not needed.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1068,8 +1077,8 @@ AtEOSubXact_Inval(bool isCommit)
 
 /*
  * CommandEndInvalidationMessages
- *		Process queued-up invalidation messages at end of one command
- *		in a transaction.
+ *		Process queued-up invalidation messages at end of one command
+ *		in a transaction.
  *
  * Here, we send no messages to the shared queue, since we don't know yet if
  * we will commit.  We do need to locally process the CurrentCmdInvalidMsgs
@@ -1078,8 +1087,8 @@ AtEOSubXact_Inval(bool isCommit)
  * of the prior-cmds list.
  *
  * Note:
- *		This should be called during CommandCounterIncrement(),
- *		after we have advanced the command ID.
+ *		This should be called during CommandCounterIncrement(),
+ *		after we have advanced the command ID.
  */
 void
 CommandEndInvalidationMessages(void)
@@ -1090,7 +1099,10 @@ CommandEndInvalidationMessages(void)
 	 * just quietly return if no state to work on.
 	 */
 	if (transInvalInfo == NULL)
-		return;
+		return;
+
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist ||
+		transInvalInfo->CurrentCmdInvalidMsgs.rclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+										 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, sizeof(xlrec));
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b38..b822c5e 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..af35287 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1

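For reference, the new XLOG_XACT_INVALIDATIONS record is just a message count
followed by the message array. Here is a sketch of how such a variable-length
record gets assembled and inserted, mirroring LogLogicalInvalidations above
but with the accumulation from the invalidation lists omitted
(LogInvalidationsSketch is a hypothetical name, not part of the patch):

#include "postgres.h"

#include "access/xact.h"
#include "access/xloginsert.h"
#include "storage/sinval.h"

/*
 * Hypothetical sketch: insert an XLOG_XACT_INVALIDATIONS record for an
 * already-collected array of invalidation messages.
 */
static void
LogInvalidationsSketch(SharedInvalidationMessage *msgs, int nmsgs)
{
	xl_xact_invalidations xlrec;

	memset(&xlrec, 0, sizeof(xlrec));
	xlrec.nmsgs = nmsgs;

	XLogBeginInsert();
	/* fixed part, up to the flexible array member */
	XLogRegisterData((char *) &xlrec, MinSizeOfXactInvalidations);
	/* the messages themselves */
	XLogRegisterData((char *) msgs,
					 nmsgs * sizeof(SharedInvalidationMessage));
	(void) XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
}
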
Attachment: v13-0003-Extend-the-output-plugin-API-with-stream-methods.patch
From 07523af1992bc0a53ed25f176eff1add2571b9f4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v13 03/10] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large in-progress transactions:

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..64f651f 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe..65244b1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are five required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and
+    <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253..497d8a9 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the start/stop/change/commit/abort
+	 * callbacks; the message and truncate callbacks are optional, as for
+	 * regular output plugins. We however enable streaming when at least one
+	 * of the methods is registered, so missing methods are easy to identify.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -862,6 +910,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = txn->first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f..f24e246 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..0d0a94a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287..e102840 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -393,6 +439,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
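
To make the callback wiring concrete, here is a minimal sketch of how a
hypothetical output plugin would fill in the new streaming fields of
OutputPluginCallbacks from its _PG_output_plugin_init() entry point. The
my_* functions are placeholders for the plugin's own implementations, not
part of the patch; note that the wrappers in logical.c (see the stream_stop
check above) error out when streaming is supported but a callback is
missing, so a plugin should register the full set:

static void my_stream_start(LogicalDecodingContext *ctx,
							ReorderBufferTXN *txn);
/* ... prototypes of the other my_* callbacks elided ... */

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* existing (non-streaming) callbacks */
	cb->startup_cb = my_startup;
	cb->begin_cb = my_begin;
	cb->change_cb = my_change;
	cb->commit_cb = my_commit;
	cb->shutdown_cb = my_shutdown;

	/* streaming callbacks added by this patch series */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_change_cb = my_stream_change;
	cb->stream_truncate_cb = my_stream_truncate;
	cb->stream_message_cb = my_stream_message;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
}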

v13-0005-Implement-streaming-mode-in-ReorderBuffer.patch
From b44551942dd95833421fe85b4225def6da48bc2b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 10 Jan 2020 18:03:27 -0300
Subject: [PATCH v13 05/10] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxacts with their toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets aborted).
---
 src/backend/access/heap/heapam_visibility.c     |  38 +-
 src/backend/replication/logical/reorderbuffer.c | 710 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  36 ++
 3 files changed, 693 insertions(+), 91 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..160b167 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e299142..7383f14 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -773,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -865,6 +910,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -988,7 +1036,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1024,6 +1072,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1089,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1316,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1341,8 +1404,93 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We build the hash table even if there are no CIDs. That's because
+ * when streaming in-progress transactions we may run into tuples with
+ * CIDs before actually decoding the changes that set them. Think e.g.
+ * about INSERT followed by TRUNCATE, where the TRUNCATE may not be
+ * decoded yet when applying the INSERT. Building the hash table ensures
+ * that ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding a transaction at commit time (at which point it's
+ * guaranteed to have seen all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1491,63 +1636,75 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, the (sub)transaction might get
+ * aborted concurrently.  In that case, if the (sub)transaction has made
+ * catalog updates, we might decode tuples using the wrong catalog version.
+ * To detect a concurrent abort, we set CheckXidAlive to the xid of the
+ * (sub)transaction to which the current change belongs.  During catalog
+ * scans we can then check the status of that xid, and if it has aborted we
+ * report a specific error which we can ignore.  We might have already
+ * streamed some of the changes for the aborted (sub)transaction, but that is
+ * fine because when we decode the abort we will stream an abort message to
+ * truncate the changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Set CheckXidAlive if the transaction is not yet committed. We don't
+	 * check whether the xid aborted here; that happens during catalog access.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
-	}
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send the data of a transaction (and its subtransactions) to the output
+ * plugin. If streaming is true, the data is sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 
-	/* build data to be able to lookup the CommandIds of catalog tuples */
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1563,15 +1720,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1579,6 +1741,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1588,8 +1763,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1655,7 +1828,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1676,8 +1857,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1695,7 +1874,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +1932,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +1949,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+									change->data.msg.prefix,
+									change->data.msg.message_size,
+									change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,9 +1989,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1818,7 +2011,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
 					}
 
 					break;
@@ -1858,14 +2052,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last LSN of the stream as the final_lsn before calling
+			 * stream_stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if transaction is streaming
+		 * otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+
+			/* Avoid copying if it's already copied. */
+			if (snapshot_now->copied)
+				txn->snapshot_now = snapshot_now;
+			else
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2110,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,18 +2145,122 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+
+				/* Avoid copying if it's already copied. */
+				if (snapshot_now->copied)
+					txn->snapshot_now = snapshot_now;
+				else
+					txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+															  txn, command_id);
+				/*
+				 * Set the last LSN of the stream as the final_lsn before
+				 * calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+
+				FlushErrorState();
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
 /*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2360,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * When the (sub)transaction was streamed, notify the remote node
+	 * about the abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2502,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit; the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions, as we
+ * can't stream them individually anyway, and we only ever pick toplevel
+ * transactions for eviction.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2520,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2532,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2582,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2301,6 +2673,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2405,6 +2784,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (with streaming we don't update the
+ * memory accounting for subtransactions, so their size is always 0). But we
+ * can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2424,15 +2835,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2752,6 +3194,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (it may have been streamed just before the commit, in which case the
+ * commit would attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	volatile Snapshot snapshot_now;
+	volatile CommandId command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because after the last
+		 * streaming run we might have gotten some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e102840..6d65986 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions, in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -225,6 +244,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Have we sent any changes for this transaction to the output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -255,6 +284,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
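
As the commit message above notes, the new ReorderBufferTXN->toptxn pointer
lets an output plugin distinguish a subxact abort from a toplevel one in its
stream_abort_cb. A minimal sketch of such a callback, assuming hypothetical
my_send_* helpers (not part of the patch) that emit the plugin's own abort
messages:

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	if (txn->toptxn != NULL)
	{
		/*
		 * Subxact abort: tell the downstream to discard only the changes
		 * streamed under this subxact's XID, keeping the rest of the
		 * toplevel transaction (identified by toptxn->xid).
		 */
		my_send_subxact_abort(ctx, txn->toptxn->xid, txn->xid);
	}
	else
	{
		/* Toplevel abort: all streamed changes for this XID go away. */
		my_send_xact_abort(ctx, txn->xid);
	}
}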

v13-0006-Add-support-for-streaming-to-built-in-replicatio.patch
From c74472651fd234c3a5fa57d67e0e5fc657f39212 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 11:35:35 +0530
Subject: [PATCH v13 06/10] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so there is nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    4 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   45 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    3 +
 src/backend/replication/logical/launcher.c         |    1 -
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  140 ++-
 src/backend/replication/logical/worker.c           | 1026 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  315 +++++-
 src/backend/replication/slotfuncs.c                |    6 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2033 insertions(+), 43 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8bead..95b7c24 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..3349cc4 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731..f28482f 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 7f15667..65b6b76 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e..4ef4fd4 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4133,6 +4133,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e..8156a42 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 497d8a9..dfc681d 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1148,7 +1148,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1193,7 +1193,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = txn->final_lsn;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..5242ac0 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID of the committed transaction (must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* toplevel and subtransaction IDs (must both be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
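
To illustrate the new messages, decoding one large transaction with
streaming enabled produces a wire sequence roughly like this (one letter
per action byte, as written by the functions above):

    S (xid, first_segment=1)  I U D ...  E          <- first chunk
    S (xid, first_segment=0)  I I ...    E          <- following chunks
    ...
    c (xid, flags, commit_lsn, end_lsn, commit_time)

with 'A' (xid, subxid) instead of 'c' when a (sub)transaction aborts.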
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a12..3dc5f83 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * also has to deal with aborts of both the toplevel transaction and of
+ * individual subtransactions. This is achieved by tracking per-subxact
+ * offsets, which are then used to truncate the file with serialized
+ * changes.
+ *
+ * The files are placed in the temporary-files directory by default, and
+ * the filenames include both the XID of the toplevel transaction and the
+ * OID of the subscription.
+ * This is necessary so that different workers processing a remote transaction
+ * with the same XID don't interfere.
+ *
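+ * A typical exchange for one streamed transaction therefore looks like:
+ *
+ *   STREAM START (first segment) - create the changes file
+ *   ... DML messages appended to the file ...
+ *   STREAM STOP                  - write the subxact info, close the file
+ *   (further START/STOP blocks append more changes)
+ *   STREAM COMMIT                - replay the file, then remove it
+ *
+ * A STREAM ABORT instead removes the files (toplevel abort) or truncates
+ * the changes file to the aborted subxact's start offset.
+ *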
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId	xid;		/* XID of the subxact */
+	off_t			offset;		/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first field of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -553,6 +658,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, restore the subxact info for the
+	 * transaction. (Cleanup of stale files is handled by stream_open_file
+	 * above.)
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're
+		 * likely aborting one of the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
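+		 *
+		 * As a worked example (hypothetical values): with subxacts of
+		 * {xid=700, offset=0}, {xid=701, offset=1024}, {xid=702, offset=4096},
+		 * an abort of subxid 701 truncates the changes file to 1024 bytes
+		 * and leaves nsubxacts = 1, discarding the changes of both 701 and
+		 * 702 in one step.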
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+			return;
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -565,6 +982,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1000,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1039,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1157,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1302,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1675,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1816,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1478,6 +1929,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1493,6 +1960,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2411,561 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main
+ * file. The file is always overwritten as a whole, and we also include a
+ * CRC32C checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
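+ *
+ * The file layout is simple (written below as three separate writes, not
+ * as a single struct):
+ *
+ *   uint32       checksum;   CRC32C of nsubxacts and the subxacts array
+ *   uint32       nsubxacts;  number of entries
+ *   SubXactInfo  subxacts[]; array of (xid, offset) pairs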
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen
+	 * in the last call, so in that case just ignore it (the change is not
+	 * the first one for that subxact).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
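+
+/*
+ * With hypothetical OID/XID values, subscription 16394 streaming toplevel
+ * transaction 512 would thus use:
+ *
+ *   base/pgsql_tmp/logical-16394-512.subxacts
+ *   base/pgsql_tmp/logical-16394-512.changes
+ */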
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * handling the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (not counting the
+ * length field itself), an action code (identifying the message type) and
+ * the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
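+ *
+ * Each record in the changes file is thus laid out as:
+ *
+ *   int32  len;       action byte + payload size (not counting len itself)
+ *   char   action;    'I', 'U', 'D', 'T', 'R', 'Y', ...
+ *   char   payload[]; the message body, minus the leading subxact XID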
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3131,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 5fbf2d4..b0fba26 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,54 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream side is however updated only at
+ * commit time, and with streamed transactions the commit order may differ
+ * from the order in which the transactions are sent. So streamed
+ * transactions are tracked separately, using the streamed_txns list below.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +119,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +145,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +208,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +237,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +260,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +281,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +370,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't
+	 * care whether it's the top-level transaction or not (we have already
+	 * sent the toplevel XID when starting the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because their changes may only be applied later (or not
+	 * at all, if aborted), and in an order that we don't know at this
+	 * point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to re-send the schema after each catalog change,
+		 * and such a change may occur when streaming has already started,
+		 * so we have to track new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -307,19 +426,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
 		relentry->map = convert_tuples_by_name(indesc, outdesc);
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -343,17 +468,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -362,6 +489,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -390,7 +521,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -410,7 +541,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -434,7 +565,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -454,7 +585,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -479,6 +610,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -507,13 +642,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +723,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -623,6 +844,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions, so a
+ * simple linear search of the per-entry list is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -755,6 +1004,36 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -789,7 +1068,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
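
To try this out at the protocol level, streaming is requested as an output
plugin option when starting replication; roughly like this (a sketch, with
placeholder slot and publication names):

    START_REPLICATION SLOT "mysub" LOGICAL 0/0
        (proto_version '2', publication_names '"mypub"', streaming 'on')

Requesting streaming with proto_version below 2, or against a context
without the stream callbacks, is rejected in pgoutput_startup() as shown
above.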
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index f776de3..9121420 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -156,6 +156,12 @@ create_logical_replication_slot(char *name, char *plugin,
 									NULL);
 
 	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
+	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 122d884..759ca5c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1004,6 +1004,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9..3b3e1fd 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -981,7 +981,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561..89158ed 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f1aa6e9..70d39f8 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction containing subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

Attachment: v13-0007-Track-statistics-for-streaming.patch
From 86a29e7801aaae74b8d2b6b5cd9134e7c89ca06e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 2 Apr 2020 13:19:29 +0530
Subject: [PATCH v13 07/10] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c50b721..8063ae8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2063,6 +2063,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber after
+      the memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+      Streaming only works with toplevel transactions (subtransactions can't
+      be streamed independently), so the counter does not get incremented for
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the subscriber.
+      Transactions may get streamed repeatedly, and this counter gets incremented
+      on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d406ea8..65d650d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7383f14..cfa36b4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3285,6 +3289,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count the transaction again if it has already been streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 759ca5c..1656b4d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1333,7 +1333,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1354,7 +1354,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or were
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2399,6 +2400,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3240,7 +3244,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3297,6 +3301,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3320,6 +3327,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3406,6 +3416,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3654,11 +3669,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4bce3ad..9fb1ffe 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6d65986..603f325 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -517,15 +517,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 366828f..3888b0c 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac31840..68e2deb 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
1.8.3.1

#248Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Dilip Kumar (#247)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have rebased the patch on the latest head. I haven't yet changed
anything for xid assignment thing because it is not yet concluded.

Some review comments from 0001-Immediately-WAL-log-*.patch,

+bool
+IsSubTransactionAssignmentPending(void)
+{
+ if (!XLogLogicalInfoActive())
+ return false;
+
+ /* we need to be in a transaction state */
+ if (!IsTransactionState())
+ return false;
+
+ /* it has to be a subtransaction */
+ if (!IsSubTransaction())
+ return false;
+
+ /* the subtransaction has to have a XID assigned */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return false;
+
+ /* and it needs to have 'assigned' */
+ return !CurrentTransactionState->assigned;
+
+}
IMHO, it's important to reduce the complexity of this function since
it's called for every WAL insertion. During the lifespan of a
transaction, each of these if conditions will only be evaluated if the
previous conditions are true. So, we could maintain some state machine
to avoid evaluating a condition multiple times inside a transaction.
But if the overhead is not much, it's probably not worth it.
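For illustration, here is a minimal sketch of such caching; the
needs_assignment field, and the places where it would be set and
cleared, are assumptions for the sake of the example, not something in
the patch:

/*
 * Hypothetical sketch: cache the combined answer in the transaction
 * state so the full set of checks runs at most once per subtransaction
 * rather than on every WAL insertion. needs_assignment would be set
 * when a subtransaction acquires its XID while wal_level is logical,
 * and cleared once the assignment has been WAL-logged.
 */
bool
IsSubTransactionAssignmentPending(void)
{
	/* single branch on the fast path */
	return CurrentTransactionState->needs_assignment;
}

The cost then moves from the per-record checks into the bookkeeping
that keeps the flag accurate, which is the complexity trade-off
mentioned above.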

+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
This looks wrong. We should change the name of this macro, or we can
add the 1 byte directly to HEADER_SCRATCH_SIZE with some comments.

@@ -195,6 +197,10 @@ XLogResetInsertion(void)
{
int i;

+ /* reset the subxact assignment flag (if needed) */
+ if (curinsert_flags & XLOG_INCLUDE_XID)
+ MarkSubTransactionAssigned();
The comment looks contradictory.

XLogSetRecordFlags(uint8 flags)
{
Assert(begininsert_called);
- curinsert_flags = flags;
+ curinsert_flags |= flags;
}
I didn't understand why we need this change in this patch.

+ txid = XLogRecGetTopXid(record);
+
+ /*
+ * If the toplevel_xid is valid, we need to assign the subxact to the
+ * toplevel transaction. We need to do this for all records, hence we
+ * do it before the switch.
+ */
s/toplevel_xid/toplevel xid or s/toplevel_xid/txid
  if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
- info != XLOG_XACT_ASSIGNMENT)
+ !TransactionIdIsValid(r->toplevel_xid))
Perhaps, XLogRecGetTopXid() can be used.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#249Dilip Kumar
dilipbalaut@gmail.com
In reply to: Kuntal Ghosh (#248)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have rebased the patch on the latest head. I haven't yet changed
anything for xid assignment thing because it is not yet concluded.

Some review comments from 0001-Immediately-WAL-log-*.patch,

+bool
+IsSubTransactionAssignmentPending(void)
+{
+ if (!XLogLogicalInfoActive())
+ return false;
+
+ /* we need to be in a transaction state */
+ if (!IsTransactionState())
+ return false;
+
+ /* it has to be a subtransaction */
+ if (!IsSubTransaction())
+ return false;
+
+ /* the subtransaction has to have a XID assigned */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return false;
+
+ /* and it needs to have 'assigned' */
+ return !CurrentTransactionState->assigned;
+
+}
IMHO, it's important to reduce the complexity of this function since
it's called for every WAL insertion. During the lifespan of a
transaction, each of these if conditions will only be evaluated if the
previous conditions are true. So, we could maintain some state machine
to avoid evaluating a condition multiple times inside a transaction.
But if the overhead is not much, it's probably not worth it.

Yeah, maybe in some cases we can avoid checking multiple conditions by
maintaining that state. But that state will have to be at the
transaction level, and I am not sure how worthwhile it would be to add
one extra condition just to skip a few if checks; it would also add
code complexity. And in cases where logical decoding is not enabled, it
may add one extra check: I mean, we'd first check the state, and that
would take us to the first if check anyway.

+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
This looks wrong. We should change the name of this macro, or we can
add the 1 byte directly to HEADER_SCRATCH_SIZE with some comments.

I think this is in sync with the code below (SizeOfXlogOrigin), so it
doesn't make much sense to introduce different terminology, no?
#define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))

@@ -195,6 +197,10 @@ XLogResetInsertion(void)
{
int i;

+ /* reset the subxact assignment flag (if needed) */
+ if (curinsert_flags & XLOG_INCLUDE_XID)
+ MarkSubTransactionAssigned();
The comment looks contradictory.

XLogSetRecordFlags(uint8 flags)
{
Assert(begininsert_called);
- curinsert_flags = flags;
+ curinsert_flags |= flags;
}
I didn't understand why we need this change in this patch.

I think it's changed so that the code below can use it, but we end up
setting the flag directly there anyway. I think I will change it in the
next version.

@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
scratch += sizeof(replorigin_session_origin);
}

+ /* followed by toplevel XID, if not already included in previous record */
+ if (IsSubTransactionAssignmentPending())
+ {
+ TransactionId xid = GetTopTransactionIdIfAny();
+
+ /* update the flag (later used by XLogInsertRecord) */
+ curinsert_flags |= XLOG_INCLUDE_XID;
+ txid = XLogRecGetTopXid(record);
+
+ /*
+ * If the toplevel_xid is valid, we need to assign the subxact to the
+ * toplevel transaction. We need to do this for all records, hence we
+ * do it before the switch.
+ */
s/toplevel_xid/toplevel xid or s/toplevel_xid/txid

Okay, we can change that.

if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
- info != XLOG_XACT_ASSIGNMENT)
+ !TransactionIdIsValid(r->toplevel_xid))
Perhaps, XLogRecGetTopXid() can be used.

ok

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#250Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Dilip Kumar (#249)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
This looks wrong. We should change the name of this macro, or we can
add the 1 byte directly to HEADER_SCRATCH_SIZE with some comments.

I think this is in sync with the code below (SizeOfXlogOrigin), so it
doesn't make much sense to introduce different terminology, no?
#define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))

In that case, we can rename this, for example, SizeOfXLogTransactionId.
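That would keep it parallel with the origin macro, e.g.:

/* XID plus one flag byte, mirroring SizeOfXlogOrigin */
#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))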

Some review comments from 0002-Issue-individual-*.patch,

+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr lsn, int nmsgs,
+ SharedInvalidationMessage *msgs)
+{
+ MemoryContext oldcontext;
+ ReorderBufferChange *change;
+
+ /* XXX Should we even write invalidations without valid XID? */
+ if (xid == InvalidTransactionId)
+ return;
+
+ Assert(xid != InvalidTransactionId);

It seems we don't call the function if xid is not valid. In fact,

@@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
  }
  case XLOG_XACT_ASSIGNMENT:
  break;
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ if (!TransactionIdIsValid(xid))
+ break;
+
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Why should we insert a WAL record for such cases?

+ * When wal_level=logical, write invalidations into WAL at each command end to
+ *  support the decoding of the in-progress transaction.  As of now it was
+ *  enough to log invalidation only at commit because we are only decoding the
+ *  transaction at the commit time.   We only need to log the catalog cache and
+ *  relcache invalidation.  There can not be any active MVCC scan in logical
+ *  decoding so we don't need to log the snapshot invalidation.
The alignment is not right.
 /*
  * CommandEndInvalidationMessages
- * Process queued-up invalidation messages at end of one command
- * in a transaction.
+ *              Process queued-up invalidation messages at end of one command
+ *              in a transaction.
Looks like unnecessary changes.
  * Note:
- * This should be called during CommandCounterIncrement(),
- * after we have advanced the command ID.
+ *              This should be called during CommandCounterIncrement(),
+ *              after we have advanced the command ID.
  */
Looks like unnecessary changes.
  if (transInvalInfo == NULL)
- return;
+ return;
Looks like unnecessary changes.
+ /* prepare record */
+ memset(&xlrec, 0, sizeof(xlrec));
We should use MinSizeOfXactInvalidations, no?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#251Dilip Kumar
dilipbalaut@gmail.com
In reply to: Kuntal Ghosh (#250)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
This looks wrong. We should change the name of this macro, or we can
add the 1 byte directly to HEADER_SCRATCH_SIZE with some comments.

I think this is in sync with the code below (SizeOfXlogOrigin), so it
doesn't make much sense to introduce different terminology, no?
#define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))

In that case, we can rename this, for example, SizeOfXLogTransactionId.

Makes sense.

Some review comments from 0002-Issue-individual-*.patch,

+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr lsn, int nmsgs,
+ SharedInvalidationMessage *msgs)
+{
+ MemoryContext oldcontext;
+ ReorderBufferChange *change;
+
+ /* XXX Should we even write invalidations without valid XID? */
+ if (xid == InvalidTransactionId)
+ return;
+
+ Assert(xid != InvalidTransactionId);

It seems we don't call the function if xid is not valid. In fact,

@@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
}
case XLOG_XACT_ASSIGNMENT:
break;
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ if (!TransactionIdIsValid(xid))
+ break;
+
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Why should we insert a WAL record for such cases?

I think we can avoid this. I will analyze it and send an update with my next patch.

+ * When wal_level=logical, write invalidations into WAL at each command end to
+ *  support the decoding of the in-progress transaction.  As of now it was
+ *  enough to log invalidation only at commit because we are only decoding the
+ *  transaction at the commit time.   We only need to log the catalog cache and
+ *  relcache invalidation.  There can not be any active MVCC scan in logical
+ *  decoding so we don't need to log the snapshot invalidation.
The alignment is not right.

Will fix.

/*
* CommandEndInvalidationMessages
- * Process queued-up invalidation messages at end of one command
- * in a transaction.
+ *              Process queued-up invalidation messages at end of one command
+ *              in a transaction.
Looks like unnecessary changes.

Will fix.

* Note:
- * This should be called during CommandCounterIncrement(),
- * after we have advanced the command ID.
+ *              This should be called during CommandCounterIncrement(),
+ *              after we have advanced the command ID.
*/
Looks like unnecessary changes.

Will fix.

if (transInvalInfo == NULL)
- return;
+ return;
Looks like unnecessary changes.
+ /* prepare record */
+ memset(&xlrec, 0, sizeof(xlrec));
We should use MinSizeOfXactInvalidations, no?

Right.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#252Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Dilip Kumar (#251)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Apr 13, 2020 at 6:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Skipping 0003 for now. Review comments from 0004-Gracefully-handle-*.patch

@@ -5490,6 +5523,14 @@ heap_finish_speculative(Relation relation,
ItemPointer tid)
ItemId lp = NULL;
HeapTupleHeader htup;

+ /*
+ * We don't expect direct calls to heap_hot_search with
+ * valid CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ elog(ERROR, "unexpected heap_hot_search call during logical decoding");
The call is to heap_finish_speculative.

@@ -481,6 +482,19 @@ systable_getnext(SysScanDesc sysscan)
}
}

+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
s/transaction aborted/transaction aborted concurrently perhaps? Also,
can we move this check to the beginning of the function? If the
condition fails, we can skip the sys scan.

Some of the checks look repetitive in the same file. Should we
declare them as inline functions?
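For example, the repeated concurrent-abort check could be factored out
as something like the following (the helper name is illustrative, not
from the patch):

/*
 * Error out if the transaction whose changes are being decoded has
 * aborted in the meantime, so we never decode based on stale or bogus
 * catalog contents.
 */
static inline void
check_xid_still_alive(void)
{
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}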

Review comments from 0005-Implement-streaming*.patch

+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+ dlist_iter iter;
...
+#endif
+}

We can implement the same as following:
#ifdef USE_ASSERT_CHECKING
static void
AssertChangeLsnOrder(ReorderBufferTXN *txn)
{
dlist_iter iter;
...
}
#else
#define AssertChangeLsnOrder(txn) ((void)true)
#endif

+ * if it is aborted we will report an specific error which we can ignore. We
s/an specific/a specific

+ * Set the last last of the stream as the final lsn before calling
+ * stream stop.
s/last last/last
  PG_CATCH();
  {
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData  *errdata = CopyErrorData();
When we don't re-throw, the errdata should be freed by calling
FreeErrorData(errdata), right?
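For context, a minimal sketch of the catch-block shape being discussed
(simplified from the patch; ccxt is assumed to be the caller's memory
context saved before PG_TRY):

PG_CATCH();
{
	MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
	ErrorData  *errdata = CopyErrorData();

	if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
	{
		/* concurrent abort: swallow the error and stop the stream */
		FlushErrorState();
		FreeErrorData(errdata);		/* free the copy when not re-throwing */

		txn->final_lsn = prev_lsn;
		rb->stream_stop(rb, txn);
	}
	else
	{
		/* switch back to the error context before re-throwing */
		MemoryContextSwitchTo(ecxt);
		PG_RE_THROW();
	}
}
PG_END_TRY();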
+ /*
+ * Set the last last of the stream as the final lsn before
+ * calling stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+
+ FlushErrorState();
+ }
stream_stop() can still throw some error, right? In that case, we
should flush the error state before calling stream_stop().
+ /*
+ * Remember the command ID and snapshot if transaction is streaming
+ * otherwise free the snapshot if we have copied it.
+ */
+ if (streaming)
+ {
+ txn->command_id = command_id;
+
+ /* Avoid copying if it's already copied. */
+ if (snapshot_now->copied)
+ txn->snapshot_now = snapshot_now;
+ else
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+  txn, command_id);
+ }
+ else if (snapshot_now->copied)
+ ReorderBufferFreeSnap(rb, snapshot_now);
Hmm, it seems this part relies on the assumption that after copying the
snapshot, no subsequent step can throw any error. If one does, then we
may again create a copy of the snapshot in the catch block, which will
leak some memory. Is my understanding correct?
+ }
+ else
+ {
+ ReorderBufferCleanupTXN(rb, txn);
+ PG_RE_THROW();
+ }
Shouldn't we switch back to the previously created error memory context
before re-throwing?
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+ volatile Snapshot snapshot_now;
+ volatile CommandId command_id = FirstCommandId;
In the modified ReorderBufferCommit(), why is it necessary to declare
the above two variables as volatile? There is no try-catch block here.

@@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
if (txn == NULL)
return;

+ /*
+ * When the (sub)transaction was streamed, notify the remote node
+ * about the abort only if we have sent any data for this transaction.
+ */
+ if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+ rb->stream_abort(rb, txn, lsn);
+
s/When/If
+ /*
+ * When the (sub)transaction was streamed, notify the remote node
+ * about the abort.
+ */
+ if (rbtxn_is_streamed(txn))
+ rb->stream_abort(rb, txn, lsn);
s/When/If. And, in this case, if we've not sent any data, why should
we send the abort message (similar to the previous one)?
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
Should we put an assert (not necessarily here) to validate the above comment?
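Something like this, perhaps (assuming a corresponding
rbtxn_is_serialized() macro exists for the serialized flag):

/* a transaction must never be both streamed and serialized */
Assert(!(rbtxn_is_streamed(txn) && rbtxn_is_serialized(txn)));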
+ txn = ReorderBufferLargestTopTXN(rb);
+
+ /* we know there has to be one, because the size is not zero */
+ Assert(txn && !txn->toptxn);
+ Assert(txn->size > 0);
+ Assert(rb->size >= txn->size);
The same three assertions are already there in ReorderBufferLargestTopTXN().
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+ LogicalDecodingContext *ctx = rb->private_data;
+
+ return ctx->streaming;
+}
Potential inline function.
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ volatile Snapshot snapshot_now;
+ volatile CommandId command_id;
Here also, do we need to declare these two variables as volatile?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#253Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#246)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Apr 8, 2020 at 6:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote:

On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I think having something like we discussed, or what you have in the
patch, won't be sufficient to clean the KnownAssignedXid array. The
point is that we won't write WAL for the xid-subxid association for
unlogged relations in the "Immediately WAL-log assignments" patch;
however, the KnownAssignedXid array would have both kinds of Xids, as
we autofill it with gaps (see RecordKnownAssignedTransactionIds). If my
understanding is correct, to make it work we might need major surgery
in the code, or we'd have to maintain the KnownAssignedXid array
differently.

Hmm, that's a good point. If I understand correctly, the issue is
that if we create a new subxact, write something into an unlogged table,
and then create another subxact, the XID of the first subxact will be
"known assigned", but we won't know it's a subxact or which parent xact
it belongs to (because there will be no WAL records that could encode it).

Yeah, there could be multiple such missing subxacts.

I wonder if there's a simple solution (e.g. when creating the second
subxact we might notice the xid-subxid assignment was not logged, and
write some "dummy" WAL record).

That WAL record can have multiple xids.

But I admit it seems a bit ugly.

Yeah, I guess it could be tricky as well, because while assembling some
WAL record we would need to generate an additional dummy record, or
might need to add additional information to the current record being
formed. I think the handling of such WAL records during hot standby and
in logical decoding could vary. During logical decoding, currently, we
don't form an association for a subtransaction if it doesn't have any
changes (see ReorderBufferCommitChild), and now with this new type of
record, I think we need to ensure that we don't form such an association.

I think after quite some changes, tweaks and a lot of testing, we
might be able to remove XLOG_XACT_ASSIGNMENT, but I am not sure it is
worth doing along with this patch. It would have been good to do this
if we were adding any visible overhead with this patch, or if it were
easy to do; however, neither seems to be true. So it might be better to
write good comments in the code indicating what we would need to do to
remove XLOG_XACT_ASSIGNMENT, so that if we feel it is important to do
in the future we can do so. I am not against spending effort on this,
but I don't see the urgency of doing it along with this patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#254Amit Kapila
amit.kapila16@gmail.com
In reply to: Kuntal Ghosh (#250)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
This looks wrong. We should change the name of this macro, or we can
add the 1 byte directly to HEADER_SCRATCH_SIZE with some comments.

I think this is in sync with the code below (SizeOfXlogOrigin), so it
doesn't make much sense to introduce different terminology, no?
#define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))

In that case, we can rename this, for example, SizeOfXLogTransactionId.

Some review comments from 0002-Issue-individual-*.path,

+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr lsn, int nmsgs,
+ SharedInvalidationMessage *msgs)
+{
+ MemoryContext oldcontext;
+ ReorderBufferChange *change;
+
+ /* XXX Should we even write invalidations without valid XID? */
+ if (xid == InvalidTransactionId)
+ return;
+
+ Assert(xid != InvalidTransactionId);

It seems we don't call the function if xid is not valid. In fact,

You have a valid point. Also, if we first check (xid ==
InvalidTransactionId) and return from the function, it is not clear how
the Assert could even be hit.

@@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
}
case XLOG_XACT_ASSIGNMENT:
break;
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ if (!TransactionIdIsValid(xid))
+ break;
+
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Why should we insert a WAL record for such cases?

Right, if there is any such case, we should avoid it.

One more point about this patch: the commit message needs to be updated:

The new invalidations are written to WAL immediately, without any such
caching. Perhaps it would be possible to add similar caching, e.g. at
the command level, or something like that?

I think the above part of the commit message is not right, as the patch
already does such caching at the command level now.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#255Dilip Kumar
dilipbalaut@gmail.com
In reply to: Kuntal Ghosh (#252)
10 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Apr 13, 2020 at 11:43 PM Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

On Mon, Apr 13, 2020 at 6:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Skipping 0003 for now. Review comments from 0004-Gracefully-handle-*.patch

@@ -5490,6 +5523,14 @@ heap_finish_speculative(Relation relation,
ItemPointer tid)
ItemId lp = NULL;
HeapTupleHeader htup;

+ /*
+ * We don't expect direct calls to heap_hot_search with
+ * valid CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ elog(ERROR, "unexpected heap_hot_search call during logical decoding");
The call is to heap_finish_speculative.

Fixed
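
So the corrected comment and error message would read roughly (a
sketch; only the function name in the text changes):

	/*
	 * We don't expect direct calls to heap_finish_speculative with
	 * valid CheckXidAlive for regular tables.  Track that below.
	 */
	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
		elog(ERROR, "unexpected heap_finish_speculative call during logical decoding");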

@@ -481,6 +482,19 @@ systable_getnext(SysScanDesc sysscan)
}
}

+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
s/transaction aborted/transaction aborted concurrently perhaps? Also,
can we move this check to the beginning of the function? If the
condition fails, we can skip the sys scan.

We must check this after we get the tuple because our goal is not to
decode based on a wrong tuple. If we moved the check earlier, the
transaction could abort right after the check. Once we have got the
tuple, if the transaction was alive at that time, then it doesn't
matter even if it aborts later, because we already have the right tuple.
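
In other words, the intended order in systable_getnext() is
fetch-then-check (a sketch; the fetch step is abbreviated):

	/* fetch the next tuple first, via the index or heap scan */
	htup = ...;

	/*
	 * Only now validate CheckXidAlive: if the transaction was still in
	 * progress (or had committed) when the tuple was fetched, a later
	 * abort cannot have handed us a wrong tuple.
	 */
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));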

Some of the checks look repetitive in the same file. Should we
declare them as inline functions?

Review comments from 0005-Implement-streaming*.patch

+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+ dlist_iter iter;
...
+#endif
+}

We can implement the same as follows:
#ifdef USE_ASSERT_CHECKING
static void
AssertChangeLsnOrder(ReorderBufferTXN *txn)
{
dlist_iter iter;
...
}
#else
#define AssertChangeLsnOrder(txn) ((void)true)
#endif

I am not sure; that doesn't look clean to me. Moreover, the other
similar functions, e.g. AssertTXNLsnOrder, are defined in the same way.

+ * if it is aborted we will report an specific error which we can ignore. We
s/an specific/a specific

Done

+ * Set the last last of the stream as the final lsn before calling
+ * stream stop.
s/last last/last
PG_CATCH();
{
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData  *errdata = CopyErrorData();
When we don't re-throw, the errdata should be freed by calling
FreeErrorData(errdata), right?

Done

+ /*
+ * Set the last last of the stream as the final lsn before
+ * calling stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+
+ FlushErrorState();
+ }
stream_stop() can still throw some error, right? In that case, we
should flush the error state before calling stream_stop().

Done

+ /*
+ * Remember the command ID and snapshot if transaction is streaming
+ * otherwise free the snapshot if we have copied it.
+ */
+ if (streaming)
+ {
+ txn->command_id = command_id;
+
+ /* Avoid copying if it's already copied. */
+ if (snapshot_now->copied)
+ txn->snapshot_now = snapshot_now;
+ else
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+  txn, command_id);
+ }
+ else if (snapshot_now->copied)
+ ReorderBufferFreeSnap(rb, snapshot_now);
Hmm, it seems this part relies on the assumption that after copying
the snapshot, no subsequent step can throw any error. If one does, we
would again create a copy of the snapshot in the catch block, which
would leak some memory. Is my understanding correct?

Actually, in CATCH we copy only if the error is
ERRCODE_TRANSACTION_ROLLBACK, and that can only occur during a
systable scan. In the TRY block we copy the snapshot after we have
streamed all the changes, i.e. after the systable scans are done, so
any error raised at that point will not be
ERRCODE_TRANSACTION_ROLLBACK and we will not copy again.
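
Putting the quoted pieces together, the error handling in
ReorderBufferStreamTXN() then looks roughly like this (a sketch
assembled from the snippets above, not the exact patch code):

	PG_TRY();
	{
		/*
		 * Stream the decoded changes.  Systable scans may raise
		 * ERRCODE_TRANSACTION_ROLLBACK on a concurrent abort; the
		 * snapshot is copied only after all changes are streamed.
		 */
	}
	PG_CATCH();
	{
		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
		ErrorData  *errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/* concurrent abort: close the stream cleanly */
			txn->final_lsn = prev_lsn;
			rb->stream_stop(rb, txn);

			FlushErrorState();
			FreeErrorData(errdata);
		}
		else
		{
			ReorderBufferCleanupTXN(rb, txn);
			MemoryContextSwitchTo(ecxt);
			PG_RE_THROW();
		}
	}
	PG_END_TRY();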

+ }
+ else
+ {
+ ReorderBufferCleanupTXN(rb, txn);
+ PG_RE_THROW();
+ }
Shouldn't we switch back to the previously created error memory
context before re-throwing?

Fixed.

+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+ volatile Snapshot snapshot_now;
+ volatile CommandId command_id = FirstCommandId;
In the modified ReorderBufferCommit(), why is it necessary to declare
the above two variables as volatile? There is no try-catch block here.

Fixed

@@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
if (txn == NULL)
return;

+ /*
+ * When the (sub)transaction was streamed, notify the remote node
+ * about the abort only if we have sent any data for this transaction.
+ */
+ if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+ rb->stream_abort(rb, txn, lsn);
+
s/When/If
+ /*
+ * When the (sub)transaction was streamed, notify the remote node
+ * about the abort.
+ */
+ if (rbtxn_is_streamed(txn))
+ rb->stream_abort(rb, txn, lsn);
s/When/If. And, in this case, if we've not sent any data, why should
we send the abort message (similar to the previous one)?

Fixed

+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
Should we put any assert (not necessarily here) to validate the above comment?

Because of toast handling, this assumption has changed now, so I will
remove this note in that patch (0010).

+ txn = ReorderBufferLargestTopTXN(rb);
+
+ /* we know there has to be one, because the size is not zero */
+ Assert(txn && !txn->toptxn);
+ Assert(txn->size > 0);
+ Assert(rb->size >= txn->size);
The same three assertions are already there in ReorderBufferLargestTopTXN().
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+ LogicalDecodingContext *ctx = rb->private_data;
+
+ return ctx->streaming;
+}
Potential inline function.

Done
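
For completeness, the inlined version is just:

static inline bool
ReorderBufferCanStream(ReorderBuffer *rb)
{
	LogicalDecodingContext *ctx = rb->private_data;

	return ctx->streaming;
}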

+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ volatile Snapshot snapshot_now;
+ volatile CommandId command_id;
Here also, do we need to declare these two variables as volatile?

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v14-0007-Track-statistics-for-streaming.patchapplication/octet-stream; name=v14-0007-Track-statistics-for-streaming.patchDownload
From be833105e8703d7471db400def123b4c121ac71b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 2 Apr 2020 13:19:29 +0530
Subject: [PATCH v14 07/10] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c50b721..8063ae8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2063,6 +2063,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to subscriber after
+      memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+      Streaming only works with toplevel transactions (subtransactions can't
+      be streamed independently), so the counter does not get incremented for
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to subscriber.
+      Transactions may get streamed repeatedly, and this counter gets incremented
+      on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d406ea8..65d650d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 801fdc5..845d820 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3282,6 +3286,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count the transaction if it has already been streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 759ca5c..1656b4d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1333,7 +1333,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1354,7 +1354,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or were
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2399,6 +2400,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3240,7 +3244,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3297,6 +3301,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillTxns;
 		int64		spillCount;
 		int64		spillBytes;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3320,6 +3327,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		memset(nulls, 0, sizeof(nulls));
@@ -3406,6 +3416,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3654,11 +3669,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4bce3ad..9fb1ffe 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6d65986..603f325 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -517,15 +517,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 366828f..3888b0c 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -85,6 +85,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64           streamTxns;
+	int64           streamCount;
+	int64           streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac31840..68e2deb 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
1.8.3.1

v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patchapplication/octet-stream; name=v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patchDownload
From 0054a3e3cf50db92317e1fec0583f3777b1f8a32 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 28 Feb 2020 11:07:46 +0530
Subject: [PATCH v14 10/10] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c                |   3 +
 src/backend/replication/logical/decode.c        |  17 ++-
 src/backend/replication/logical/reorderbuffer.c | 182 +++++++++++++++---------
 src/include/access/heapam_xlog.h                |   1 +
 src/include/replication/reorderbuffer.h         |  24 +++-
 5 files changed, 147 insertions(+), 80 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86c2190..8fca8cb 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2017,6 +2017,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45..c841687 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 845d820..f299c64 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -654,11 +654,14 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
-	ReorderBufferTXN *txn;
+	ReorderBufferTXN *txn, *toptxn;
+	bool	can_stream = false;
 
-	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	toptxn = txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
 
 	change->lsn = lsn;
 	change->txn = txn;
@@ -668,9 +671,50 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert then set the corresponding bit.  Otherwise,
+	 * if the toast insert bit is set and this is an insert/update, clear
+	 * the bit.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) &&
+			 ((change->action == REORDER_BUFFER_CHANGE_INSERT) ||
+			 (change->action == REORDER_BUFFER_CHANGE_UPDATE)))
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+		can_stream = true;
+	}
+
+	/*
+	 * If this is a speculative insert then set the corresponding bit.
+	 * Otherwise, if the speculative insert bit is set and this is a spec
+	 * confirm record, clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+		can_stream = true;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/*
+	 * If streaming is enabled and we have serialized this transaction
+	 * because it had an incomplete tuple, then now that we have got the
+	 * complete tuple we can stream it.
+	 */
+	if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+		&& !rbtxn_has_toast_insert(txn) && !rbtxn_has_spec_insert(txn))
+	{
+		ReorderBufferStreamTXN(rb, toptxn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
+	}
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -700,7 +744,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1862,8 +1906,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
+							ReorderBufferToastAppendChunk(rb, txn, relation,
+														  change);
 					}
 
 			change_done:
@@ -2456,7 +2500,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2505,7 +2549,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2528,6 +2572,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2542,8 +2587,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2551,12 +2601,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2617,7 +2675,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2804,15 +2862,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and doesn't have incomplete
+		 * data, remember it.
+		 */
+		if (((!largest) || (txn->total_size > largest->total_size)) &&
+			((txn->size > 0) && !rbtxn_has_toast_insert(txn) &&
+			!rbtxn_has_spec_insert(txn)))
+		largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2830,66 +2889,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/* Loop until we reach under the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
-		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
-		 */
-		txn = ReorderBufferLargestTXN(rb);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferSerializeTXN(rb, txn);
+		/*
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
+		 */
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 603f325..ba2ab71 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes include a toast insert, without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)
+
+/*
+ * This transaction's changes include a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -355,6 +364,9 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -545,7 +557,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool incomplete_data);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

v14-0001-Immediately-WAL-log-assignments.patchapplication/octet-stream; name=v14-0001-Immediately-WAL-log-assignments.patchDownload
From 4e1d4a1d40ae55d5144f73ed7f607b1ad22a3106 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v14 01/10] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is still
required to avoid overflow in the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 ++++++++++++++-------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6b1ae1f..c5842d3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -190,6 +190,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -222,6 +223,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5111,6 +5113,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6013,3 +6016,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it must not already be 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 4259309..3c49954 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976..7b5257f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1189,6 +1189,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1227,6 +1228,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..122c581 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel xid is valid, we need to assign the subxact to the
+	 * toplevel xact. We need to do this for all records, hence we do it before
+	 * the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f60ed2d..6d439d0 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -229,6 +229,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196..756f6df 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -150,6 +150,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -285,6 +287,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

v14-0003-Extend-the-output-plugin-API-with-stream-methods.patchapplication/octet-stream; name=v14-0003-Extend-the-output-plugin-API-with-stream-methods.patchDownload
From 9416517a4e1396f2ce5ac040ada88ba59e4f183d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v14 03/10] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..64f651f 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe..65244b1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and one optional callback
+    (<function>stream_message_cb</function>).
+   </para>
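+
+   <para>
+    As a minimal sketch (assuming hypothetical <function>my_stream_*</function>
+    functions implemented by the plugin), the streaming callbacks might be
+    registered in <function>_PG_output_plugin_init</function> like this:
+<programlisting>
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+    /* regular callbacks (begin_cb, change_cb, commit_cb, ...) go here */
+
+    /* streaming callbacks added by this patch */
+    cb->stream_start_cb = my_stream_start;
+    cb->stream_stop_cb = my_stream_stop;
+    cb->stream_change_cb = my_stream_change;
+    cb->stream_commit_cb = my_stream_commit;
+    cb->stream_abort_cb = my_stream_abort;
+    cb->stream_message_cb = my_stream_message;      /* optional */
+    cb->stream_truncate_cb = my_stream_truncate;    /* optional */
+}
+</programlisting>
+   </para>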
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
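+
+   <para>
+    For instance, when a streamed transaction is rolled back, the sequence
+    might end with an abort instead (a sketch, using the same notation as
+    the example above):
+<programlisting>
+stream_start_cb(...);   &lt;-- start of a block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of the block of changes
+
+stream_abort_cb(...);   &lt;-- abort of the streamed transaction
+</programlisting>
+   </para>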
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point the largest top-level transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and streamed.
+   </para>
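+
+   <para>
+    Conceptually, the selection may be thought of as the following simplified
+    sketch (not the actual implementation; it assumes the per-transaction
+    <structfield>size</structfield> accounting introduced by the memory limit
+    part of this patch series):
+<programlisting>
+static ReorderBufferTXN *
+largest_toplevel_txn(ReorderBuffer *rb)
+{
+    dlist_iter  iter;
+    ReorderBufferTXN *largest = NULL;
+
+    /* walk the toplevel transactions, remember the one using most memory */
+    dlist_foreach(iter, &amp;rb->toplevel_by_lsn)
+    {
+        ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node,
+                                                iter.cur);
+
+        /* "size" is the memory accounting field added by the patch */
+        if (largest == NULL || txn->size > largest->size)
+            largest = txn;
+    }
+
+    return largest;
+}
+</programlisting>
+   </para>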
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253..497d8a9 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the start/stop/change/commit/abort
+	 * callbacks. The message and truncate callbacks are optional, similar
+	 * to regular output plugins. However, we consider streaming enabled
+	 * as soon as at least one of the methods is defined, so that missing
+	 * methods can be identified easily.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional, so
+	 * we do not fail with ERROR when they are missing; the wrappers
+	 * simply do nothing in that case. We still have to set all the
+	 * ReorderBuffer callbacks, otherwise the calls from there would
+	 * crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -862,6 +910,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = txn->first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f..f24e246 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..0d0a94a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287..e102840 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -393,6 +439,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

Attachment: v14-0002-Issue-individual-invalidations-with.patch (application/octet-stream)
From b0ed16bb5a75a08860b18f0212c4445b60dcb382 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v14 02/10] Issue individual invalidations with 
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager.
See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations was accumulating all the invalidations in
memory, and then only wrote them once at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c          |  40 +++++++++
 src/backend/access/transam/xact.c               |   7 ++
 src/backend/replication/logical/decode.c        |  16 ++++
 src/backend/replication/logical/reorderbuffer.c | 104 +++++++++++++++++++++---
 src/backend/utils/cache/inval.c                 |  49 +++++++++++
 src/include/access/xact.h                       |  13 ++-
 src/include/replication/reorderbuffer.h         |  11 +++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index fbc5942..17c06f7 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c5842d3..cf78ffc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6013,6 +6013,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX We ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581..69c1f45 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9..0d5bb73 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2204,6 +2218,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
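+ *
+ * This variant queues the invalidation messages as a
+ * REORDER_BUFFER_CHANGE_INVALIDATION change in the transaction's
+ * changestream, so that they get executed at the right point during replay.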
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2591,6 +2632,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3004,6 +3068,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context,
+										   inval_size);
+				/* read the invalidation messages */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	oldsnap;
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..cba5b6c 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, also write invalidations into WAL at each command
+ *	end, to support decoding of in-progress transactions.  So far it was
+ *	enough to log invalidations only at commit, because we only decoded the
+ *	transaction at commit time.  We only need to log catalog cache and
+ *	relcache invalidations; there cannot be any active MVCC scan in logical
+ *	decoding, so we need not log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b38..b822c5e 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..af35287 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * messages */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1

Attachment: v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From 75427813941c3e8d1af01f7f2800da9994746c5b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 10 Jan 2020 18:03:27 -0300
Subject: [PATCH v14 05/10] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c     |  38 +-
 src/backend/replication/logical/reorderbuffer.c | 713 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  36 ++
 3 files changed, 696 insertions(+), 91 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..160b167 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, we have not
+		 * decoded the combocid yet, which means the cmin is definitely
+		 * in the future and we're not supposed to see the tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, we have not
+		 * decoded the combocid yet, which means the cmax is definitely
+		 * in the future and we're still supposed to see the tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2302875..801fdc5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -773,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -865,6 +910,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -988,7 +1036,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1024,6 +1072,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1089,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1316,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1341,8 +1404,93 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they originally happened inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with a CID before actually decoding it. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build the hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1491,63 +1636,75 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has catalog updates, we might decode a tuple using the
+ * wrong catalog version.  So to detect a concurrent abort we set
+ * CheckXidAlive to the xid of the (sub)transaction the current change
+ * belongs to.  During catalog scans we can then check the status of that
+ * xid, and if it is aborted we report a specific error that we can ignore.
+ * We might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine, because when we decode the abort we
+ * will stream the abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * setup CheckXidAlive if it's not committed yet. We don't check
+	 * if the xid aborted. That will happen during catalog access.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
-	}
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send the data of a transaction (and its subtransactions) to the output
+ * plugin. If streaming is true, the data is sent using the streaming API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 
-	/* build data to be able to lookup the CommandIds of catalog tuples */
+	/*
+	 * build data to be able to look up the CommandIds of catalog tuples
+	 */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1563,15 +1720,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1579,6 +1741,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1588,8 +1763,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
 					if (specinsert == NULL)
 						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1655,7 +1828,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1676,8 +1857,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
 						Assert(change->data.tp.newtuple != NULL);
 
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1695,7 +1874,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +1932,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +1949,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+									change->data.msg.prefix,
+									change->data.msg.message_size,
+									change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,9 +1989,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1818,7 +2011,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
 					}
 
 					break;
@@ -1858,14 +2052,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last LSN of the stream as the final_lsn before
+			 * calling stream_stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+
+			/* Avoid copying if it's already copied. */
+			if (snapshot_now->copied)
+				txn->snapshot_now = snapshot_now;
+			else
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2110,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction, then discard the
+		 * changes that we just streamed, and mark the transaction as streamed
+		 * (if it contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,18 +2145,125 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				FlushErrorState();
+				FreeErrorData(errdata);
+				errdata = NULL;
+
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+
+				/* Avoid copying if it's already copied. */
+				if (snapshot_now->copied)
+					txn->snapshot_now = snapshot_now;
+				else
+					txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+															  txn, command_id);
+				/*
+				 * Set the last LSN of the stream as the final_lsn before
+				 * calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
 /*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -1946,6 +2287,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2363,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2505,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2523,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2535,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2585,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2670,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2399,6 +2781,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because with streaming we don't
+ * update the memory accounting of subtransactions, so their size is always
+ * 0). Here we can simply iterate over the limited number of toplevel
+ * transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2418,15 +2832,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3191,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes left to stream
+ * (it may have been streamed right before the commit, in which case the
+ * commit would attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* this must be the first time this transaction is streamed */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because we might have
+		 * gotten new sub-transactions after the last streaming run, and we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e102840..6d65986 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions, in
+ * which case we'd have nentries == 0 for the toplevel transaction, which
+ * says nothing about streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -225,6 +244,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Have we sent any changes for this transaction to the output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction of this subxact (NULL for toplevel transactions).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -255,6 +284,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1

v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
From 0d995d97213750c321f7ccfbe6d184ccd335fb52 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v14 04/10] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from the system table scan APIs to the backend decoding the
uncommitted transaction. On receipt of this sqlerrcode, the decoding
logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 41 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 38 +++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  8 ++---
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++++--
 src/include/utils/snapmgr.h                     |  4 ++-
 6 files changed, 113 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 65244b1..b59a6c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>pg_current_xact_id()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c4a5aa6..86c2190 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1303,6 +1303,15 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1422,6 +1431,14 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1536,6 +1553,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_hot_search_buffer call during logical decoding");
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1685,6 +1710,14 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_get_latest_tid call during logical decoding");
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5490,6 +5523,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_finish_speculative call during logical decoding");
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..97a1075 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -433,6 +434,25 @@ systable_beginscan(Relation heapRelation,
 }
 
 /*
+ * HandleConcurrentAbort - Handle a concurrent abort of the CheckXidAlive
+ * transaction.
+ *
+ * If CheckXidAlive is valid, we check whether it aborted, and if it did, we
+ * error out.  We can't directly use TransactionIdDidAbort, because after a
+ * crash such a transaction might not have been marked as aborted.  See the
+ * detailed comments in snapmgr.c where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +501,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +543,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -643,6 +675,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0d5bb73..2302875 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -696,7 +696,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1547,7 +1547,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1798,7 +1798,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1818,7 +1818,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c5..93a0c04 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This lets us re-check the XID status during catalog access.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the transaction is not committed yet. We don't
+	 * check whether the xid aborted; that will happen during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13c..12f737b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1

v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch
From 884ef0f05d48ec80692769b92cd648d989cdb2e0 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 11:35:35 +0530
Subject: [PATCH v14 06/10] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information
(e.g. the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions, by spilling the data to disk and then
replaying it on commit.

We must, however, explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so we
have nowhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    4 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   45 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    3 +
 src/backend/replication/logical/launcher.c         |    1 -
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  140 ++-
 src/backend/replication/logical/worker.c           | 1026 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  315 +++++-
 src/backend/replication/slotfuncs.c                |    6 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2033 insertions(+), 43 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8bead..95b7c24 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..3349cc4 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731..f28482f 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 7f15667..65b6b76 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e..4ef4fd4 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4133,6 +4133,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e..8156a42 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 497d8a9..dfc681d 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1148,7 +1148,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1193,7 +1193,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = txn->final_lsn;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..5242ac0 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (we're committing a streamed transaction, so it's valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID (we're aborting a streamed transaction, so it's valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a12..3dc5f83 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires handling aborts of both the toplevel transaction and of
+ * individual subtransactions. This is achieved by tracking offsets for
+ * subtransactions, which are then used to truncate the file with the
+ * serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;						/* XID of the subxact */
+	off_t           offset;					/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -553,6 +658,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, restore the serialized subxact info
+	 * (the changes file itself was already opened above).
+	 *
+	 * XXX Note that cleanup of stale files is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM ABORT message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive an abort
+		 * for a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting one of the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+			return;
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update the replication origin state so we can restart streaming from
+	 * the correct position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -565,6 +982,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1000,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1039,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1157,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1302,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1675,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1816,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1478,6 +1929,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleaning up files for %d transactions", nxids);
+
+	for (i = nxids - 1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1493,6 +1960,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2411,561 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we do free the memory allocated for the subxact info. There might
+	 * be one exceptional transaction with many subxacts, and we don't want
+	 * to keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, verifying the
+ * checksum while reading.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so we can simply ignore it (its first change, and thus
+	 * its offset, was already recorded).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 * it doesn't work then we'll bomb out when opening the file.
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * it doesn't work then we'll bomb out when opening the file
+	 * it doesn't work then we'll bomb out when opening the file.
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Remove the XID by moving the last entry of the array into its place.
+	 * We don't keep the streamed transactions sorted or anything - we only
+	 * expect a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * handling the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: the length (not including
+ * the length field itself), the action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3131,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
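
Note (reviewer aid, not part of the patch): a summary of the two on-disk
formats used by the worker above, derived from the code:

    /*
     * Changes file (stream_write_change / apply_handle_stream_commit),
     * one record per change:
     *
     *     int32 len | char action | (len - 1) bytes of payload
     *
     * where "len" counts the action byte but not the length field itself,
     * and the payload is the original message minus the leading subxact
     * XID (already consumed by handle_streamed_transaction).
     *
     * Subxact file (subxact_info_write / subxact_info_read):
     *
     *     uint32 crc32c | uint32 nsubxacts | nsubxacts * SubXactInfo
     *
     * with SubXactInfo = { TransactionId xid; off_t offset; }. Aborting a
     * subxact thus reduces to finding its entry, truncating the changes
     * file to that offset, and dropping this and all later entries.
     */
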
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 5fbf2d4..b0fba26 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,54 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit time,
+ * and with streamed transactions the commit order may differ from the order
+ * in which the transactions are sent. So streamed transactions are handled
+ * separately, by tracking the XIDs of streamed transactions for which the
+ * schema was already sent (the streamed_txns list below).
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +119,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +145,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +208,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") != 0 &&
+				strcmp(strVal(defel->arg), "off") != 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +237,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +260,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +281,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +370,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's the toplevel transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those are applied only at commit time (regular
+	 * transactions won't see their effects until then), and possibly in an
+	 * order we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change, and
+		 * such a change may occur when streaming has already started, so we
+		 * have to track new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -307,19 +426,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
 		relentry->map = convert_tuples_by_name(indesc, outdesc);
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -343,17 +468,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -362,6 +489,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -390,7 +521,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -410,7 +541,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -434,7 +565,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -454,7 +585,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -479,6 +610,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -507,13 +642,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +723,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -623,6 +844,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema was already sent for this relation in the given
+ * streamed (toplevel) transaction. We expect a relatively small number of
+ * streamed transactions, so a simple list search is fine.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -755,6 +1004,36 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -789,7 +1068,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
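
Note (illustration, not part of the patch): the callback table above also
shows what a third-party output plugin has to do to opt into streaming; the
my_* callbacks below are hypothetical placeholders:

    void
    _PG_output_plugin_init(OutputPluginCallbacks *cb)
    {
        /* ... the usual startup/change/commit/shutdown callbacks ... */

        /* providing the stream_* callbacks is what enables streaming */
        cb->stream_start_cb = my_stream_start;
        cb->stream_stop_cb = my_stream_stop;
        cb->stream_change_cb = my_stream_change;
        cb->stream_abort_cb = my_stream_abort;
        cb->stream_commit_cb = my_stream_commit;
    }

Even then streaming may be forced off for a session - the slotfuncs.c and
walsender.c hunks below clear ctx->streaming while a slot is being created,
because there is nowhere to send the data yet.
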
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index f776de3..9121420 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -156,6 +156,12 @@ create_logical_replication_slot(char *name, char *plugin,
 									NULL);
 
 	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
+	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 122d884..759ca5c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1004,6 +1004,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9..3b3e1fd 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -981,7 +981,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561..89158ed 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
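
Note (illustration, not part of the patch): the stream messages only exist
in protocol version 2, so an external consumer of pgoutput should gate them
on the negotiated version; stream_message_allowed is a hypothetical helper:

    /* Accept stream messages only if protocol >= 2 was negotiated. */
    static bool
    stream_message_allowed(uint32 proto_version, char action)
    {
        bool    is_stream = (action == 'S' || action == 'E' ||
                             action == 'c' || action == 'A');

        return !is_stream ||
               proto_version >= LOGICALREP_PROTO_STREAM_VERSION_NUM;
    }
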
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f1aa6e9..70d39f8 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
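
Note (illustration, not part of the patch): the new WalRcvStreamOptions
field is set by the apply worker from pg_subscription.substream (see the
worker.c hunk above). The libpqwalreceiver hunk is not included in this
excerpt, but the plumbing amounts to adding one more option to the
START_REPLICATION command, roughly:

    appendStringInfo(&cmd, "proto_version '%u'",
                     options->proto.logical.proto_version);
    /* ... publication_names ... */
    if (options->proto.logical.streaming)
        appendStringInfoString(&cmd, ", streaming 'on'");

    /*
     * Resulting command, as parsed by pgoutput's parse_output_parameters:
     *
     *   START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
     *     (proto_version '2', publication_names '"tap_pub"', streaming 'on')
     */
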
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction containing DDL, subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1
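
A side note on the walreceiver.h hunk above: the new
WalRcvStreamOptions.streaming flag only matters if it reaches the
publisher, i.e. it has to be turned into an option of the
START_REPLICATION command. A minimal sketch of what that might look
like on the walreceiver side (the surrounding command-building code and
the "cmd" variable are assumptions of mine, not part of the patch text
above):

	/*
	 * Sketch only: append the streaming option when building the
	 * START_REPLICATION command for a logical slot; pgoutput parses
	 * it on the publisher side.
	 */
	if (options->proto.logical.streaming)
		appendStringInfoString(&cmd, ", streaming 'on'");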

v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
From cb85389c81638ea84e8f6cea006df7648ec6deb6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v14 08/10] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71..086d0c7 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e..ad3ed13 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
From 31a85235bcb18038a86b2e57a9cdcf2d893de27b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v14 09/10] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

#256Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#254)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Apr 14, 2020 at 2:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
This looks wrong. We should either change the name of this macro, or
add the extra byte directly in HEADER_SCRATCH_SIZE with a comment.

I think this is in sync with the code below (SizeOfXlogOrigin), so it
doesn't make much sense to add different terminology, no?
#define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))

In that case, we can rename this, for example, SizeOfXLogTransactionId.
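
For reference, the rename would make the two definitions read
consistently, along these lines (the macro name here is just the
suggestion above, nothing settled):

/* payload size plus the one-byte XLogRecord block ID, in both cases */
#define SizeOfXlogOrigin		(sizeof(RepOriginId) + sizeof(char))
#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))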

Some review comments from 0002-Issue-individual-*.patch:

+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr lsn, int nmsgs,
+ SharedInvalidationMessage *msgs)
+{
+ MemoryContext oldcontext;
+ ReorderBufferChange *change;
+
+ /* XXX Should we even write invalidations without valid XID? */
+ if (xid == InvalidTransactionId)
+ return;
+
+ Assert(xid != InvalidTransactionId);

It seems we don't call the function if xid is not valid.

You have a valid point. Also, it is not clear how the Assert could
ever be hit if we first check (xid == InvalidTransactionId) and return
from the function.

I have changed the code; now we only have an assert.
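
So the entry of the function now reads roughly like this (a sketch of
the revised code; the rest of the body is assumed unchanged):

void
ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
							 XLogRecPtr lsn, int nmsgs,
							 SharedInvalidationMessage *msgs)
{
	MemoryContext oldcontext;
	ReorderBufferChange *change;

	/* invalidations are only logged with a valid XID, so just assert it */
	Assert(xid != InvalidTransactionId);
	...
}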

@@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
}
case XLOG_XACT_ASSIGNMENT:
break;
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ if (!TransactionIdIsValid(xid))
+ break;
+
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Why should we insert a WAL record for such cases?

Right, if there is any such case, we should avoid it.

I think we don't have any such case because we are logging at the
command end. So I have created an assert instead of the check.
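
With that, the decode path looks roughly like this (sketch; the assert
replaces the earlier early return):

	case XLOG_XACT_INVALIDATIONS:
		{
			TransactionId xid;
			xl_xact_invalidations *invals;

			xid = XLogRecGetXid(r);
			invals = (xl_xact_invalidations *) XLogRecGetData(r);

			/* logged at command end, so the XID must be assigned */
			Assert(TransactionIdIsValid(xid));

			ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
										 invals->nmsgs, invals->msgs);
		}
		break;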

One more point about this patch: the commit message needs to be updated:

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?

I think the above part of the commit message is no longer right, as
the patch now does such caching at the command level.

Right, I have removed that.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#257Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#256)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

@@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
}
case XLOG_XACT_ASSIGNMENT:
break;
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ if (!TransactionIdIsValid(xid))
+ break;
+
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Why should we insert a WAL record for such cases?

Right, if there is any such case, we should avoid it.

I think we don't have any such case because we are logging at the
command end. So I have created an assert instead of the check.

Have you tried to ensure this in some way? One idea could be to add
an Assert (to check that the transaction id is assigned) in the new
code where you are writing WAL for this action, and then run make
check-world and/or make installcheck-world.
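
Something as simple as the following at the WAL-emitting site should do
(a sketch; the exact location is wherever the patch writes the
XLOG_XACT_INVALIDATIONS record):

	/* the command must already run with an assigned transaction id */
	Assert(TransactionIdIsValid(GetCurrentTransactionIdIfAny()));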

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#258Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#257)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Apr 14, 2020 at 3:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

@@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
}
case XLOG_XACT_ASSIGNMENT:
break;
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ if (!TransactionIdIsValid(xid))
+ break;
+
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Why should we insert a WAL record for such cases?

Right, if there is any such case, we should avoid it.

I think we don't have any such case because we are logging at the
command end. So I have created an assert instead of the check.

Have you tried to ensure this in some way? One idea could be to add
an Assert (to check that the transaction id is assigned) in the new
code where you are writing WAL for this action, and then run make
check-world and/or make installcheck-world.

Yeah, I had already tested that.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#259Erik Rijkers
er@xs4all.nl
In reply to: Dilip Kumar (#255)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2020-04-14 12:10, Dilip Kumar wrote:

v14-0001-Immediately-WAL-log-assignments.patch +
v14-0002-Issue-individual-invalidations-with.patch +
v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch +
v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
v14-0007-Track-statistics-for-streaming.patch +
v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch +
v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch

applied on top of 8128b0c (a few hours ago)

Hi,

I haven't followed this thread and maybe this instability is
known/expected; just thought I'd let you know.

When running pgbench over logical replication (cascading down two
replicas), I get this segmentation fault.

2020-04-14 17:27:28.135 CEST [8118] DETAIL: Streaming transactions
committing after 0/5FA2A38, reading WAL from 0/5FA2A00.
2020-04-14 17:27:28.135 CEST [8118] LOG: logical decoding found
consistent point at 0/5FA2A00
2020-04-14 17:27:28.135 CEST [8118] DETAIL: There are no running
transactions.
2020-04-14 17:27:28.138 CEST [8006] LOG: server process (PID 8118) was
terminated by signal 11: Segmentation fault
2020-04-14 17:27:28.138 CEST [8006] DETAIL: Failed process was running:
COMMIT
2020-04-14 17:27:28.138 CEST [8006] LOG: terminating any other active
server processes
2020-04-14 17:27:28.138 CEST [8163] WARNING: terminating connection
because of crash of another server process
2020-04-14 17:27:28.138 CEST [8163] DETAIL: The postmaster has
commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2020-04-14 17:27:28.138 CEST [8163] HINT: In a moment you should be
able to reconnect to the database and repeat your command.

This error happens somewhat buried away in my test-stuff; I can dig it
out and make it into a repeatable test if you need it. (debian
stretch/gcc 9.3.0)

Erik Rijkers

#260Dilip Kumar
dilipbalaut@gmail.com
In reply to: Erik Rijkers (#259)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote:

On 2020-04-14 12:10, Dilip Kumar wrote:

v14-0001-Immediately-WAL-log-assignments.patch +
v14-0002-Issue-individual-invalidations-with.patch +
v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch +
v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
v14-0007-Track-statistics-for-streaming.patch +
v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch +
v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch

applied on top of 8128b0c (a few hours ago)

Hi,

I haven't followed this thread and maybe this instability is
known/expected; just thought I'd let you know.

When running pgbench over logical replication (cascading down two
replicas), I get this segmentation fault.

Thanks for testing. Is it possible to share the call stack?

2020-04-14 17:27:28.135 CEST [8118] DETAIL: Streaming transactions
committing after 0/5FA2A38, reading WAL from 0/5FA2A00.
2020-04-14 17:27:28.135 CEST [8118] LOG: logical decoding found
consistent point at 0/5FA2A00
2020-04-14 17:27:28.135 CEST [8118] DETAIL: There are no running
transactions.
2020-04-14 17:27:28.138 CEST [8006] LOG: server process (PID 8118) was
terminated by signal 11: Segmentation fault
2020-04-14 17:27:28.138 CEST [8006] DETAIL: Failed process was running:
COMMIT
2020-04-14 17:27:28.138 CEST [8006] LOG: terminating any other active
server processes
2020-04-14 17:27:28.138 CEST [8163] WARNING: terminating connection
because of crash of another server process
2020-04-14 17:27:28.138 CEST [8163] DETAIL: The postmaster has
commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2020-04-14 17:27:28.138 CEST [8163] HINT: In a moment you should be
able to reconnect to the database and repeat your command.

This error happens somewhat buried away in my test-stuff; I can dig it
out and make it into a repeatable test if you need it. (debian
stretch/gcc 9.3.0)

Yeah, that will be great.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#261Dilip Kumar
dilipbalaut@gmail.com
In reply to: Erik Rijkers (#259)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote:

On 2020-04-14 12:10, Dilip Kumar wrote:

v14-0001-Immediately-WAL-log-assignments.patch +
v14-0002-Issue-individual-invalidations-with.patch +
v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch +
v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
v14-0007-Track-statistics-for-streaming.patch +
v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch +
v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch

applied on top of 8128b0c (a few hours ago)

Hi Erik,

While setting up the cascading replication I have hit one issue on
base code[1]. After fixing that I have got one crash with streaming
enabled on the patched build. I am not sure whether you are facing
either of these two issues or some other issue. If yours is not one of
these, please share the call stack and steps to reproduce.

[1]: /messages/by-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

bugfix_in_schema_sent.patch
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 59a09b9..811706a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -999,7 +999,10 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
#262Erik Rijkers
er@xs4all.nl
In reply to: Dilip Kumar (#261)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2020-04-16 11:33, Dilip Kumar wrote:

On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote:

On 2020-04-14 12:10, Dilip Kumar wrote:

v14-0001-Immediately-WAL-log-assignments.patch +
v14-0002-Issue-individual-invalidations-with.patch +
v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch +
v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
v14-0007-Track-statistics-for-streaming.patch +
v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch +
v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch

applied on top of 8128b0c (a few hours ago)

I've added your new patch
[bugfix_replica_identity_full_on_subscriber.patch] on top of all those
above, but the crash (apparently the same crash) that I had earlier
still occurs (and pretty soon).

server process (PID 1721) was terminated by signal 11: Segmentation
fault

I'll try to isolate it better and get a stack trace.


Hi Erik,

While setting up the cascading replication I have hit one issue on
base code[1]. After fixing that I have got one crash with streaming
enabled on the patched build. I am not sure whether you are facing
either of these two issues or some other issue. If yours is not one of
these, please share the call stack and steps to reproduce.

[1]
/messages/by-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#263Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Dilip Kumar (#255)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

A few review comments from 0006-Add-support-for-streaming*.patch:

+ subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
lseek can return a negative value in case of error, right?
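
Something along these lines would cover it (a sketch; the error message
wording is illustrative):

	off_t		endpos = lseek(stream_fd, 0, SEEK_END);

	if (endpos < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not seek in streamed transaction file: %m")));
	subxacts[nsubxacts].offset = endpos;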

+ /*
+ * We might need to create the tablespace's tempfile directory, if no
+ * one has yet done so.
+ *
+ * Don't check for error from mkdir; it could fail if the directory
+ * already exists (maybe someone else just did the same thing).  If
+ * it doesn't work then we'll bomb out when opening the file
+ */
+ mkdir(tempdirpath, S_IRWXU);
If that's the only reason, perhaps we can use something like the following:

if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not create directory \"%s\": %m", tempdirpath)));

+
+ CloseTransientFile(stream_fd);
This might fail to close the file; we should handle that case.
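
For example (again a sketch, message wording illustrative):

	if (CloseTransientFile(stream_fd) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not close streamed transaction file: %m")));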

Also, I think we need some implementation in dumpSubscription() to
dump the (streaming = 'on') option.
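
Roughly like the following (a sketch; the struct field and catalog
column name "substream" are assumptions on my side):

	/* hypothetical: subinfo->substream mirrors a new pg_subscription column */
	if (strcmp(subinfo->substream, "f") != 0)
		appendPQExpBufferStr(query, ", streaming = on");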

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#264Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dilip Kumar (#249)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Apr 13, 2020 at 05:20:39PM +0530, Dilip Kumar wrote:

On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have rebased the patch on the latest head. I haven't yet changed
anything for the xid assignment thing because that discussion is not
yet concluded.

Some review comments from 0001-Immediately-WAL-log-*.patch,

+bool
+IsSubTransactionAssignmentPending(void)
+{
+ if (!XLogLogicalInfoActive())
+ return false;
+
+ /* we need to be in a transaction state */
+ if (!IsTransactionState())
+ return false;
+
+ /* it has to be a subtransaction */
+ if (!IsSubTransaction())
+ return false;
+
+ /* the subtransaction has to have a XID assigned */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return false;
+
+ /* and it needs to have 'assigned' */
+ return !CurrentTransactionState->assigned;
+
+}
IMHO, it's important to reduce the complexity of this function since
it's called for every WAL insertion. During the lifespan of a
transaction, each of these if conditions will only be evaluated if the
previous conditions are true. So we could maintain some state machine
to avoid evaluating a condition multiple times inside a transaction. But
if the overhead is not much, it's probably not worth it, I guess.

Yeah, maybe in some cases we can avoid checking multiple conditions by
maintaining that state. But that state would have to be kept at the
transaction level, and I am not sure it would be worth adding one extra
condition just to skip a few if checks; it would also add code
complexity. And in cases where logical decoding is not enabled, it may
add one extra check: you first check the state, and that only takes you
to the first if check anyway.

Perhaps. I think we should only do that if we can demonstrate it's an
issue in practice. Otherwise it's just unnecessary complexity.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#265Erik Rijkers
er@xs4all.nl
In reply to: Erik Rijkers (#262)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2020-04-16 11:46, Erik Rijkers wrote:

On 2020-04-16 11:33, Dilip Kumar wrote:

On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote:

On 2020-04-14 12:10, Dilip Kumar wrote:

v14-0001-Immediately-WAL-log-assignments.patch +
v14-0002-Issue-individual-invalidations-with.patch +
v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch +
v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
v14-0007-Track-statistics-for-streaming.patch +
v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch +
v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch

applied on top of 8128b0c (a few hours ago)

I've added your new patch

[bugfix_replica_identity_full_on_subscriber.patch]

on top of all those above but the crash (apparently the same crash)
that I had earlier still occurs (and pretty soon).

server process (PID 1721) was terminated by signal 11: Segmentation
fault

I'll try to isolate it better and get a stacktrace

Hi Erik,

While setting up the cascading replication I have hit one issue on
base code[1]. After fixing that I have got one crash with streaming
enabled on the patched build. I am not sure whether you are facing
either of these two issues or some other issue. If yours is not one of
these, please share the call stack and steps to reproduce.

I figured out a few things about this. Attached is a bash script
test.sh, to reproduce:

There is a variable CRASH_IT that determines whether the whole thing
will fail (with a segmentation fault) or not. As attached it has
CRASH_IT=0 and does not crash. When you change that to CRASH_IT=1, then
it will crash. It turns out that this just depends on a short wait
state (3 seconds, on my machine) between setting up the replication and
the running of pgbench. It's possible that it does not occur on very
fast machines; we've had such differences between hardware before.
This is an i5-3330S.

It deletes files so look it over before you run it. It may also depend
on some of my local set-up but I guess that should be easily fixed.

Can you let me know if you can reproduce the problem with this?

thanks,

Erik Rijkers


[1]
/messages/by-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#266Erik Rijkers
er@xs4all.nl
In reply to: Erik Rijkers (#265)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2020-04-18 11:07, Erik Rijkers wrote:

Hi Erik,

While setting up the cascading replication I have hit one issue on
base code[1]. After fixing that I have got one crash with streaming
on patch. I am not sure whether you are facing any of these 2 issues
or any other issue. If your issue is not any of these then plese
share the callstack and steps to reproduce.

I figured out a few things about this. Attached is a bash script
test.sh, to reproduce:

And the attached file, test.sh. (sorry)


There is a variable CRASH_IT that determines whether the whole thing
will fail (with a segmentation fault) or not. As attached it has
CRASH_IT=0 and does not crash. When you change that to CRASH_IT=1,
then it will crash. It turns out that this just depends on a short
wait state (3 seconds, on my machine) between setting up the
replication and the running of pgbench. It's possible that it does not
occur on very fast machines; we've had such differences between
hardware before. This is an i5-3330S.

It deletes files so look it over before you run it. It may also
depend on some of my local set-up but I guess that should be easily
fixed.

Can you let me know if you can reproduce the problem with this?

thanks,

Erik Rijkers

[1]
/messages/by-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

test.shtext/x-shellscript; name=test.shDownload
#267Erik Rijkers
er@xs4all.nl
In reply to: Erik Rijkers (#266)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2020-04-18 11:10, Erik Rijkers wrote:

On 2020-04-18 11:07, Erik Rijkers wrote:

Hi Erik,

While setting up the cascading replication I have hit one issue on
the base code[1]. After fixing that I have got one crash with streaming
on the patch. I am not sure whether you are facing either of these two
issues or some other issue. If your issue is not one of these, then please
share the call stack and steps to reproduce.

I figured out a few things about this. Attached is a bash script
test.sh, to reproduce:

And the attached file, test.sh. (sorry)

It turns out I must have been mistaken somewhere. (I probably missed
bugfix_in_schema_sent.patch.)

I have just now rebuilt all the instances on top of master with these
patches:

[v14-0001-Immediately-WAL-log-assignments.patch]
[v14-0002-Issue-individual-invalidations-with.patch]
[v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
[v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
[v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
[v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
[v14-0007-Track-statistics-for-streaming.patch]
[v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
[v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
[v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
[bugfix_in_schema_sent.patch]

(by the way: this build's regression tests 'ddl', 'toast', and
'spill' fail)

I now seem able to run all my test programs on these instances without
errors.

Sorry, I seem to have raised a false alarm (although initially there
certainly was a problem).

Erik Rijkers

There is a variable CRASH_IT that determines whether the whole thing
will fail (with a segmentation fault) or not. As attached it has
CRASH_IT=0 and does not crash. When you change that to CRASH_IT=1,
then it will crash. It turns out that this just depends on a short
wait state (3 seconds, on my machine) between setting up the
replication and the running of pgbench. It's possible that it does
not occur on very fast machines; we've seen such differences between
hardware before. This is an i5-3330S.

It deletes files, so look it over before you run it. It may also
depend on some of my local set-up, but I guess that should be easy
to fix.

Can you let me know if you can reproduce the problem with this?

thanks,

Erik Rijkers

[1]
/messages/by-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#268Dilip Kumar
dilipbalaut@gmail.com
In reply to: Erik Rijkers (#267)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er@xs4all.nl> wrote:

On 2020-04-18 11:10, Erik Rijkers wrote:

On 2020-04-18 11:07, Erik Rijkers wrote:

Hi Erik,

While setting up the cascading replication I have hit one issue on
the base code[1]. After fixing that I have got one crash with streaming
on the patch. I am not sure whether you are facing either of these two
issues or some other issue. If your issue is not one of these, then please
share the call stack and steps to reproduce.

I figured out a few things about this. Attached is a bash script
test.sh, to reproduce:

And the attached file, test.sh. (sorry)

It turns out I must have been mistaken somewhere. (I probably missed
bugfix_in_schema_sent.patch.)

I have just now rebuilt all the instances on top of master with these
patches:

[v14-0001-Immediately-WAL-log-assignments.patch]
[v14-0002-Issue-individual-invalidations-with.patch]
[v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
[v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
[v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
[v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
[v14-0007-Track-statistics-for-streaming.patch]
[v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
[v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
[v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
[bugfix_in_schema_sent.patch]

(by the way: this build's regression tests 'ddl', 'toast', and
'spill' fail)

I now seem able to run all my test programs on these instances without
errors.

Sorry, I seem to have raised a false alarm (although initially there
certainly was a problem).

No problem, thanks for confirming.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#269Dilip Kumar
dilipbalaut@gmail.com
In reply to: Erik Rijkers (#267)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er@xs4all.nl> wrote:

On 2020-04-18 11:10, Erik Rijkers wrote:

On 2020-04-18 11:07, Erik Rijkers wrote:

Hi Erik,

While setting up the cascading replication I have hit one issue on
the base code[1]. After fixing that I have got one crash with streaming
on the patch. I am not sure whether you are facing either of these two
issues or some other issue. If your issue is not one of these, then please
share the call stack and steps to reproduce.

I figured out a few things about this. Attached is a bash script
test.sh, to reproduce:

And the attached file, test.sh. (sorry)

It turns out I must have been mistaken somewhere. (I probably missed
bugfix_in_schema_sent.patch.)

I have just now rebuilt all the instances on top of master with these
patches:

[v14-0001-Immediately-WAL-log-assignments.patch]
[v14-0002-Issue-individual-invalidations-with.patch]
[v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
[v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
[v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
[v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
[v14-0007-Track-statistics-for-streaming.patch]
[v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
[v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
[v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
[bugfix_in_schema_sent.patch]

(by the way: this build's regression tests 'ddl', 'toast', and
'spill' fail)

Yeah, this is a known issue, actually: while streaming the
transaction, the output messages are changed. I have a plan to work on
this part.
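
For reference, the difference shows up directly in the test_decoding
output: the strings emitted by the new stream callbacks (see 0003)
replace the usual BEGIN/COMMIT framing, which is what breaks the 'ddl',
'toast', and 'spill' expected files. A sketch of the two shapes, with
an illustrative XID:

    BEGIN 508
    table public.t: INSERT: ...
    COMMIT 508

versus, when the transaction is streamed:

    opening a streamed block for transaction TXN 508
    streaming change for TXN 508
    closing a streamed block for transaction TXN 508
    committing streamed transaction TXN 508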

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#270Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#269)
11 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er@xs4all.nl> wrote:

On 2020-04-18 11:10, Erik Rijkers wrote:

On 2020-04-18 11:07, Erik Rijkers wrote:

Hi Erik,

While setting up the cascading replication I have hit one issue on
the base code[1]. After fixing that I have got one crash with streaming
on the patch. I am not sure whether you are facing either of these two
issues or some other issue. If your issue is not one of these, then please
share the call stack and steps to reproduce.

I figured out a few things about this. Attached is a bash script
test.sh, to reproduce:

And the attached file, test.sh. (sorry)

It turns out I must have been mistaken somewhere. (I probably missed
bugfix_in_schema_sent.patch.)

I have just now rebuilt all the instances on top of master with these
patches:

[v14-0001-Immediately-WAL-log-assignments.patch]
[v14-0002-Issue-individual-invalidations-with.patch]
[v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
[v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
[v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
[v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
[v14-0007-Track-statistics-for-streaming.patch]
[v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
[v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
[v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
[bugfix_in_schema_sent.patch]

(by the way: this build's regression tests 'ddl', 'toast', and
'spill' fail)

Yeah, this is a known issue, actually: while streaming the
transaction, the output messages are changed. I have a plan to work on
this part.

I have fixed this part. Basically, I have now created a separate
function to get the streaming changes,
'pg_logical_slot_get_streaming_changes'. So the default function
pg_logical_slot_get_changes will work as it is, and the test_decoding
test cases will not fail.
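
A minimal sketch of the resulting interface, assuming the new function
takes the same arguments as pg_logical_slot_get_changes (the slot name
is illustrative):

    -- unchanged: transactions are decoded and emitted at commit time
    SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);

    -- new in v15: may emit stream_start / stream_change / stream_stop
    -- blocks for large in-progress transactions
    SELECT data FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);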

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v15-0001-Immediately-WAL-log-assignments.patchapplication/octet-stream; name=v15-0001-Immediately-WAL-log-assignments.patchDownload
From e856c586bb131e1047471ecfddc6b0f118f1b0f3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v15 01/11] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead). However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is
required to avoid overflow in the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 ++++++++++++++-------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3984dd3..c2604bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 4259309..3c49954 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976..7b5257f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1189,6 +1189,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1227,6 +1228,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..122c581 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel xid is valid, we need to assign the subxact to the
+	 * toplevel xact. We need to do this for all records, hence we do it before
+	 * the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f60ed2d..6d439d0 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -229,6 +229,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196..756f6df 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -150,6 +150,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -285,6 +287,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1
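
Note that the immediate assignment logging in 0001 is gated on
XLogLogicalInfoActive(), i.e. it only happens with logical WAL level.
A minimal sketch of the setup needed to exercise it (server restart
required; the slot name is illustrative):

    -- wal_level=logical makes XLogLogicalInfoActive() true, so subxact
    -- assignments are piggy-backed onto the next WAL record
    ALTER SYSTEM SET wal_level = 'logical';

    -- after a restart, create a slot to decode with
    SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');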

v15-0002-Issue-individual-invalidations-with-wal_level-lo.patchapplication/octet-stream; name=v15-0002-Issue-individual-invalidations-with-wal_level-lo.patchDownload
From 36577ba865b02113645c76c92b214b606a634727 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v15 02/11] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager.
See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulated all the invalidations in memory
and wrote them only once, at commit time, which may reduce the
performance impact by amortizing the overhead and deduplicating the
invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c          |  40 +++++++++
 src/backend/access/transam/xact.c               |   7 ++
 src/backend/replication/logical/decode.c        |  16 ++++
 src/backend/replication/logical/reorderbuffer.c | 104 +++++++++++++++++++++---
 src/backend/utils/cache/inval.c                 |  49 +++++++++++
 src/include/access/xact.h                       |  13 ++-
 src/include/replication/reorderbuffer.h         |  11 +++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index fbc5942..17c06f7 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c2604bb..8e6b1a6 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581..69c1f45 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9..0d5bb73 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2204,6 +2218,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2591,6 +2632,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3004,6 +3068,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context,
+										   inval_size);
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	oldsnap;
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..cba5b6c 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end
+ *	to support decoding of in-progress transactions.  Previously it was
+ *	enough to log invalidations only at commit, because we only decoded the
+ *	transaction at commit time.  We only need to log catalog cache and
+ *	relcache invalidations; there cannot be any active MVCC scan in logical
+ *	decoding, so we don't need to log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs  = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b38..b822c5e 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..af35287 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1
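
The per-command invalidations in 0002 matter for transactions that mix
DDL and DML and are decoded while still in progress; a sketch of such
a transaction (table name illustrative):

    BEGIN;
    CREATE TABLE t (a int);
    -- the catcache/relcache invalidations logged at the end of this
    -- command must be replayed before decoding the inserts below
    ALTER TABLE t ADD COLUMN b text;
    INSERT INTO t SELECT i, 'x' FROM generate_series(1, 100000) i;
    COMMIT;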

v15-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patchapplication/octet-stream; name=v15-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patchDownload
From 1ef5f6fbc9525a24f83e43cdeea9f46a13848cbd Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v15 04/11] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. The decoding logic on the receipt
of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 41 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 38 +++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  8 ++---
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++++--
 src/include/utils/snapmgr.h                     |  4 ++-
 6 files changed, 113 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 65244b1..b59a6c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>pg_current_xact_id()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0d4ed60..84884a4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,15 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1497,6 +1514,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_hot_search_buffer call during logical decoding");
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1646,6 +1671,14 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_get_latest_tid call during logical decoding");
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5451,6 +5484,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_finish_speculative call during logical decoding");
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..97a1075 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -433,6 +434,25 @@ systable_beginscan(Relation heapRelation,
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
+ * out.  We can't directly use TransactionIdDidAbort, as after a crash such a
+ * transaction might not have been marked as aborted.  See the detailed
+ * comments in snapmgr.c where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +501,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +543,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -643,6 +675,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0d5bb73..2302875 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -696,7 +696,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1547,7 +1547,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1798,7 +1798,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1818,7 +1818,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c5..93a0c04 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This is to re-check XID status while accessing catalog.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid aborted; that will happen during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13c..12f737b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1
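
The concurrent-abort handling in 0004 covers a transaction that rolls
back while its changes are still being streamed; a sketch of the
scenario (table name illustrative):

    -- a transaction large enough to exceed the memory limit and be
    -- streamed while still in progress
    BEGIN;
    INSERT INTO t SELECT i FROM generate_series(1, 1000000) i;
    -- if this happens while the decoder is reading catalogs, the
    -- systable_* scans report ERRCODE_TRANSACTION_ROLLBACK and the
    -- decoding of this transaction stops gracefully
    ROLLBACK;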

v15-0003-Extend-the-output-plugin-API-with-stream-methods.patchapplication/octet-stream; name=v15-0003-Extend-the-output-plugin-API-with-stream-methods.patchDownload
From a65eda69a3bd6810223ea321d24f8080451e3351 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v15 03/11] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes for large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..64f651f 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe..65244b1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
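+
+   <para>
+    For example (a hypothetical interleaving, for illustration only), when a
+    partially streamed subtransaction gets aborted, the sequence might look
+    like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- changes from a subtransaction
+  stream_change_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);
+
+stream_abort_cb(...);   &lt;-- abort of the streamed subtransaction
+
+stream_start_cb(...);   &lt;-- remaining changes of the toplevel transaction
+  stream_change_cb(...);
+stream_stop_cb(...);
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>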
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
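
Zooming out to the receiving side for a moment, here is a sketch (illustrative
only, not part of the patch) of how a downstream client might consume this
protocol: buffer the streamed changes per toplevel transaction, apply them on
stream_commit, and throw them away on stream_abort. apply_change() is a
hypothetical stand-in for the real apply logic.

#include "postgres.h"
#include "nodes/pg_list.h"

extern void apply_change(void *change);	/* hypothetical */

typedef struct StreamBuffer
{
	TransactionId xid;			/* toplevel transaction being streamed */
	List	   *changes;		/* changes buffered so far */
} StreamBuffer;

/* stream_change: buffer only, nothing is applied yet */
static void
on_stream_change(StreamBuffer *buf, void *change)
{
	buf->changes = lappend(buf->changes, change);
}

/* stream_commit: apply everything, preserving commit order */
static void
on_stream_commit(StreamBuffer *buf)
{
	ListCell   *lc;

	foreach(lc, buf->changes)
		apply_change(lfirst(lc));

	list_free(buf->changes);
	buf->changes = NIL;
}

/* stream_abort: discard everything buffered for this transaction */
static void
on_stream_abort(StreamBuffer *buf)
{
	list_free(buf->changes);
	buf->changes = NIL;
}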
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253..497d8a9 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require change/commit/abort callbacks. The
+	 * message callback is optional, similarly to regular output plugins.
+	 * However, we enable streaming when at least one of the methods is
+	 * defined, so that missing methods can be identified easily.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * Streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional, so we
+	 * do not fail with ERROR when they are missing; the wrappers simply do
+	 * nothing. We must still set the ReorderBuffer callbacks to something,
+	 * otherwise the calls from there would crash (we don't want to move
+	 * the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -862,6 +910,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
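
Side note: the five "required callback missing" checks in the wrappers above
are copies of one another. A sketch of how they could be collapsed into a
single helper (illustrative only, not part of the patch):

#define ENSURE_STREAM_CB(ptr, name) \
	do { \
		if ((ptr) == NULL) \
			ereport(ERROR, \
					(errmsg("output plugin supports streaming, but has not " \
							"registered %s callback", (name)))); \
	} while (0)

/* e.g. in stream_change_cb_wrapper: */
/* ENSURE_STREAM_CB(ctx->callbacks.stream_change_cb, "stream_change_cb"); */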
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f..f24e246 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..0d0a94a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287..e102840 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -393,6 +439,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

v15-0005-Implement-streaming-mode-in-ReorderBuffer.patchapplication/octet-stream; name=v15-0005-Implement-streaming-mode-in-ReorderBuffer.patchDownload
From e10c22b9424d66d5bff25faa9c24c2e41c110276 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 10 Jan 2020 18:03:27 -0300
Subject: [PATCH v15 05/11] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c     |  38 +-
 src/backend/replication/logical/reorderbuffer.c | 713 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  36 ++
 3 files changed, 696 insertions(+), 91 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..160b167 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2302875..801fdc5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -773,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -865,6 +910,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -988,7 +1036,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1024,6 +1072,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1089,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1316,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1341,8 +1404,93 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We build the hash table even if there are no CIDs. That's because
+ * when streaming in-progress transactions we may run into tuples with
+ * a CID before actually decoding the corresponding tuplecid record.
+ * Think e.g. about INSERT followed by TRUNCATE, where the TRUNCATE may
+ * not be decoded yet when applying the INSERT. So we always build the
+ * hash table, so that ResolveCminCmaxDuringDecoding does not segfault
+ * in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding a transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1491,63 +1636,75 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, the (sub)transaction might get
+ * aborted concurrently.  In such a case, if the (sub)transaction has made
+ * catalog changes, we might decode tuples using the wrong catalog version.
+ * To detect a concurrent abort, we set CheckXidAlive to the xid of the
+ * (sub)transaction the current change belongs to.  During catalog scans we
+ * can then check the status of that xid, and if it has aborted we report a
+ * specific error that we can ignore.  We might have already streamed some of
+ * the changes for the aborted (sub)transaction, but that is fine because when
+ * we decode the abort we will stream an abort message to truncate the changes
+ * on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Set up CheckXidAlive if the transaction is not yet committed. We
+	 * don't check whether the xid was aborted; that happens during
+	 * catalog access.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
-	}
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 
-	/* build data to be able to lookup the CommandIds of catalog tuples */
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
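
The consuming side of CheckXidAlive lives in the catalog access paths
elsewhere in this patch series. Roughly (a sketch of the idea, not the
actual hunk): if the xid we are decoding for has aborted under us, raise
ERRCODE_TRANSACTION_ROLLBACK, which the PG_CATCH block below swallows and
treats as a concurrent abort rather than a hard error.

static inline void
CheckConcurrentAbortSketch(void)
{
	if (TransactionIdIsValid(CheckXidAlive) &&
		TransactionIdDidAbort(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}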
@@ -1563,15 +1720,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
 		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1579,6 +1741,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1588,8 +1763,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1655,7 +1828,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1676,8 +1857,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1695,7 +1874,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +1932,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +1949,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,9 +1989,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1818,7 +2011,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
 					}
 
 					break;
@@ -1858,14 +2052,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last of the stream as the final lsn before calling
+			 * stream stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+
+			/* Avoid copying if it's already copied. */
+			if (snapshot_now->copied)
+				txn->snapshot_now = snapshot_now;
+			else
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2110,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,18 +2145,125 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				FlushErrorState();
+				FreeErrorData(errdata);
+				errdata = NULL;
+
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+
+				/* Avoid copying if it's already copied. */
+				if (snapshot_now->copied)
+					txn->snapshot_now = snapshot_now;
+				else
+					txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+															  txn, command_id);
+				/*
+				 * Set the LSN of the last change in this stream as final_lsn
+				 * before calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
 /*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -1946,6 +2287,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2363,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2505,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - one in the reorder buffer, and one in the
+ * transaction containing the change. The reorder buffer counter allows us
+ * to quickly decide if we reached the memory limit; the transaction
+ * counter allows us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we update the counters of the toplevel
+ * transaction instead - we can't stream subtransactions individually
+ * anyway, and we only ever pick toplevel transactions for eviction.
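+ *
+ * For example (a sketch): with streaming enabled, a 100-byte change that
+ * belongs to a subtransaction is accounted to its toplevel transaction,
+ * so both toptxn->size and rb->size grow by 100 bytes; without streaming,
+ * the subtransaction's own size grows instead.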
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2523,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2535,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2585,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2670,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2399,6 +2781,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (with streaming we don't update the
+ * memory accounting for subtransactions, so their size is always 0), but it
+ * only has to iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2418,15 +2832,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3191,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
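+/*
+ * Check whether the logical decoding context supports streaming, i.e.
+ * whether the output plugin provides the stream callbacks (the flag is
+ * presumably set up when the decoding context is created).
+ */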
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes to stream?
+ * It might have been streamed right before the commit, in which case the
+ * commit would attempt to stream it again with nothing to send.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't
+		 * beat the LSN condition in the previous branch (so there's no need
+		 * to walk through the subxacts again). In fact, we must not do
+		 * that, as we may be half-way through a subxact whose snapshot we
+		 * are using.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because new
+		 * sub-transactions may have been added since the last streaming
+		 * run, and they need to be included in the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e102840..6d65986 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill to
+ * disk when streaming is not supported by the plugin), so at most one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -225,6 +244,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Have we sent any changes for this transaction to the output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -255,6 +284,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1

Attachment: v15-0006-Add-support-for-streaming-to-built-in-replicatio.patch (application/octet-stream)
From 5918775be74d7b7b99091d445c0fbf5d39dc8430 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 16 Apr 2020 01:55:22 -0700
Subject: [PATCH v15 06/11] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, to identify in-progress
transactions and to carry additional bits of information (e.g. the
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying it on commit.

We must, however, explicitly disable streaming during replication slot
creation, even if the plugin supports it. We don't need to replicate
the changes accumulated during this phase, and moreover we don't have
a replication connection open yet, so there is nowhere to send the
data anyway.
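
To illustrate, a large transaction might arrive at the subscriber as
roughly the following sequence of protocol messages (a sketch only;
the action bytes and fields are defined in proto.c below, and the
number of start/stop blocks depends on how often the upstream hits
the memory limit):

    'S'  stream start   (xid, first_segment = 1)
    'R'  relation       (subxact xid, relation metadata)
    'I'  insert         (subxact xid, relation, tuple)
    ...
    'E'  stream stop
    'S'  stream start   (xid, first_segment = 0)
    ...
    'E'  stream stop
    'c'  stream commit  (xid, flags, commit_lsn, end_lsn, commit_time)

An aborted subtransaction instead produces an 'A' stream abort message,
carrying the toplevel XID and the subxact XID.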
---
 doc/src/sgml/ref/alter_subscription.sgml           |    4 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   45 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    3 +
 src/backend/replication/logical/launcher.c         |    1 -
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  140 ++-
 src/backend/replication/logical/worker.c           | 1026 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  315 +++++-
 src/backend/replication/slotfuncs.c                |    6 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2033 insertions(+), 43 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8bead..95b7c24 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..3349cc4 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731..f28482f 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 7f15667..65b6b76 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+	{
+		*streaming_given = false;
+		*streaming = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e..4ef4fd4 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4133,6 +4133,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e..8156a42 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 497d8a9..dfc681d 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1148,7 +1148,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1193,7 +1193,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = txn->final_lsn;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..5242ac0 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (must be valid for a streamed transaction) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction IDs (both must be valid for a streamed abort) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
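
For reference, here is a minimal sketch of how an output plugin callback
on the publisher side might use these functions. The callback name and
the first-segment test are assumptions based on this patch;
OutputPluginPrepareWrite/OutputPluginWrite are the standard output
plugin helpers:

    static void
    pgoutput_stream_start(LogicalDecodingContext *ctx,
                          ReorderBufferTXN *txn)
    {
        OutputPluginPrepareWrite(ctx, true);

        /* first segment unless this transaction was already streamed */
        logicalrep_write_stream_start(ctx->out, txn->xid,
                                      !rbtxn_is_streamed(txn));

        OutputPluginWrite(ctx, true);
    }
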
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a12..3dc5f83 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires dealing with aborts of both the toplevel transaction and its
+ * subtransactions. This is achieved by tracking the file offset of each
+ * subtransaction, which is then used to truncate the file with serialized
+ * changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
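+ *
+ * For example (with a hypothetical subscription OID 16394 and toplevel
+ * XID 512), the changes would be spooled into logical-16394-512.changes
+ * and the subxact info into logical-16394-512.subxacts, following the
+ * naming scheme in changes_filename() and subxact_filename() below.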
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId	xid;			/* XID of the subxact */
+	off_t			offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the message to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -553,6 +658,318 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the serialized information
+	 * about subxacts (the spool file itself was already opened above).
+	 *
+	 * XXX Note that cleanup of stale files is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM ABORT message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+			return;
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached the end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Update the replication origin state so we can restart streaming from
+	 * the correct position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -565,6 +982,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1000,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1039,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1157,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1302,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1675,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1816,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1478,6 +1929,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleaning up files for %d transactions", nxids);
+
+	for (i = nxids - 1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1493,6 +1960,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2411,561 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main
+ * file. The file is always overwritten as a whole, and we also include a
+ * CRC32C checksum of the information.
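+ *
+ * On-disk layout (a sketch of what the code below writes, in order):
+ *
+ *    uint32       checksum    CRC32C over the two following fields
+ *    uint32       nsubxacts   number of entries in the array
+ *    SubXactInfo  subxacts[]  one (xid, offset) pair per subxact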
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	CloseTransientFile(fd);
+
+	/*
+	 * But we do free the memory allocated for the subxact info. There might
+	 * be one exceptional transaction with many subxacts, and we don't want
+	 * to keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	CloseTransientFile(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen
+	 * in the last call, so we can simply ignore it (changes for the same
+	 * subxact tend to arrive in runs).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts.
+	 * We intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If
+	 * already exists (maybe someone else just did the same thing).  If it
+	 * doesn't work, then we'll bomb out when opening the file.
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 *
+	 * Don't check for error from mkdir; it could fail if the directory
+	 * already exists (maybe someone else just did the same thing).  If it
+	 * doesn't work, then we'll bomb out when opening the file.
+	 */
+	mkdir(tempdirpath, S_IRWXU);
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Clean up the XID from the tracking array - find the XID in the
+	 * array and remove it by moving the last element into its place. The
+	 * array is bound to be fairly small (at most the number of
+	 * in-progress xacts, so max_connections + max_prepared_transactions),
+	 * so a simple linear search for the index of the XID is fine.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect a few
+	 * of them in progress (max_connections + max_prepared_transactions),
+	 * so a linear search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. a sorted array,
+		 * to speed up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
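
(Since maxnxids starts at 64 and doubles on overflow, even tracking a thousand concurrent streamed transactions costs only four repalloc calls, so the unsorted array stays cheap in practice.)
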
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length word (which does
+ * not count the length word itself), an action code (identifying the
+ * message type) and the message contents (without the subxact
+ * TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
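
For reference, each record appended to the .changes file by stream_write_change is framed like this (an illustrative sketch matching the writes above, not a struct that exists in the patch):

	/*
	 * On-disk framing of one streamed change:
	 *
	 *   int32   len                size of what follows (action + payload)
	 *   char    action             logical replication message type
	 *   char    payload[len - 1]   message contents, minus the subxact XID
	 */
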
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3131,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 77b85fc..59a09b9 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,54 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream side is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order the transactions are sent in. So streamed transactions are handled
+ * separately, by tracking the toplevel transactions the schema was sent in
+ * (see the streamed_txns list below).
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +119,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +145,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +208,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +237,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +260,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +281,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficiently recent protocol
+		 * version, and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
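
To make the negotiation above concrete: a subscriber that wants streamed transactions has to request both the bumped protocol version and the new option when starting logical replication, along these lines (the slot and publication names are made up):

	START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
	    (proto_version '2', publication_names '"tap_pub"', streaming 'on')
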
@@ -290,9 +370,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't
+	 * care whether it's a top-level transaction or not (we have already
+	 * sent the toplevel XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied only later (and the regular
+	 * transactions won't see their effects until then), and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change, and
+		 * a change may occur after streaming has already started, so we have
+		 * to track new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +427,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +469,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +490,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +522,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +542,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +566,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +586,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +611,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +643,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -588,6 +724,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
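
Taken together, the callbacks above frame a streamed transaction on the wire roughly like this (a sketch; the XID is made up):

	stream_start (xid 1234, first_segment = true)
	    change, change, ...        (pgoutput_change, with in_streaming set)
	stream_stop
	stream_start (xid 1234)
	    change, ...
	stream_stop
	stream_commit (xid 1234)       or stream_abort (xid 1234, subxid)
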
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -624,6 +845,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -756,6 +1005,36 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
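
For example, if streamed transaction 1234 sent the schema for some relation and later commits, the commit path sets entry->schema_sent = true (the subscriber has updated its relation cache by then) and removes 1234 from streamed_txns; on abort only the list entry is removed, so the schema is simply sent again to the next transaction that needs it.
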
+/*
  * Relcache invalidation callback
  */
 static void
@@ -790,7 +1069,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index f776de3..9121420 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -156,6 +156,12 @@ create_logical_replication_slot(char *name, char *plugin,
 									NULL);
 
 	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
+	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0e93322..eacea12 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1004,6 +1004,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9..3b3e1fd 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -981,7 +981,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561..89158ed 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
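
On the apply side the read counterparts mirror the writers; a minimal sketch of consuming the stream-control messages (only the signatures come from this header, the surrounding dispatch is hypothetical):

	StringInfoData s;		/* payload, action byte already consumed */
	bool		first_segment;
	TransactionId xid;
	TransactionId subxid;

	xid = logicalrep_read_stream_start(&s, &first_segment);
	...
	logicalrep_read_stream_abort(&s, &xid, &subxid);
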
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f1aa6e9..70d39f8 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
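
The expected counts follow directly from the statements above: rows 1 and 2 preexist, the INSERT adds 3..5000 for 5000 rows total, and the DELETE removes the 1666 rows divisible by three, leaving 3334.
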
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
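
Each of the five blocks above inserts 500 rows and then deletes the newly inserted multiples of three (166 or 167 rows per block), which nets out to the 1667 surviving rows the test expects.
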
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of a large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
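
The expected counts can be verified by hand: 2 preexisting rows plus inserts 3-4, 5..2000 and 2001-2002 give 2002 rows; c exists for every row inserted after the first ALTER (1999 rows), d only for 1001..2000 plus the last two rows (1002), and e only for row 2002.
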
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of a large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
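
ROLLBACK TO s1 discards everything inserted after the first savepoint, so only rows 3..500 and 2501..3000 survive; together with the 2 preexisting rows that is 1000, and count(c) is 0 because column c exists only on the subscriber and has no default.
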
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a transaction exceeding logical_decoding_work_mem, with DDL, subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
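
Here ROLLBACK TO s1 also undoes the later ALTERs, so the surviving rows are 3..500 (inserted before column c existed) plus the final 501..1000 batch (which supplies c explicitly): 1000 rows, 500 of them with a non-NULL c.
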
-- 
1.8.3.1

Attachment: v15-0009-Add-TAP-test-for-streaming-vs.-DDL.patch (application/octet-stream)
From a05c493fe4bd2ee009525b4e5050c4657c5bd35b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v15 09/11] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of a large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
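
The expected result is straightforward to reconstruct: 2 preexisting rows plus seven surviving insert batches (the 4001..5000 batch is discarded by ROLLBACK TO SAVEPOINT s10) give 7000 rows, and c is NULL only for the 1000 rows present before the first ALTER, hence count(c) = 6000.
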
-- 
1.8.3.1

Attachment: v15-0008-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From fddff6b6ba5f078c12324642e0ab6ac647cc4426 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v15 08/11] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71..086d0c7 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e..ad3ed13 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1
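
For anyone skimming the test diffs above: the whole patch amounts to adding
the streaming option wherever a subscription is created. A minimal sketch,
with made-up subscription, connection, and publication names:

    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION mypub
        WITH (streaming = on);

The option is per-subscription, so subscriptions created without it keep the
existing spill-to-disk behaviour.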

v15-0010-Bugfix-handling-of-incomplete-toast-tuple.patch (application/octet-stream)
From b3a19c455130853bc4efb12bd68933a5224cad86 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 28 Feb 2020 11:07:46 +0530
Subject: [PATCH v15 10/11] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c                |   3 +
 src/backend/replication/logical/decode.c        |  17 ++-
 src/backend/replication/logical/reorderbuffer.c | 182 +++++++++++++++---------
 src/include/access/heapam_xlog.h                |   1 +
 src/include/replication/reorderbuffer.h         |  24 +++-
 5 files changed, 147 insertions(+), 80 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 84884a4..18b0bad 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1978,6 +1978,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45..c841687 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 845d820..f299c64 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -654,11 +654,14 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
-	ReorderBufferTXN *txn;
+	ReorderBufferTXN *txn, *toptxn;
+	bool	can_stream = false;
 
-	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	toptxn = txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
 
 	change->lsn = lsn;
 	change->txn = txn;
@@ -668,9 +671,50 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert then set the corresponding bit.  Otherwise,
+	 * if the toast-insert bit is set and this is an insert/update, clear
+	 * the bit (the tuple is now complete).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) &&
+			 ((change->action == REORDER_BUFFER_CHANGE_INSERT) ||
+			 (change->action == REORDER_BUFFER_CHANGE_UPDATE)))
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+		can_stream = true;
+	}
+
+	/*
+	 * If this is a speculative insert then set the corresponding bit.
+	 * Otherwise, if the speculative-insert bit is set and this is a
+	 * spec-confirm record, clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+		can_stream = true;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/*
+	 * If streaming is enabled and we have previously serialized this
+	 * transaction because it had an incomplete tuple, then now that the
+	 * tuple is complete we can stream it.
+	 */
+	if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+		&& !rbtxn_has_toast_insert(txn) && !rbtxn_has_spec_insert(txn))
+	{
+		ReorderBufferStreamTXN(rb, toptxn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
+	}
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -700,7 +744,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1862,8 +1906,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
+							ReorderBufferToastAppendChunk(rb, txn, relation,
+														  change);
 					}
 
 			change_done:
@@ -2456,7 +2500,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2505,7 +2549,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2528,6 +2572,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2542,8 +2587,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2551,12 +2601,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2617,7 +2675,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2804,15 +2862,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and doesn't have incomplete
+		 * data, remember it.
+		 */
+		if (((!largest) || (txn->total_size > largest->total_size)) &&
+			((txn->size > 0) && !rbtxn_has_toast_insert(txn) &&
+			 !rbtxn_has_spec_insert(txn)))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2830,66 +2889,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/* Loop until we reach under the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
-		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
-		 */
-		txn = ReorderBufferLargestTXN(rb);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferSerializeTXN(rb, txn);
+		/*
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
+		 */
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 603f325..ba2ab71 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert, without the main-table insert. */
+#define rbtxn_has_toast_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)
+
+/*
+ * This transaction's changes have a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -355,6 +364,9 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -545,7 +557,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1
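
A hypothetical reproducer sketch for the case this bugfix addresses (table
name and row size are made up): a row with a toasted column is WAL-logged as
a series of TOAST-chunk inserts followed by the main-table insert, and the
transaction must not be streamed while the tuple is still incomplete.

    CREATE TABLE toasted (id int PRIMARY KEY, payload text);
    BEGIN;
    INSERT INTO toasted
        SELECT 1, string_agg(i::text, '') FROM generate_series(1, 100000) i;
    COMMIT;

With a small logical_decoding_work_mem, the memory limit can be hit after the
TOAST chunks have been decoded but before the main-table insert arrives,
which is exactly the window in which the new RBTXN_HAS_TOAST_INSERT flag
blocks streaming.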

v15-0007-Track-statistics-for-streaming.patch (application/octet-stream)
From 54abdb2e87b0d65b124b59b27e918e4f64aa95a8 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 2 Apr 2020 13:19:29 +0530
Subject: [PATCH v15 07/11] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6562cc4..d8bf587 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2063,6 +2063,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber
+      after the memory used by logical decoding exceeds
+      <literal>logical_decoding_work_mem</literal>.  Streaming only works with
+      toplevel transactions (subtransactions can't be streamed independently),
+      so the counter does not get incremented for subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the
+      subscriber.  Transactions may get streamed repeatedly, and this counter
+      gets incremented on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the
+      subscriber.</entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d406ea8..65d650d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 801fdc5..845d820 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3282,6 +3286,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count a transaction twice if it has already been streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index eacea12..d0028d9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1333,7 +1333,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1354,7 +1354,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or were
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2396,6 +2397,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3237,7 +3241,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3295,6 +3299,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3320,6 +3327,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3422,6 +3432,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3670,11 +3685,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4bce3ad..9fb1ffe 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6d65986..603f325 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -517,15 +517,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec..b997d17 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac31840..68e2deb 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
1.8.3.1
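
An illustrative query against the new columns on the publisher side (only the
column names come from the patch; run it while a walsender is active):

    SELECT application_name,
           spill_txns, spill_count, spill_bytes,
           stream_txns, stream_count, stream_bytes
    FROM pg_stat_replication;

Note that stream_count and stream_bytes grow on every streaming invocation,
while stream_txns counts each toplevel transaction only once.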

v15-0011-Provide-new-api-to-get-the-streaming-changes.patch (application/octet-stream)
From d9b8f3ab04980d57a2df7f120b7c522c381935db Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 22 Apr 2020 16:33:07 +0530
Subject: [PATCH v15 11/11] Provide new api to get the streaming changes

---
 src/backend/catalog/system_views.sql           |  8 ++++++++
 src/backend/replication/logical/logicalfuncs.c | 23 ++++++++++++++++++-----
 src/include/catalog/pg_proc.dat                |  9 +++++++++
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 65d650d..d9ab14b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1242,6 +1242,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index f5384f1..7561141 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -237,6 +238,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 									LogicalOutputPrepareWrite,
 									LogicalOutputWrite, NULL);
 
+		/* If the caller has not asked for streaming changes, disable them. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -347,7 +351,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -356,7 +369,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -365,7 +378,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -374,7 +387,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9fb1ffe..3dfc5c1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10117,6 +10117,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
1.8.3.1
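
A sketch of exercising the new function with test_decoding (the slot name and
the lowered memory limit are arbitrary):

    SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
    SET logical_decoding_work_mem = '64kB';
    -- ... run a transaction large enough to exceed the limit ...
    SELECT data FROM pg_logical_slot_get_streaming_changes('regression_slot', NULL, NULL,
        'include-xids', '0', 'skip-empty-xacts', '1');

Unlike pg_logical_slot_get_changes(), this variant leaves ctx->streaming set,
so (when the output plugin supports the stream callbacks) a large in-progress
transaction is emitted as stream blocks rather than being spilled to disk.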

#271Erik Rijkers
er@xs4all.nl
In reply to: Dilip Kumar (#270)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2020-04-22 16:49, Dilip Kumar wrote:

On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut@gmail.com>
wrote:

(by the way: this build's regression tests 'ddl', 'toast', and
'spill' fail)

Yeah, this is a known issue; actually, while streaming the
transaction the output messages are changed. I have a plan to work on
this part.

I have fixed this part. Basically, I have now created a separate
function to get the streaming changes,
'pg_logical_slot_get_streaming_changes'. So the default function
pg_logical_slot_get_changes will work as it is, and the test_decoding
test cases will not fail.

The 'ddl' one is apparently not quite fixed - I get this in '(cd
contrib; make check)' (in both assert-enabled and non-assert-enabled
builds)

grep -A7 -B7 make.check_contrib.out:

contrib/make.check_contrib.out-============== initializing database system ==============
contrib/make.check_contrib.out-============== starting postmaster ==============
contrib/make.check_contrib.out-running on port 64464 with PID 9175
contrib/make.check_contrib.out-============== creating database "contrib_regression" ==============
contrib/make.check_contrib.out-CREATE DATABASE
contrib/make.check_contrib.out-ALTER DATABASE
contrib/make.check_contrib.out-============== running regression test queries ==============
contrib/make.check_contrib.out:test ddl                  ... FAILED      840 ms
contrib/make.check_contrib.out-test xact                 ... ok           24 ms
contrib/make.check_contrib.out-test rewrite              ... ok          187 ms
contrib/make.check_contrib.out-test toast                ... ok          851 ms
contrib/make.check_contrib.out-test permissions          ... ok           26 ms
contrib/make.check_contrib.out-test decoding_in_xact     ... ok           31 ms
contrib/make.check_contrib.out-test decoding_into_rel    ... ok           25 ms
contrib/make.check_contrib.out-test binary               ... ok           12 ms

Otherwise patches apply and build OK so will go run some tests...

#272Dilip Kumar
dilipbalaut@gmail.com
In reply to: Erik Rijkers (#271)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:

On 2020-04-22 16:49, Dilip Kumar wrote:

On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut@gmail.com>
wrote:

(by the way: this build's regression tests 'ddl', 'toast', and
'spill' fail)

Yeah, this is a known issue; actually, while streaming the
transaction the output messages are changed. I have a plan to work on
this part.

I have fixed this part. Basically, I have now created a separate
function to get the streaming changes,
'pg_logical_slot_get_streaming_changes'. So the default function
pg_logical_slot_get_changes will work as it is, and the test_decoding
test cases will not fail.

The 'ddl' one is apparently not quite fixed - I get this in '(cd
contrib; make check)' (in both assert-enabled and non-assert-enabled
builds)

Can you send me the contrib/test_decoding/regression.diffs file?

grep -A7 -B7 make.check_contrib.out:

contrib/make.check_contrib.out-============== initializing database system ==============
contrib/make.check_contrib.out-============== starting postmaster ==============
contrib/make.check_contrib.out-running on port 64464 with PID 9175
contrib/make.check_contrib.out-============== creating database "contrib_regression" ==============
contrib/make.check_contrib.out-CREATE DATABASE
contrib/make.check_contrib.out-ALTER DATABASE
contrib/make.check_contrib.out-============== running regression test queries ==============
contrib/make.check_contrib.out:test ddl                  ... FAILED      840 ms
contrib/make.check_contrib.out-test xact                 ... ok           24 ms
contrib/make.check_contrib.out-test rewrite              ... ok          187 ms
contrib/make.check_contrib.out-test toast                ... ok          851 ms
contrib/make.check_contrib.out-test permissions          ... ok           26 ms
contrib/make.check_contrib.out-test decoding_in_xact     ... ok           31 ms
contrib/make.check_contrib.out-test decoding_into_rel    ... ok           25 ms
contrib/make.check_contrib.out-test binary               ... ok           12 ms

Otherwise patches apply and build OK so will go run some tests...

Thanks

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#273Erik Rijkers
er@xs4all.nl
In reply to: Dilip Kumar (#272)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2020-04-23 05:24, Dilip Kumar wrote:

On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:

The 'ddl' one is apparently not quite fixed - I get this in (cd
contrib; make check)' (in both assert-enabled and non-assert-enabled
build)

Can you send me the contrib/test_decoding/regression.diffs file?

Attached.

Below is the patch list, in case that was unclear

20200422/v15-0001-Immediately-WAL-log-assignments.patch
20200422/v15-0002-Issue-individual-invalidations-with-wal_level-lo.patch
20200422/v15-0003-Extend-the-output-plugin-API-with-stream-methods.patch
20200422/v15-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
20200422/v15-0005-Implement-streaming-mode-in-ReorderBuffer.patch
20200422/v15-0006-Add-support-for-streaming-to-built-in-replicatio.patch
20200422/v15-0007-Track-statistics-for-streaming.patch
20200422/v15-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
20200422/v15-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
20200422/v15-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
20200422/v15-0011-Provide-new-api-to-get-the-streaming-changes.patch
20200414/bugfix_in_schema_sent.patch


Attachments:

regression.diffs (text/x-diff)
diff -U3 /home/aardvark/pg_stuff/pg_sandbox/pgsql.large_logical/contrib/test_decoding/expected/ddl.out /home/aardvark/pg_stuff/pg_sandbox/pgsql.large_logical/contrib/test_decoding/results/ddl.out
--- /home/aardvark/pg_stuff/pg_sandbox/pgsql.large_logical/contrib/test_decoding/expected/ddl.out	2020-04-22 18:08:28.166822219 +0200
+++ /home/aardvark/pg_stuff/pg_sandbox/pgsql.large_logical/contrib/test_decoding/results/ddl.out	2020-04-22 18:18:55.996887367 +0200
@@ -245,15 +245,7 @@
 FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1')
 GROUP BY substring(data, 1, 24)
 ORDER BY 1,2;
- count |                                  min                                  |                                  max                                   
--------+-----------------------------------------------------------------------+------------------------------------------------------------------------
-     1 | BEGIN                                                                 | BEGIN
-     1 | COMMIT                                                                | COMMIT
-     1 | message: transactional: 1 prefix: test, sz: 14 content:tx logical msg | message: transactional: 1 prefix: test, sz: 14 content:tx logical msg
-     1 | table public.tr_oddlength: INSERT: id[text]:'ab' data[text]:'foo'     | table public.tr_oddlength: INSERT: id[text]:'ab' data[text]:'foo'
- 20467 | table public.tr_etoomuch: DELETE: id[integer]:1                       | table public.tr_etoomuch: UPDATE: id[integer]:9999 data[integer]:-9999
-(5 rows)
-
+ERROR:  invalid memory alloc request size 94119198201896
 -- check updates of primary keys work correctly
 BEGIN;
 CREATE TABLE spoolme AS SELECT g.i FROM generate_series(1, 5000) g(i);
@@ -266,13 +258,7 @@
 SELECT data
 FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1')
 WHERE data ~ 'UPDATE';
-                                                    data                                                     
--------------------------------------------------------------------------------------------------------------
- table public.tr_etoomuch: UPDATE: old-key: id[integer]:5000 new-tuple: id[integer]:-5000 data[integer]:5000
- table public.tr_oddlength: UPDATE: old-key: id[text]:'ab' new-tuple: id[text]:'x' data[text]:'quux'
- table public.tr_oddlength: UPDATE: old-key: id[text]:'x' new-tuple: id[text]:'yy' data[text]:'a'
-(3 rows)
-
+ERROR:  invalid memory alloc request size 94119197988536
 -- check that a large, spooled, upsert works
 INSERT INTO tr_etoomuch (id, data)
 SELECT g.i, -g.i FROM generate_series(8000, 12000) g(i)
@@ -281,14 +267,7 @@
 FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1') WITH ORDINALITY
 GROUP BY 1
 ORDER BY min(ordinality);
-           substring           | count 
--------------------------------+-------
- BEGIN                         |     1
- table public.tr_etoomuch: UPD |  2235
- table public.tr_etoomuch: INS |  1766
- COMMIT                        |     1
-(4 rows)
-
+ERROR:  invalid memory alloc request size 94119198201896
 /*
  * check whether we decode subtransactions correctly in relation with each
  * other
@@ -310,18 +289,7 @@
 RELEASE SAVEPOINT b;
 COMMIT;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-                                 data                                 
-----------------------------------------------------------------------
- BEGIN
- table public.tr_sub: INSERT: id[integer]:1 path[text]:'1-top-#1'
- table public.tr_sub: INSERT: id[integer]:2 path[text]:'1-top-1-#1'
- table public.tr_sub: INSERT: id[integer]:3 path[text]:'1-top-1-#2'
- table public.tr_sub: INSERT: id[integer]:4 path[text]:'1-top-2-1-#1'
- table public.tr_sub: INSERT: id[integer]:5 path[text]:'1-top-2-1-#2'
- table public.tr_sub: INSERT: id[integer]:6 path[text]:'1-top-2-#1'
- COMMIT
-(8 rows)
-
+ERROR:  invalid memory alloc request size 94119197980328
 -- check that we handle xlog assignments correctly
 BEGIN;
 -- nest 80 subtxns
@@ -349,16 +317,7 @@
 INSERT INTO tr_sub(path) VALUES ('2-top-#1');
 COMMIT;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-                                  data                                  
-------------------------------------------------------------------------
- BEGIN
- table public.tr_sub: INSERT: id[integer]:7 path[text]:'2-top-1...--#1'
- table public.tr_sub: INSERT: id[integer]:8 path[text]:'2-top-1...--#2'
- table public.tr_sub: INSERT: id[integer]:9 path[text]:'2-top-1...--#3'
- table public.tr_sub: INSERT: id[integer]:10 path[text]:'2-top-#1'
- COMMIT
-(6 rows)
-
+ERROR:  invalid memory alloc request size 94119197980328
 -- make sure rollbacked subtransactions aren't decoded
 BEGIN;
 INSERT INTO tr_sub(path) VALUES ('3-top-2-#1');
@@ -370,15 +329,7 @@
 INSERT INTO tr_sub(path) VALUES ('3-top-2-#2');
 COMMIT;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-                                 data                                  
------------------------------------------------------------------------
- BEGIN
- table public.tr_sub: INSERT: id[integer]:11 path[text]:'3-top-2-#1'
- table public.tr_sub: INSERT: id[integer]:12 path[text]:'3-top-2-1-#1'
- table public.tr_sub: INSERT: id[integer]:14 path[text]:'3-top-2-#2'
- COMMIT
-(5 rows)
-
+ERROR:  invalid memory alloc request size 94119197980328
 -- test whether a known, but not yet logged toplevel xact, followed by a
 -- subxact commit is handled correctly
 BEGIN;
@@ -399,16 +350,7 @@
 INSERT INTO tr_sub(path) VALUES ('5-top-1-#1');
 COMMIT;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-                                data                                 
----------------------------------------------------------------------
- BEGIN
- table public.tr_sub: INSERT: id[integer]:15 path[text]:'4-top-1-#1'
- COMMIT
- BEGIN
- table public.tr_sub: INSERT: id[integer]:16 path[text]:'5-top-1-#1'
- COMMIT
-(6 rows)
-
+ERROR:  invalid memory alloc request size 94119197980328
 -- check that DDL in aborted subtransactions handled correctly
 CREATE TABLE tr_sub_ddl(data int);
 BEGIN;
@@ -420,13 +362,7 @@
 INSERT INTO tr_sub_ddl VALUES(43);
 COMMIT;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-                       data                       
---------------------------------------------------
- BEGIN
- table public.tr_sub_ddl: INSERT: data[bigint]:43
- COMMIT
-(3 rows)
-
+ERROR:  invalid memory alloc request size 94119198078792
 /*
  * Check whether treating a table as a catalog table works somewhat
  */
@@ -497,22 +433,7 @@
 INSERT INTO replication_metadata(relation, options)
 VALUES ('zaphod', NULL);
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-                                                                data                                                                
-------------------------------------------------------------------------------------------------------------------------------------
- BEGIN
- table public.replication_metadata: INSERT: id[integer]:1 relation[name]:'foo' options[text[]]:'{a,b}'
- COMMIT
- BEGIN
- table public.replication_metadata: INSERT: id[integer]:2 relation[name]:'bar' options[text[]]:'{a,b}'
- COMMIT
- BEGIN
- table public.replication_metadata: INSERT: id[integer]:3 relation[name]:'blub' options[text[]]:null
- COMMIT
- BEGIN
- table public.replication_metadata: INSERT: id[integer]:4 relation[name]:'zaphod' options[text[]]:null rewritemeornot[integer]:null
- COMMIT
-(12 rows)
-
+ERROR:  invalid memory alloc request size 94119206549464
 /*
  * check whether we handle updates/deletes correct with & without a pkey
  */
@@ -585,113 +506,7 @@
     SET toasted_col1 = (SELECT string_agg(g.i::text, '') FROM generate_series(1, 2000) g(i))
 WHERE id = 1;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
- [expected output elided: a single very wide "data" column holding the ~2000-character toasted values; the header and rows were garbled into whitespace in the archive]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------
- BEGIN
- table public.table_without_key: INSERT: id[integer]:1 data[integer]:1
- table public.table_without_key: INSERT: id[integer]:2 data[integer]:2
- COMMIT
- BEGIN
- table public.table_without_key: DELETE: (no-tuple-data)
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: id[integer]:-2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: old-key: id[integer]:2 data[integer]:3 new-tuple: id[integer]:-2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: old-key: id[integer]:-2 data[integer]:3 new-tuple: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: old-key: id[integer]:2 data[integer]:3 new-tuple: id[integer]:-2 data[integer]:3 new_column[text]:null
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: old-key: id[integer]:-2 data[integer]:3 new-tuple: id[integer]:2 data[integer]:3 new_column[text]:'someval'
- COMMIT
- BEGIN
- table public.table_without_key: DELETE: id[integer]:2 data[integer]:3 new_column[text]:'someval'
- COMMIT
- BEGIN
- table public.table_with_pkey: INSERT: id[integer]:1 data[integer]:1
- table public.table_with_pkey: INSERT: id[integer]:2 data[integer]:2
- COMMIT
- BEGIN
- table public.table_with_pkey: DELETE: id[integer]:1
- COMMIT
- BEGIN
- table public.table_with_pkey: UPDATE: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_pkey: UPDATE: old-key: id[integer]:2 new-tuple: id[integer]:-2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_pkey: UPDATE: old-key: id[integer]:-2 new-tuple: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_pkey: UPDATE: old-key: id[integer]:2 new-tuple: id[integer]:-2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_pkey: UPDATE: old-key: id[integer]:-2 new-tuple: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_pkey: DELETE: id[integer]:2
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: INSERT: id[integer]:1 data[integer]:1
- table public.table_with_unique_not_null: INSERT: id[integer]:2 data[integer]:2
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: DELETE: (no-tuple-data)
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: UPDATE: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: UPDATE: id[integer]:-2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: UPDATE: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: DELETE: (no-tuple-data)
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: INSERT: id[integer]:3 data[integer]:1
- table public.table_with_unique_not_null: INSERT: id[integer]:4 data[integer]:2
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: DELETE: id[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: UPDATE: id[integer]:4 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: UPDATE: old-key: id[integer]:4 new-tuple: id[integer]:-4 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: UPDATE: old-key: id[integer]:-4 new-tuple: id[integer]:4 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: DELETE: id[integer]:4
- COMMIT
- BEGIN
- table public.toasttable: INSERT: id[integer]:1 toasted_col1[text]:'123456789101112131415...199819992000' rand1[double precision]:79 toasted_col2[text]:null rand2[double precision]:1578
- COMMIT
- BEGIN
- table public.toasttable: INSERT: id[integer]:2 toasted_col1[text]:null rand1[double precision]:3077 toasted_col2[text]:'000100020003000400050006...049804990500000100020003...
030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019
101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900
800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680
469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357
035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024
602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401
350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230
024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412
041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030
103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901
900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780
079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467
046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035
603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402
[... several thousand more digits of the long toasted value truncated ...]' rand2[double precision]:4576
- COMMIT
- BEGIN
- table public.toasttable: UPDATE: id[integer]:1 toasted_col1[text]:'12345678910[... integers 11 through 1999 concatenated, truncated ...]2000' rand1[double precision]:79 toasted_col2[text]:null rand2[double precision]:1578
- COMMIT
-(103 rows)
-
+ERROR:  invalid memory alloc request size 94119197734824
 INSERT INTO toasttable(toasted_col1) SELECT string_agg(g.i::text, '') FROM generate_series(1, 2000) g(i);
 -- update of second column, first column unchanged
 UPDATE toasttable
@@ -700,22 +515,10 @@
 -- make sure we decode correctly even if the toast table is gone
 DROP TABLE toasttable;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-                                        data
-----------------------------------------------------------------------------------
- BEGIN
- table public.toasttable: INSERT: id[integer]:3 toasted_col1[text]:'12345678910[... integers 11 through 1999 concatenated, truncated ...]2000' rand1[double precision]:6075 toasted_col2[text]:null rand2[double precision]:7574
- COMMIT
- BEGIN
- table public.toasttable: UPDATE: id[integer]:1 toasted_col1[text]:'12345678910[... integers 11 through 1999 concatenated, truncated ...]2000' rand2[double precision]:1578
- COMMIT
-(6 rows)
-
+ERROR:  invalid memory alloc request size 94119197734824
 -- done, free logical replication slot
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
- data 
-------
-(0 rows)
-
+ERROR:  invalid memory alloc request size 94119197734824
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
 --------------------------
#274Dilip Kumar
dilipbalaut@gmail.com
In reply to: Erik Rijkers (#273)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote:

On 2020-04-23 05:24, Dilip Kumar wrote:

On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:

The 'ddl' one is apparently not quite fixed - I get this in '(cd
contrib; make check)' (in both assert-enabled and non-assert-enabled
builds)

Can you send me the contrib/test_decoding/regression.diffs file?

Attached.

So from regression.diffs, it appears that it is failing in a memory
allocation (+ERROR: invalid memory alloc request size
94119198201896). My colleague tried to reproduce this in a different
environment, but with no success so far. One more thing that
surprises me is that after
(v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
it should never take the streaming path at all. However, we can
not ignore the fact that some of the changes might impact the
non-streaming path as well. Is it possible for you to somehow stop or
break in the code and send the stack trace? One idea: from the error
message in the log we can see where the error is raised, i.e. in
MemoryContextAlloc or palloc or some other similar function. Once we
know that, we can convert that error to an assert and find the call
stack.
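
For illustration, a minimal sketch of that debugging aid (hypothetical,
not part of any posted patch), placed at the size check in
MemoryContextAlloc in src/backend/utils/mmgr/mcxt.c:

/*
 * Temporarily turn the "invalid memory alloc request size" error into
 * an assertion failure, so that an assert-enabled build aborts here
 * and the core file captures the call stack of the bogus allocation.
 */
Assert(AllocSizeIsValid(size));

if (!AllocSizeIsValid(size))
	elog(ERROR, "invalid memory alloc request size %zu", size);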

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#275Dilip Kumar
dilipbalaut@gmail.com
In reply to: Kuntal Ghosh (#263)
12 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Apr 17, 2020 at 1:40 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Few review comments from 0006-Add-support-for-streaming*.patch

+ subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
lseek can return (-)ve value in case of error, right?
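
(For reference, a minimal sketch of such a check - the wording of the
error message is illustrative only:)

subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
if (subxacts[nsubxacts].offset < 0)
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not seek to end of streaming file: %m")));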

+ /*
+ * We might need to create the tablespace's tempfile directory, if no
+ * one has yet done so.
+ *
+ * Don't check for error from mkdir; it could fail if the directory
+ * already exists (maybe someone else just did the same thing).  If
+ * it doesn't work then we'll bomb out when opening the file
+ */
+ mkdir(tempdirpath, S_IRWXU);
If that's the only reason, perhaps we can use something like the
following:

if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not create directory \"%s\": %m",
					tempdirpath)));

Done

+
+ CloseTransientFile(stream_fd);
Might failed to close the file. We should handle the case.

Changed

Still, one place is pending because I don't have the filename there to
report in the error. One option is to just raise the error without the
filename. I will try to think about this part.
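
A minimal sketch of the checked close, assuming the file path is
still available at the call site (where it is not, the error would
indeed have to omit the filename):

if (CloseTransientFile(stream_fd) != 0)
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not close file \"%s\": %m", path)));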

Also, I think we need an implementation in dumpSubscription() to
dump the (streaming = 'on') option.

Right, created another patch and attached.
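
For context, the pg_dump side could look roughly like this - a sketch
only; the substream field and its 'f'/'t' text representation are
assumptions about the catalog query in getSubscriptions():

/* in dumpSubscription(), along with the other WITH (...) options */
if (strcmp(subinfo->substream, "f") != 0)
	appendPQExpBufferStr(query, ", streaming = on");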

I have also fixed a couple of bugs reported internally by my colleague
Neha Sharma.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v16-0005-Implement-streaming-mode-in-ReorderBuffer.patchapplication/octet-stream; name=v16-0005-Implement-streaming-mode-in-ReorderBuffer.patchDownload
From cdf1728d12f4440f0ab6ef3e4cc5ace6ae2645a7 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 10 Jan 2020 18:03:27 -0300
Subject: [PATCH v16 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
in ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds a ReorderBufferTXN pointer in two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c     |  38 +-
 src/backend/replication/logical/reorderbuffer.c | 720 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  36 ++
 3 files changed, 702 insertions(+), 92 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..160b167 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2302875..1868ab2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -773,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -865,6 +910,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -988,7 +1036,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1024,6 +1072,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1089,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1316,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1341,8 +1404,93 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1491,63 +1636,76 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In that case, if the
+ * (sub)transaction has catalog updates, we might decode a tuple using the
+ * wrong catalog version.  So to detect a concurrent abort we set
+ * CheckXidAlive to the xid of the (sub)transaction that this change
+ * belongs to.  Then, during catalog scans, we check the status of that xid
+ * and if it is aborted we report a specific error that we can ignore.  We
+ * might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the abort we will
+ * stream abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether the xid aborted; that will happen during catalog access.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
-	}
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
 
-	/* build data to be able to lookup the CommandIds of catalog tuples */
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, txn->xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1564,14 +1722,17 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1579,6 +1740,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1588,8 +1762,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1655,7 +1827,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1676,8 +1856,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1695,7 +1873,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +1931,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +1948,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,9 +1988,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+										  txn->xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1818,7 +2010,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash,
+											  txn->xid);
 					}
 
 					break;
@@ -1858,14 +2051,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the LSN of the last change in the stream as the final
+			 * lsn before calling stream stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if transaction is streaming
+		 * otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+
+			/* Avoid copying if it's already copied. */
+			if (snapshot_now->copied)
+				txn->snapshot_now = snapshot_now;
+			else
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2109,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,18 +2144,131 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				FlushErrorState();
+				FreeErrorData(errdata);
+				errdata = NULL;
+
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+
+				/* Avoid copying if it's already copied. */
+				if (snapshot_now->copied)
+					txn->snapshot_now = snapshot_now;
+				else
+					txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+															  txn, command_id);
+				/*
+				 * Set the LSN of the last change in the stream as the final
+				 * lsn before calling stream stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+				ReorderBufferToastReset(rb, txn);
+				if (specinsert != NULL)
+				{
+					ReorderBufferReturnChange(rb, specinsert);
+					specinsert = NULL;
+				}
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
 /*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -1946,6 +2292,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2368,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2510,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2528,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2540,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2590,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2675,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2399,6 +2786,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming is enabled, so their size is
+ * always 0). Here we can simply iterate over the limited number of toplevel
+ * transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2418,15 +2837,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3196,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which attempts to
+ * stream it again before the commit)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We can not use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e102840..6d65986 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -225,6 +244,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Have we sent any changes for this transaction in output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -255,6 +284,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
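
As an aside, the toptxn pointer added by this patch lets an output
plugin tell subtransaction aborts apart from toplevel aborts. A
hypothetical stream_abort callback (illustrative only, using the
callback signature defined by this patch series) might use it like
this:

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	if (txn->toptxn != NULL)
		/* subxact abort: discard only this subxact's buffered changes */
		elog(DEBUG1, "discarding subxact %u of toplevel xact %u",
			 txn->xid, txn->toptxn->xid);
	else
		/* toplevel abort: discard everything buffered for this xact */
		elog(DEBUG1, "discarding toplevel xact %u", txn->xid);
}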

v16-0009-Add-TAP-test-for-streaming-vs.-DDL.patchapplication/octet-stream; name=v16-0009-Add-TAP-test-for-streaming-vs.-DDL.patchDownload
From e71bd7c12db18f2caf2885e9279374d79983ef55 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v16 09/12] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v16-0006-Add-support-for-streaming-to-built-in-replicatio.patchapplication/octet-stream; name=v16-0006-Add-support-for-streaming-to-built-in-replicatio.patchDownload
From 1b8a05e44a991237ccd0ac109ecc1018cba9c946 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 16 Apr 2020 01:55:22 -0700
Subject: [PATCH v16 06/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we have nowhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    4 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   45 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    3 +
 src/backend/replication/logical/launcher.c         |    1 -
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  140 ++-
 src/backend/replication/logical/worker.c           | 1035 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  318 +++++-
 src/backend/replication/slotfuncs.c                |    6 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2045 insertions(+), 43 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8bead..95b7c24 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..3349cc4 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731..f28482f 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 7f15667..65b6b76 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e..4ef4fd4 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4133,6 +4133,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
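(The four new wait events surface in the wait_event column of pg_stat_activity while the apply worker spools or replays changes, so a query along the lines of `SELECT pid, wait_event FROM pg_stat_activity WHERE wait_event LIKE 'ReorderLogical%';` - a monitoring sketch, not part of the patch - can show whether a worker is stuck reading or writing the spool files.)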
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
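(With the subscription option enabled, the walsender sees the new parameter in the START_REPLICATION command built here; the resulting command looks roughly like

    START_REPLICATION SLOT "mysub" LOGICAL 0/0 (proto_version '2', streaming 'on', publication_names '"mypub"')

where the slot and publication names are illustrative and '2' assumes the streaming-capable protocol version.)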
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e..8156a42 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 497d8a9..dfc681d 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1148,7 +1148,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1193,7 +1193,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = txn->final_lsn;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..5242ac0 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID of the streamed transaction (must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* toplevel and subtransaction IDs (both must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
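To make the new wire framing concrete, here is a minimal client-side sketch (not part of the patch; StreamCommitMsg, parse_stream_commit and read_be64 are hypothetical names) that decodes the STREAM COMMIT ('c') message produced by logicalrep_write_stream_commit above. All integers are in network byte order, as with the other pgoutput messages.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

typedef struct StreamCommitMsg
{
	uint32_t	xid;			/* toplevel transaction ID */
	uint8_t		flags;			/* currently always 0 */
	uint64_t	commit_lsn;		/* LSN of the commit */
	uint64_t	end_lsn;		/* end LSN of the transaction */
	uint64_t	commit_time;	/* commit timestamp */
} StreamCommitMsg;

/* read a big-endian 64-bit value from a byte buffer */
static uint64_t
read_be64(const char *p)
{
	uint32_t	hi, lo;

	memcpy(&hi, p, 4);
	memcpy(&lo, p + 4, 4);
	return ((uint64_t) ntohl(hi) << 32) | ntohl(lo);
}

/* returns 0 on success, -1 on a malformed message */
static int
parse_stream_commit(const char *buf, size_t len, StreamCommitMsg *msg)
{
	uint32_t	xid;

	/* 'c' + int32 xid + int8 flags + three int64 fields = 30 bytes */
	if (len < 30 || buf[0] != 'c')
		return -1;

	memcpy(&xid, buf + 1, 4);
	msg->xid = ntohl(xid);
	msg->flags = (uint8_t) buf[5];
	msg->commit_lsn = read_be64(buf + 6);
	msg->end_lsn = read_be64(buf + 14);
	msg->commit_time = read_be64(buf + 22);

	return (msg->flags == 0) ? 0 : -1;
}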
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a12..05e7954 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, the apply logic has to handle
+ * aborts of both the toplevel transaction and of individual subtransactions.
+ * This is achieved by tracking per-subtransaction offsets, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.
+ * This is necessary so that different workers processing a remote transaction
+ * with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId	xid;			/* XID of the subxact */
+	off_t			offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -553,6 +658,321 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the subxact info collected
+	 * so far for this transaction.
+	 *
+	 * XXX Note that cleanup of stale files is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting one of the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+			return;
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));	
+
+	/*
+	 * Update origin state so we can restart streaming from correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -565,6 +985,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1003,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1042,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1160,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1305,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1678,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1819,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1478,6 +1932,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1493,6 +1963,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2414,567 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and a CRC32C checksum of the
+ * information is included.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));	
+
+	/*
+	 * But we do free the memory allocated for the subxact info. There might
+	 * be one exceptional transaction with many subxacts, and we don't want
+	 * to keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as in the previous call,
+	 * so in that case we can simply return without searching the array.
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Remove the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * handling the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with a length (not including
+ * the length field itself), an action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3140,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
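As a concrete illustration of the spool-file format used by stream_write_change() and replayed in apply_handle_stream_commit() above, here is a small standalone sketch (not part of the patch; count_spooled_changes is a hypothetical helper, stdio-based, and assuming the worker's native-endian int length prefix) that walks a ".changes" file and counts the serialized records.

#include <stdio.h>

/*
 * Walk a ".changes" spool file and count the serialized records. Each
 * record is a native-endian int length prefix covering the action byte
 * and payload (but not the prefix itself), exactly as written by
 * stream_write_change().
 */
static int
count_spooled_changes(const char *path)
{
	FILE	   *f = fopen(path, "rb");
	int			len;
	int			nchanges = 0;

	if (!f)
		return -1;

	while (fread(&len, sizeof(len), 1, f) == 1)
	{
		/* len must cover at least the action byte */
		if (len <= 0 || fseek(f, len, SEEK_CUR) != 0)
			break;
		nchanges++;
	}

	fclose(f);
	return nchanges;
}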
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 77b85fc..811706a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,54 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order the transactions are sent in. So streamed transactions are
+ * handled separately, by tracking the XIDs of streamed transactions for
+ * which the schema was already sent (the streamed_txns list below).
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +119,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +145,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +208,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +237,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +260,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +281,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +370,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those are applied only later (at commit time, and
+	 * regular transactions won't see their effects until then), possibly in
+	 * an order we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to re-send the schema after each catalog change,
+		 * and such a change may occur after streaming has already started,
+		 * so we have to track new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +427,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +469,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +490,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +522,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +542,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +566,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +586,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +611,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +643,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -588,6 +724,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're now streaming a chunk of a transaction */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -624,6 +845,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema for this relation was already sent in the given
+ * streamed transaction. We expect a relatively small number of streamed
+ * transactions, so a simple linear search of the list is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -750,12 +999,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from each relation's list of streamed transactions. If the
+ * transaction aborted, the subscriber will simply throw away the schema
+ * records we streamed, so we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -790,7 +1072,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index f776de3..9121420 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -156,6 +156,12 @@ create_logical_replication_slot(char *name, char *plugin,
 									NULL);
 
 	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
+	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0e93322..eacea12 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1004,6 +1004,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9..3b3e1fd 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -981,7 +981,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561..89158ed 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f1aa6e9..70d39f8 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
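+# 2 initial rows + 4998 inserted (ids 3..5000) = 5000 rows; the DELETE
+# removes the 1666 rows with a % 3 = 0, leaving 3334.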
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
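+# Ids 1..2500 get inserted in total, and every DELETE removes all rows
+# with a % 3 = 0, so 2500 - 833 = 1667 rows survive.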
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of a large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
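+# count(c): ids 4..2002 were inserted after ADD COLUMN c (1999 rows);
+# count(d): ids 1001..2002 after ADD COLUMN d (1002 rows); count(e):
+# only id 2002 was inserted after ADD COLUMN e.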
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check data replicated correctly with interleaved DDL');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of a large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
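+# Only inserts outside the rolled-back savepoints survive: 2 initial
+# rows + ids 3..500 + ids 2501..3000 = 1000 rows, none of which set c.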
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check changes from rolled-back subtransactions are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
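+# ROLLBACK TO s1 discards everything from s1 onwards, but ADD COLUMN c
+# (done before s1) survives: ids 3..500 have c NULL and the re-inserted
+# ids 501..1000 have c = i, giving 1000 rows with count(c) = 500.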
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check streamed transaction with aborted subtransactions and DDL replicated correctly');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1
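
For anyone wanting to exercise the streaming path end to end, a minimal
recipe looks roughly like this (not part of the patches; object names
are illustrative, and "streaming = on" is the subscription option added
later in this series):

    -- publisher: keep the limit low so streaming kicks in quickly
    ALTER SYSTEM SET logical_decoding_work_mem = '64kB';
    SELECT pg_reload_conf();
    CREATE PUBLICATION big_pub FOR TABLE big_tab;

    -- subscriber
    CREATE SUBSCRIPTION big_sub
        CONNECTION 'host=... dbname=postgres'
        PUBLICATION big_pub
        WITH (streaming = on);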

Attachment: v16-0007-Track-statistics-for-streaming.patch (application/octet-stream)
From 06ed149a64907c0814b305936cca83b66ae89ff1 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 2 Apr 2020 13:19:29 +0530
Subject: [PATCH v16 07/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6562cc4..d8bf587 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2063,6 +2063,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber
+      after the memory used by logical decoding exceeds
+      <literal>logical_decoding_work_mem</literal>. Streaming only works with
+      toplevel transactions (subtransactions can't be streamed
+      independently), so the counter does not get incremented for
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the subscriber.
+      Transactions may get streamed repeatedly, and this counter gets incremented
+      on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d406ea8..65d650d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1868ab2..e207f90 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3287,6 +3291,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count the transaction again if it was already streamed before. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index eacea12..d0028d9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1333,7 +1333,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1354,7 +1354,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that were spilled to disk or
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2396,6 +2397,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3237,7 +3241,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3295,6 +3299,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3320,6 +3327,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3422,6 +3432,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* statistics about streamed over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3670,11 +3685,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4bce3ad..9fb1ffe 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6d65986..603f325 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -517,15 +517,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec..b997d17 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac31840..68e2deb 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
1.8.3.1
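
With 0007 applied, the new counters show up in pg_stat_replication next
to the existing spill_* columns. A simple monitoring query on the
publisher (just a sketch, not part of the patch):

    SELECT application_name,
           spill_txns, spill_count, spill_bytes,
           stream_txns, stream_count, stream_bytes
    FROM pg_stat_replication;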

Attachment: v16-0008-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From 38e1c3a16b6e3b0be12c3e7a0ec723f0823e8bb6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v16 08/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71..086d0c7 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e..ad3ed13 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1
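
To illustrate what the test changes above exercise: a subscription
opts into streaming of in-progress transactions through the new
subscription option. A minimal sketch (hypothetical connection
string and publication name):

    CREATE SUBSCRIPTION sub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION pub
        WITH (streaming = on);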

Attachment: v16-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
From 1dc90a93222e4cef0f648f63d5d0efc8888f9c4f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 28 Feb 2020 11:07:46 +0530
Subject: [PATCH v16 10/12] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c                |   3 +
 src/backend/replication/logical/decode.c        |  17 ++-
 src/backend/replication/logical/reorderbuffer.c | 182 +++++++++++++++---------
 src/include/access/heapam_xlog.h                |   1 +
 src/include/replication/reorderbuffer.h         |  24 +++-
 5 files changed, 147 insertions(+), 80 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 84884a4..18b0bad 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1978,6 +1978,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45..c841687 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e207f90..eed9a50 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -654,11 +654,14 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
-	ReorderBufferTXN *txn;
+	ReorderBufferTXN *txn, *toptxn;
+	bool	can_stream = false;
 
-	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	toptxn = txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
 
 	change->lsn = lsn;
 	change->txn = txn;
@@ -668,9 +671,50 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert, set the corresponding bit.  Otherwise, if
+	 * the toast insert bit is set and this is an insert/update, clear the
+	 * bit.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) &&
+			 ((change->action == REORDER_BUFFER_CHANGE_INSERT) ||
+			 (change->action == REORDER_BUFFER_CHANGE_UPDATE)))
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+		can_stream = true;
+	}
+
+	/*
+	 * If this is a speculative insert, set the corresponding bit.
+	 * Otherwise, if the speculative insert bit is set and this is a spec
+	 * confirm record, clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+		can_stream = true;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/*
+	 * If streaming is enabled and this transaction was serialized because it
+	 * had an incomplete tuple, then now that we have the complete tuple we
+	 * can stream it.
+	 */
+	if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+		&& !rbtxn_has_toast_insert(txn) && !rbtxn_has_spec_insert(txn))
+	{
+		ReorderBufferStreamTXN(rb, toptxn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
+	}
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -700,7 +744,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1861,8 +1905,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
+							ReorderBufferToastAppendChunk(rb, txn, relation,
+														  change);
 					}
 
 			change_done:
@@ -2461,7 +2505,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2510,7 +2554,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2533,6 +2577,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2547,8 +2592,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2556,12 +2606,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2622,7 +2680,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2809,15 +2867,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and doesn't have incomplete
+		 * data, remember it.
+		 */
+		if (((!largest) || (txn->total_size > largest->total_size)) &&
+			((txn->size > 0) && !rbtxn_has_toast_insert(txn) &&
+			!rbtxn_has_spec_insert(txn)))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2835,66 +2894,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/* Loop until we are back under the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
-		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
-		 */
-		txn = ReorderBufferLargestTXN(rb);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferSerializeTXN(rb, txn);
+		/*
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
+		 */
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 603f325..ba2ab71 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert, without the main-table insert. */
+#define rbtxn_has_toast_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)
+
+/*
+ * This transaction's changes have a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -355,6 +364,9 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -545,7 +557,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1
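
The scenario this bugfix handles can be sketched as follows
(hypothetical table t with a text column). A single INSERT of a
large value is WAL-logged as a series of inserts into the TOAST
relation followed by the insert into the main table, and the
transaction must not be streamed in between; otherwise the
downstream would see toast chunks without the owning row:

    BEGIN;
    -- logged as N toast-chunk inserts plus one main-table insert;
    -- streaming has to wait until the main-table change arrives
    INSERT INTO t VALUES (1, repeat('x', 1000000));
    COMMIT;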

Attachment: v16-0012-Add-streaming-option-in-pg_dump.patch
From 3c53b86530a8c44d43579364b510114ee35cb200 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v16 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 5db4f57..11db7b7 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4210,6 +4210,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4244,8 +4245,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4258,6 +4259,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4274,6 +4276,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4351,6 +4354,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBufferStr(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 61c909e..5c5b072 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
1.8.3.1
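
With this patch a subscription created with streaming enabled
round-trips through pg_dump. Under the usual pg_dump conventions
for subscriptions, the dumped definition would look roughly like
this (hypothetical names):

    CREATE SUBSCRIPTION sub CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION pub WITH (connect = false, slot_name = 'sub', streaming = on);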

Attachment: v16-0011-Provide-new-api-to-get-the-streaming-changes.patch
From af543b5a247e785cb8f4439fc89f979c2b5ec7b2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 22 Apr 2020 16:33:07 +0530
Subject: [PATCH v16 11/12] Provide new api to get the streaming changes

---
 src/backend/catalog/system_views.sql           |  8 ++++++++
 src/backend/replication/logical/logicalfuncs.c | 23 ++++++++++++++++++-----
 src/include/catalog/pg_proc.dat                |  9 +++++++++
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 65d650d..d9ab14b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1242,6 +1242,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index f5384f1..7561141 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -237,6 +238,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 									LogicalOutputPrepareWrite,
 									LogicalOutputWrite, NULL);
 
+		/* If the caller has not asked for streaming changes then disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -347,7 +351,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -356,7 +369,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -365,7 +378,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -374,7 +387,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9fb1ffe..3dfc5c1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10117,6 +10117,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
1.8.3.1
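
Usage mirrors pg_logical_slot_get_changes(), except that changes of
large in-progress transactions may be returned as streamed blocks.
A minimal sketch, assuming a slot created with the test_decoding
plugin and a transaction large enough to exceed
logical_decoding_work_mem:

    SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');
    -- ... run a large transaction on some published table ...
    SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);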

Attachment: v16-0002-Issue-individual-invalidations-with-wal_level-lo.patch
From 36577ba865b02113645c76c92b214b606a634727 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v16 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager.
See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in memory
and writes them out only once, at commit time, which reduces the
performance impact by amortizing the overhead and deduplicating
the invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c          |  40 +++++++++
 src/backend/access/transam/xact.c               |   7 ++
 src/backend/replication/logical/decode.c        |  16 ++++
 src/backend/replication/logical/reorderbuffer.c | 104 +++++++++++++++++++++---
 src/backend/utils/cache/inval.c                 |  49 +++++++++++
 src/include/access/xact.h                       |  13 ++-
 src/include/replication/reorderbuffer.h         |  11 +++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index fbc5942..17c06f7 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c2604bb..8e6b1a6 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX We ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581..69c1f45 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9..0d5bb73 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2204,6 +2218,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2591,6 +2632,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3004,6 +3068,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context,
+										   inval_size);
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	oldsnap;
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..cba5b6c 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support decoding of in-progress transactions.  Previously it was enough to
+ *	log invalidations only at commit, because we only decoded the transaction
+ *	at commit time.  We only need to log the catalog cache and relcache
+ *	invalidations; there cannot be any active MVCC scan in logical decoding,
+ *	so we don't need to log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations()
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs  = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b38..b822c5e 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..af35287 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1
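
To illustrate why command-end invalidations are needed, consider a
transaction mixing DDL and DML (hypothetical table t). When the
transaction is decoded incrementally, decoding the INSERT requires
the relcache entry for t to already reflect the earlier ALTER in
the same, still uncommitted, transaction; that is exactly what
replaying the XLOG_XACT_INVALIDATIONS records provides:

    BEGIN;
    ALTER TABLE t ADD COLUMN extra int;  -- invalidations logged at command end
    INSERT INTO t VALUES (1, 2);         -- decoded with the updated relcache
    COMMIT;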

Attachment: v16-0001-Immediately-WAL-log-assignments.patch
From e856c586bb131e1047471ecfddc6b0f118f1b0f3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v16 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is
required to avoid overflowing the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 ++++++++++++++-------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3984dd3..c2604bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 4259309..3c49954 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976..7b5257f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1189,6 +1189,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1227,6 +1228,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..122c581 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel xid is valid, we need to assign the subxact to the
+	 * toplevel xact. We need to do this for all records, hence we do it before
+	 * the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f60ed2d..6d439d0 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -229,6 +229,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196..756f6df 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -150,6 +150,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -285,6 +287,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1
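
A sketch of the case this addresses (hypothetical table t): the
first WAL record written by a subtransaction now carries the
toplevel XID, so the decoder can associate the change with the
right transaction immediately, instead of waiting for the next
XLOG_XACT_ASSIGNMENT record:

    BEGIN;                        -- toplevel transaction
    SAVEPOINT s1;                 -- subxact; gets its own XID on first write
    INSERT INTO t VALUES (1);     -- this record includes the toplevel XID
    RELEASE SAVEPOINT s1;
    COMMIT;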

Attachment: v16-0003-Extend-the-output-plugin-API-with-stream-methods.patch
From a65eda69a3bd6810223ea321d24f8080451e3351 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v16 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..64f651f 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe..65244b1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to the spill-to-disk behavior, streaming is triggered when the
+    total amount of changes decoded from the WAL (for all in-progress
+    transactions) exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting. At that point the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253..497d8a9 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require change/commit/abort callbacks. The
+	 * message callback is optional, similarly to regular output plugins. We
+	 * however enable streaming when at least one of the methods is defined,
+	 * so that we can easily identify missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -862,6 +910,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f..f24e246 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..0d0a94a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287..e102840 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -393,6 +439,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
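
To make the streaming trigger described in the documentation above concrete:
a minimal sketch (assuming the per-transaction size accounting added in part
0001; an illustration, not the patch's actual code) of picking the largest
toplevel transaction once logical_decoding_work_mem is exceeded:

static ReorderBufferTXN *
ReorderBufferLargestTopTXN(ReorderBuffer *rb)
{
	dlist_iter	iter;
	ReorderBufferTXN *largest = NULL;

	/* walk the toplevel transactions and remember the biggest one */
	dlist_foreach(iter, &rb->toplevel_by_lsn)
	{
		ReorderBufferTXN *txn;

		txn = dlist_container(ReorderBufferTXN, node, iter.cur);

		/* txn->size is the accounting added by the 0001 patch */
		if (largest == NULL || txn->size > largest->size)
			largest = txn;
	}

	return largest;
}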

Attachment: v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch (application/octet-stream)
From 1ef5f6fbc9525a24f83e43cdeea9f46a13848cbd Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v16 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from the system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such an sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 41 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 38 +++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  8 ++---
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++++--
 src/include/utils/snapmgr.h                     |  4 ++-
 6 files changed, 113 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 65244b1..b59a6c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,7 +432,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>pg_current_xact_id()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0d4ed60..84884a4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,15 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_base.rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
 	bool		valid;
 
 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+
+	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -1497,6 +1514,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_hot_search_buffer call during logical decoding");
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -1646,6 +1671,14 @@ heap_get_latest_tid(TableScanDesc sscan,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_get_latest_tid call during logical decoding");
+
+	/*
 	 * Loop to chase down t_ctid links.  At top of loop, ctid is the tuple we
 	 * need to examine, and *tid is the TID we will return if ctid turns out
 	 * to be bogus.
@@ -5451,6 +5484,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
 
+	/*
+	 * We don't expect direct calls to heap_finish_speculative with
+	 * valid CheckXidAlive for regular tables. Check that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_finish_speculative call during logical decoding");
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..97a1075 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -433,6 +434,25 @@ systable_beginscan(Relation heapRelation,
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
+ * out.  We can't directly use TransactionIdDidAbort, because after a crash
+ * such a transaction might not have been marked as aborted.  See the
+ * detailed comments in snapmgr.c, where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +501,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +543,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -643,6 +675,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0d5bb73..2302875 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -696,7 +696,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1547,7 +1547,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1798,7 +1798,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1818,7 +1818,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					}
 
 					break;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c5..93a0c04 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,13 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2029,10 +2036,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This is to re-check XID status while accessing catalog.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2041,8 +2052,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * if the xid aborted; that will happen during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2052,6 +2072,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13c..12f737b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,8 +145,10 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
1.8.3.1
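
To show how the decoding side is expected to consume this: a minimal sketch
(the surrounding structure is an assumption for illustration, not the patch's
actual code) of catching the ERRCODE_TRANSACTION_ROLLBACK raised via
HandleConcurrentAbort() and stopping decoding gracefully instead of
propagating the error:

	MemoryContext oldcxt = CurrentMemoryContext;

	PG_TRY();
	{
		/* ... decode and stream changes; the output plugin may perform
		 * systable_* catalog lookups here ... */
	}
	PG_CATCH();
	{
		ErrorData  *errdata;

		/* CopyErrorData() must not run in the error context */
		MemoryContextSwitchTo(oldcxt);
		errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/* concurrent abort of the decoded transaction: clean up
			 * and swallow the error instead of re-throwing */
			FlushErrorState();
			FreeErrorData(errdata);
		}
		else
		{
			FreeErrorData(errdata);
			PG_RE_THROW();
		}
	}
	PG_END_TRY();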

#276Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#275)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have also fixed a couple of bugs internally reported by my colleague
Neha Sharma.

I think it would be good if you could briefly explain what the bugs
were and how you fixed them.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#277Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#276)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Apr 27, 2020 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have also fixed a couple of bugs internally reported by my colleague
Neha Sharma.

I think it would be good if you could briefly explain what the bugs
were and how you fixed them.

Issue1: If the concurrent transaction was aborted, then in the CATCH
block we were not freeing the memory of the toast_hash, and that was
triggering the assert that txn->size is 0 after the stream is
complete.

Issue2: After streaming is complete we set txn->final_lsn and remember
it in a local variable. But mistakenly the variable was declared
inside the TRY block, so whenever there is a concurrent abort the
value seen in the CATCH block (and afterwards) is always zero. So
after streaming the final_lsn was becoming 0 and that was asserting.
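
For clarity, the shape of that fix looks roughly like this (a sketch with
illustrative names, not the actual patch code):

	/* must be declared outside PG_TRY, and volatile because it is
	 * modified inside the TRY block and read after a longjmp */
	volatile XLogRecPtr final_lsn = InvalidXLogRecPtr;

	PG_TRY();
	{
		/* while streaming, remember how far we got; the buggy version
		 * declared final_lsn inside these braces, shadowing the outer
		 * variable, which therefore stayed 0 */
		final_lsn = change->lsn;
	}
	PG_CATCH();
	{
		/* a concurrent abort lands here; txn->final_lsn must be set
		 * from the remembered value, which is only visible with the
		 * outer declaration */
		txn->final_lsn = final_lsn;
		FlushErrorState();
	}
	PG_END_TRY();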

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#278Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#275)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

[latest patches]

v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
-     Any actions leading to transaction ID assignment are prohibited.
That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the
<literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment
are prohibited. That, among others,
..
@@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
  bool valid;
  /*
+ * We don't expect direct calls to heap_fetch with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ elog(ERROR, "unexpected heap_fetch call during logical decoding");
+

I think comments and code don't match. In the comment, we are saying
that via output plugins access to user catalog tables or regular
system catalog tables won't be allowed via heap_* APIs but code
doesn't seem to reflect it. I feel only
TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the
original discussion about this point [1] (Refer "I think it'd also be
good to add assertions to codepaths not going through systable_*
asserting that ...").

Isn't it better to block the scan to user catalog tables or regular
system catalog tables for tableam scan APIs rather than at the heap
level? There might be some APIs like heap_getnext where such a check
might still be required but I guess it is still better to block at
tableam level.

[1]: /messages/by-id/20180726200241.aje4dv4jsv25v4k2@alap3.anarazel.de
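
For illustration, a rough sketch of what that tableam-level check could look
like (the wrapper signature matches src/include/access/tableam.h; the check
placement is an assumption for discussion, not settled code):

	static inline bool
	table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction,
						   TupleTableSlot *slot)
	{
		/*
		 * While decoding an uncommitted transaction (CheckXidAlive is
		 * valid), only catalog-style relations should be scanned; anything
		 * else means the output plugin bypassed the systable_* APIs.
		 */
		if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
					 !(IsCatalogRelation(sscan->rs_rd) ||
					   RelationIsUsedAsCatalogTable(sscan->rs_rd))))
			elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");

		return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
	}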

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#279Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#278)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

[latest patches]

v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
-     Any actions leading to transaction ID assignment are prohibited.
That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the
<literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment
are prohibited. That, among others,
..
@@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
bool valid;
/*
+ * We don't expect direct calls to heap_fetch with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ elog(ERROR, "unexpected heap_fetch call during logical decoding");
+

I think comments and code don't match. In the comment, we are saying
that via output plugins access to user catalog tables or regular
system catalog tables won't be allowed via heap_* APIs but code
doesn't seem to reflect it. I feel only
TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the
original discussion about this point [1] (Refer "I think it'd also be
good to add assertions to codepaths not going through systable_*
asserting that ...").

Right. So I think we can just add an assert in these functions:
Assert(!TransactionIdIsValid(CheckXidAlive))?

Isn't it better to block the scan to user catalog tables or regular
system catalog tables for tableam scan APIs rather than at the heap
level? There might be some APIs like heap_getnext where such a check
might still be required but I guess it is still better to block at
tableam level.

[1] - /messages/by-id/20180726200241.aje4dv4jsv25v4k2@alap3.anarazel.de

Okay, let me analyze this part. Because in some places we have to keep
the check at the heap level, like heap_getnext, and in other places at
the tableam level, it seems a bit inconsistent. Also, I think the
number of checks might increase, because some of the heap functions,
like heap_hot_search_buffer, are called from multiple tableam calls,
so we would need to put the check in every place.

Another point is that I feel some of the checks we have today might
not be required. For example, heap_finish_speculative is not fetching
any tuple for us, so why do we need to care about this function?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#280Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#279)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

[latest patches]

v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
-     Any actions leading to transaction ID assignment are prohibited.
That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the
<literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment
are prohibited. That, among others,
..
@@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
bool valid;
/*
+ * We don't expect direct calls to heap_fetch with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ elog(ERROR, "unexpected heap_fetch call during logical decoding");
+

I think comments and code don't match. In the comment, we are saying
that via output plugins access to user catalog tables or regular
system catalog tables won't be allowed via heap_* APIs but code
doesn't seem to reflect it. I feel only
TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the
original discussion about this point [1] (Refer "I think it'd also be
good to add assertions to codepaths not going through systable_*
asserting that ...").

Right. So I think we can just add an assert in these functions:
Assert(!TransactionIdIsValid(CheckXidAlive))?

I am fine with an Assertion, but update the documentation accordingly.
However, I think you should cross-verify once whether there are any
output plugins that are already using such APIs. There is a list of
"Logical Decoding Plugins" on the wiki [1]; just look into those once.

Isn't it better to block the scan to user catalog tables or regular
system catalog tables for tableam scan APIs rather than at the heap
level? There might be some APIs like heap_getnext where such a check
might still be required but I guess it is still better to block at
tableam level.

[1] - /messages/by-id/20180726200241.aje4dv4jsv25v4k2@alap3.anarazel.de

Okay, let me analyze this part. Because in some places we have to keep
the check at the heap level, like heap_getnext, and in other places at
the tableam level, it seems a bit inconsistent. Also, I think the
number of checks might increase, because some of the heap functions,
like heap_hot_search_buffer, are called from multiple tableam calls,
so we would need to put the check in every place.

Another point is that I feel some of the checks we have today might
not be required. For example, heap_finish_speculative is not fetching
any tuple for us, so why do we need to care about this function?

Yeah, I don't see the need for such a check (or Assertion) in
heap_finish_speculative.

One additional comment:
---------------------------------------
-     Any actions leading to transaction ID assignment are prohibited.
That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the
<literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment
are prohibited. That, among others,

The above text doesn't seem to be aligned properly, and you will need
to update it if we change the error to an Assertion for the heap APIs.

[1]: https://wiki.postgresql.org/wiki/Logical_Decoding_Plugins

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#281Mahendra Singh Thalor
mahi6run@gmail.com
In reply to: Dilip Kumar (#274)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote:

On 2020-04-23 05:24, Dilip Kumar wrote:

On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:

The 'ddl' one is apparently not quite fixed - I get this in (cd
contrib; make check)' (in both assert-enabled and non-assert-enabled
build)

Can you send me the contrib/test_decoding/regression.diffs file?

Attached.

So from regression.diffs, it appears that it is failing in memory
allocation (+ERROR: invalid memory alloc request size
94119198201896). My colleague tried to reproduce this in a different
environment, but with no success so far. One more thing that
surprises me is that after
(v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
it should actually never take the streaming path. However, we cannot
ignore the fact that some of the changes might impact the
non-streaming path as well. Is it possible for you to somehow stop or
break the code and send the stack trace? One idea: from the log we
can see where the error is raised, i.e. in MemoryContextAlloc or
palloc or some other similar function. Once we know that, we can
convert that error to an assert and find the call stack.

--

Thanks Erik for reporting this issue.

I am able to reproduce this issue (+ERROR: invalid memory alloc
request size) on top of the v16 patch set. I applied all patches (12
patches) of the v16 series and then ran "make check -i" from the
"contrib/test_decoding" folder. Below is the stack trace of the error:

#0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70,
size=94605581787992) at mcxt.c:806
#1 0x0000560b130f0ad5 in ReorderBufferRestoreChange
(rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at
reorderbuffer.c:3680
#2 0x0000560b130f0662 in ReorderBufferRestoreChanges
(rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10,
segno=0x560b1418ad20) at reorderbuffer.c:3564
#3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90,
txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186
#4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90,
txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8,
command_id=0, streaming=false)
at reorderbuffer.c:1785
#5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90,
xid=508, commit_lsn=25986584, end_lsn=25989088,
commit_time=641449268431600, origin_id=0, origin_lsn=0)
at reorderbuffer.c:2315
#6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80,
buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654
#7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80,
buf=0x7ffef18b19b0) at decode.c:261
#8 0x0000560b130cf99a in LogicalDecodingProcessRecord
(ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130
#9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts
(fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false)
at logicalfuncs.c:285
#10 0x0000560b130dbe71 in pg_logical_slot_get_changes
(fcinfo=0x560b1417ee50) at logicalfuncs.c:354
#11 0x0000560b12e294d4 in ExecMakeTableFunctionResult
(setexpr=0x560b14177838, econtext=0x560b14177748,
argContext=0x560b1417ed30, expectedDesc=0x560b141814a0,
randomAccess=false) at execSRF.c:234
#12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at
nodeFunctionscan.c:94
#13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630,
accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
<FunctionRecheck>) at execScan.c:133
#14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630,
accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
<FunctionRecheck>) at execScan.c:199
#15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at
nodeFunctionscan.c:270
#16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at
execProcnode.c:450
#17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at
../../../src/include/executor/executor.h:245
#18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40)
at nodeAgg.c:566
#19 0x0000560b12e4398f in agg_fill_hash_table
(aggstate=0x560b14176f40) at nodeAgg.c:2518
#20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139
#21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at
execProcnode.c:450
#22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at
../../../src/include/executor/executor.h:245
#23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108
#24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at
execProcnode.c:450
#25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at
../../../src/include/executor/executor.h:245
#26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0,
planstate=0x560b14176d28, use_parallel_mode=false,
operation=CMD_SELECT, sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x560b1419d188,
execute_once=true) at execMain.c:1646
#27 0x0000560b12e11a19 in standard_ExecutorRun
(queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0,
execute_once=true) at execMain.c:364
#28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:308
#29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860,
forward=true, count=0, dest=0x560b1419d188) at pquery.c:912
#30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x560b1419d188, altdest=0x560b1419d188,
qc=0x7ffef18b2350) at pquery.c:756
#31 0x0000560b131e550b in exec_simple_query (
query_string=0x560b14076720 "/* display results, but hide most of the
output */\nSELECT count(*), min(data), max(data)\nFROM
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at
postgres.c:1239
#32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0,
dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830
"mahendrathalor") at postgres.c:4315
#33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510
#34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at
postmaster.c:4202
#35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727
#36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010)
at postmaster.c:1400
#37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210
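
Reading the trace, frame #1 shows ReorderBufferRestoreChange() handing
a size read back from the spill file straight to the allocator. In
simplified form, the restore path for a logical message does something
like this (abbreviated from reorderbuffer.c), which is why a corrupted
or mis-serialized length field surfaces as "invalid memory alloc
request size" rather than anything more descriptive:

    Size    prefix_size;

    /* the length was serialized into the spill file and is trusted as-is;
     * garbage here (or a misaligned read) is caught only by the
     * allocator's sanity check on the request size */
    memcpy(&prefix_size, data, sizeof(Size));
    data += sizeof(Size);
    change->data.msg.prefix = MemoryContextAlloc(rb->context, prefix_size);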

I have an Ubuntu setup. I think this reproduces on Ubuntu only. I am
looking into this issue with Dilip.

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Attachments:

regression.diffs (application/octet-stream)
diff -U3 /home/mahendrathalor/PG_TESTING/testPG/postgres/contrib/test_decoding/expected/ddl.out /home/mahendrathalor/PG_TESTING/testPG/postgres/contrib/test_decoding/results/ddl.out
--- /home/mahendrathalor/PG_TESTING/testPG/postgres/contrib/test_decoding/expected/ddl.out	2020-04-27 20:39:35.265395031 -0700
+++ /home/mahendrathalor/PG_TESTING/testPG/postgres/contrib/test_decoding/results/ddl.out	2020-04-28 20:32:02.435335508 -0700
@@ -245,15 +245,7 @@
 FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1')
 GROUP BY substring(data, 1, 24)
 ORDER BY 1,2;
- count |                                  min                                  |                                  max                                   
--------+-----------------------------------------------------------------------+------------------------------------------------------------------------
-     1 | BEGIN                                                                 | BEGIN
-     1 | COMMIT                                                                | COMMIT
-     1 | message: transactional: 1 prefix: test, sz: 14 content:tx logical msg | message: transactional: 1 prefix: test, sz: 14 content:tx logical msg
-     1 | table public.tr_oddlength: INSERT: id[text]:'ab' data[text]:'foo'     | table public.tr_oddlength: INSERT: id[text]:'ab' data[text]:'foo'
- 20467 | table public.tr_etoomuch: DELETE: id[integer]:1                       | table public.tr_etoomuch: UPDATE: id[integer]:9999 data[integer]:-9999
-(5 rows)
-
+ERROR:  invalid memory alloc request size 94222912546760
 -- check updates of primary keys work correctly
 BEGIN;
 CREATE TABLE spoolme AS SELECT g.i FROM generate_series(1, 5000) g(i);
@@ -266,13 +258,7 @@
 SELECT data
 FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1')
 WHERE data ~ 'UPDATE';
-                                                    data                                                     
--------------------------------------------------------------------------------------------------------------
- table public.tr_etoomuch: UPDATE: old-key: id[integer]:5000 new-tuple: id[integer]:-5000 data[integer]:5000
- table public.tr_oddlength: UPDATE: old-key: id[text]:'ab' new-tuple: id[text]:'x' data[text]:'quux'
- table public.tr_oddlength: UPDATE: old-key: id[text]:'x' new-tuple: id[text]:'yy' data[text]:'a'
-(3 rows)
-
+ERROR:  invalid memory alloc request size 94222912505736
 -- check that a large, spooled, upsert works
 INSERT INTO tr_etoomuch (id, data)
 SELECT g.i, -g.i FROM generate_series(8000, 12000) g(i)
@@ -281,14 +267,7 @@
 FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1') WITH ORDINALITY
 GROUP BY 1
 ORDER BY min(ordinality);
-           substring           | count 
--------------------------------+-------
- BEGIN                         |     1
- table public.tr_etoomuch: UPD |  2235
- table public.tr_etoomuch: INS |  1766
- COMMIT                        |     1
-(4 rows)
-
+ERROR:  invalid memory alloc request size 94222912546760
 /*
  * check whether we decode subtransactions correctly in relation with each
  * other
@@ -310,18 +289,7 @@
 RELEASE SAVEPOINT b;
 COMMIT;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-                                 data                                 
-----------------------------------------------------------------------
- BEGIN
- table public.tr_sub: INSERT: id[integer]:1 path[text]:'1-top-#1'
- table public.tr_sub: INSERT: id[integer]:2 path[text]:'1-top-1-#1'
- table public.tr_sub: INSERT: id[integer]:3 path[text]:'1-top-1-#2'
- table public.tr_sub: INSERT: id[integer]:4 path[text]:'1-top-2-1-#1'
- table public.tr_sub: INSERT: id[integer]:5 path[text]:'1-top-2-1-#2'
- table public.tr_sub: INSERT: id[integer]:6 path[text]:'1-top-2-#1'
- COMMIT
-(8 rows)
-
+ERROR:  invalid memory alloc request size 94222912497528
 -- check that we handle xlog assignments correctly
 BEGIN;
 -- nest 80 subtxns
@@ -349,16 +317,7 @@
 INSERT INTO tr_sub(path) VALUES ('2-top-#1');
 COMMIT;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-                                  data                                  
-------------------------------------------------------------------------
- BEGIN
- table public.tr_sub: INSERT: id[integer]:7 path[text]:'2-top-1...--#1'
- table public.tr_sub: INSERT: id[integer]:8 path[text]:'2-top-1...--#2'
- table public.tr_sub: INSERT: id[integer]:9 path[text]:'2-top-1...--#3'
- table public.tr_sub: INSERT: id[integer]:10 path[text]:'2-top-#1'
- COMMIT
-(6 rows)
-
+ERROR:  invalid memory alloc request size 94222912497528
 -- make sure rollbacked subtransactions aren't decoded
 BEGIN;
 INSERT INTO tr_sub(path) VALUES ('3-top-2-#1');
@@ -370,15 +329,7 @@
 INSERT INTO tr_sub(path) VALUES ('3-top-2-#2');
 COMMIT;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-                                 data                                  
------------------------------------------------------------------------
- BEGIN
- table public.tr_sub: INSERT: id[integer]:11 path[text]:'3-top-2-#1'
- table public.tr_sub: INSERT: id[integer]:12 path[text]:'3-top-2-1-#1'
- table public.tr_sub: INSERT: id[integer]:14 path[text]:'3-top-2-#2'
- COMMIT
-(5 rows)
-
+ERROR:  invalid memory alloc request size 94222912497528
 -- test whether a known, but not yet logged toplevel xact, followed by a
 -- subxact commit is handled correctly
 BEGIN;
@@ -399,16 +350,7 @@
 INSERT INTO tr_sub(path) VALUES ('5-top-1-#1');
 COMMIT;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-                                data                                 
----------------------------------------------------------------------
- BEGIN
- table public.tr_sub: INSERT: id[integer]:15 path[text]:'4-top-1-#1'
- COMMIT
- BEGIN
- table public.tr_sub: INSERT: id[integer]:16 path[text]:'5-top-1-#1'
- COMMIT
-(6 rows)
-
+ERROR:  invalid memory alloc request size 94222912497528
 -- check that DDL in aborted subtransactions handled correctly
 CREATE TABLE tr_sub_ddl(data int);
 BEGIN;
@@ -420,13 +362,7 @@
 INSERT INTO tr_sub_ddl VALUES(43);
 COMMIT;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-                       data                       
---------------------------------------------------
- BEGIN
- table public.tr_sub_ddl: INSERT: data[bigint]:43
- COMMIT
-(3 rows)
-
+ERROR:  invalid memory alloc request size 94222912595992
 /*
  * Check whether treating a table as a catalog table works somewhat
  */
@@ -497,22 +433,7 @@
 INSERT INTO replication_metadata(relation, options)
 VALUES ('zaphod', NULL);
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-                                                                data                                                                
-------------------------------------------------------------------------------------------------------------------------------------
- BEGIN
- table public.replication_metadata: INSERT: id[integer]:1 relation[name]:'foo' options[text[]]:'{a,b}'
- COMMIT
- BEGIN
- table public.replication_metadata: INSERT: id[integer]:2 relation[name]:'bar' options[text[]]:'{a,b}'
- COMMIT
- BEGIN
- table public.replication_metadata: INSERT: id[integer]:3 relation[name]:'blub' options[text[]]:null
- COMMIT
- BEGIN
- table public.replication_metadata: INSERT: id[integer]:4 relation[name]:'zaphod' options[text[]]:null rewritemeornot[integer]:null
- COMMIT
-(12 rows)
-
+ERROR:  invalid memory alloc request size 94222912374552
 /*
  * check whether we handle updates/deletes correct with & without a pkey
  */
@@ -585,113 +506,7 @@
     SET toasted_col1 = (SELECT string_agg(g.i::text, '') FROM generate_series(1, 2000) g(i))
 WHERE id = 1;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
- [extremely wide expected-output rows for the toasted columns elided]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------
- BEGIN
- table public.table_without_key: INSERT: id[integer]:1 data[integer]:1
- table public.table_without_key: INSERT: id[integer]:2 data[integer]:2
- COMMIT
- BEGIN
- table public.table_without_key: DELETE: (no-tuple-data)
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: id[integer]:-2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: old-key: id[integer]:2 data[integer]:3 new-tuple: id[integer]:-2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: old-key: id[integer]:-2 data[integer]:3 new-tuple: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: old-key: id[integer]:2 data[integer]:3 new-tuple: id[integer]:-2 data[integer]:3 new_column[text]:null
- COMMIT
- BEGIN
- table public.table_without_key: UPDATE: old-key: id[integer]:-2 data[integer]:3 new-tuple: id[integer]:2 data[integer]:3 new_column[text]:'someval'
- COMMIT
- BEGIN
- table public.table_without_key: DELETE: id[integer]:2 data[integer]:3 new_column[text]:'someval'
- COMMIT
- BEGIN
- table public.table_with_pkey: INSERT: id[integer]:1 data[integer]:1
- table public.table_with_pkey: INSERT: id[integer]:2 data[integer]:2
- COMMIT
- BEGIN
- table public.table_with_pkey: DELETE: id[integer]:1
- COMMIT
- BEGIN
- table public.table_with_pkey: UPDATE: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_pkey: UPDATE: old-key: id[integer]:2 new-tuple: id[integer]:-2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_pkey: UPDATE: old-key: id[integer]:-2 new-tuple: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_pkey: UPDATE: old-key: id[integer]:2 new-tuple: id[integer]:-2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_pkey: UPDATE: old-key: id[integer]:-2 new-tuple: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_pkey: DELETE: id[integer]:2
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: INSERT: id[integer]:1 data[integer]:1
- table public.table_with_unique_not_null: INSERT: id[integer]:2 data[integer]:2
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: DELETE: (no-tuple-data)
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: UPDATE: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: UPDATE: id[integer]:-2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: UPDATE: id[integer]:2 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: DELETE: (no-tuple-data)
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: INSERT: id[integer]:3 data[integer]:1
- table public.table_with_unique_not_null: INSERT: id[integer]:4 data[integer]:2
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: DELETE: id[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: UPDATE: id[integer]:4 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: UPDATE: old-key: id[integer]:4 new-tuple: id[integer]:-4 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: UPDATE: old-key: id[integer]:-4 new-tuple: id[integer]:4 data[integer]:3
- COMMIT
- BEGIN
- table public.table_with_unique_not_null: DELETE: id[integer]:4
- COMMIT
- BEGIN
- table public.toasttable: INSERT: id[integer]:1 toasted_col1[text]:'12345678910111213141516171819202122232425... [long run of consecutive integers elided] ...19961997199819992000' rand1[double precision]:79 toasted_col2[text]:null rand2[double precision]:1578
- COMMIT
- BEGIN
- table public.toasttable: INSERT: id[integer]:2 toasted_col1[text]:null rand1[double precision]:3077 toasted_col2[text]:'000100020003000400050006000700080009001000110012... [repeating zero-padded sequence 0001..0500, remainder elided]
601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400
250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130
414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302
030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019
101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900
800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680
469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357
035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024
602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401
350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230
024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412
041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030
103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780079008000810082008300840085008600870088008900900091009200930094009500960097009800990100010101020103010401050106010701080109011001110112011301140115011601170118011901200121012201230124012501260127012801290130013101320133013401350136013701380139014001410142014301440145014601470148014901500151015201530154015501560157015801590160016101620163016401650166016701680169017001710172017301740175017601770178017901800181018201830184018501860187018801890190019101920193019401950196019701980199020002010202020302040205020602070208020902100211021202130214021502160217021802190220022102220223022402250226022702280229023002310232023302340235023602370238023902400241024202430244024502460247024802490250025102520253025402550256025702580259026002610262026302640265026602670268026902700271027202730274027502760277027802790280028102820283028402850286028702880289029002910292029302940295029602970298029903000301030203030304030503060307030803090310031103120313031403150316031703180319032003210322032303240325032603270328032903300331033203330334033503360337033803390340034103420343034403450346034703480349035003510352035303540355035603570358035903600361036203630364036503660367036803690370037103720373037403750376037703780379038003810382038303840385038603870388038903900391039203930394039503960397039803990400040104020403040404050406040704080409041004110412041304140415041604170418041904200421042204230424042504260427042804290430043104320433043404350436043704380439044004410442044304440445044604470448044904500451045204530454045504560457045804590460046104620463046404650466046704680469047004710472047304740475047604770478047904800481048204830484048504860487048804890490049104920493049404950496049704980499050000010002000300040005000600070008000900100011001200130014001500160017001800190020002100220023002400250026002700280029003000310032003300340035003600370038003900400041004200430044004500460047004800490050005100520053005400550056005700580059006000610062006300640065006600670068006900700071007200730074007500760077007800790080008100820083008400850086008700880089009000910092009300940095009600970098009901000101010201030104010501060107010801090110011101120113011401150116011701180119012001210122012301240125012601270128012901300131013201330134013501360137013801390140014101420143014401450146014701480149015001510152015301540155015601570158015901600161016201630164016501660167016801690170017101720173017401750176017701780179018001810182018301840185018601870188018901
900191019201930194019501960197019801990200020102020203020402050206020702080209021002110212021302140215021602170218021902200221022202230224022502260227022802290230023102320233023402350236023702380239024002410242024302440245024602470248024902500251025202530254025502560257025802590260026102620263026402650266026702680269027002710272027302740275027602770278027902800281028202830284028502860287028802890290029102920293029402950296029702980299030003010302030303040305030603070308030903100311031203130314031503160317031803190320032103220323032403250326032703280329033003310332033303340335033603370338033903400341034203430344034503460347034803490350035103520353035403550356035703580359036003610362036303640365036603670368036903700371037203730374037503760377037803790380038103820383038403850386038703880389039003910392039303940395039603970398039904000401040204030404040504060407040804090410041104120413041404150416041704180419042004210422042304240425042604270428042904300431043204330434043504360437043804390440044104420443044404450446044704480449045004510452045304540455045604570458045904600461046204630464046504660467046804690470047104720473047404750476047704780479048004810482048304840485048604870488048904900491049204930494049504960497049804990500000100020003000400050006000700080009001000110012001300140015001600170018001900200021002200230024002500260027002800290030003100320033003400350036003700380039004000410042004300440045004600470048004900500051005200530054005500560057005800590060006100620063006400650066006700680069007000710072007300740075007600770078007900800081008200830084008500860087008800890090009100920093009400950096009700980099010001010102010301040105010601070108010901100111011201130114011501160117011801190120012101220123012401250126012701280129013001310132013301340135013601370138013901400141014201430144014501460147014801490150015101520153015401550156015701580159016001610162016301640165016601670168016901700171017201730174017501760177017801790180018101820183018401850186018701880189019001910192019301940195019601970198019902000201020202030204020502060207020802090210021102120213021402150216021702180219022002210222022302240225022602270228022902300231023202330234023502360237023802390240024102420243024402450246024702480249025002510252025302540255025602570258025902600261026202630264026502660267026802690270027102720273027402750276027702780279028002810282028302840285028602870288028902900291029202930294029502960297029802990300030103020303030403050306030703080309031003110312031303140315031603170318031903200321032203230324032503260327032803290330033103320333033403350336033703380339034003410342034303440345034603470348034903500351035203530354035503560357035803590360036103620363036403650366036703680369037003710372037303740375037603770378037903800381038203830384038503860387038803890390039103920393039403950396039703980399040004010402040304040405040604070408040904100411041204130414041504160417041804190420042104220423042404250426042704280429043004310432043304340435043604370438043904400441044204430444044504460447044804490450045104520453045404550456045704580459046004610462046304640465046604670468046904700471047204730474047504760477047804790480048104820483048404850486048704880489049004910492049304940495049604970498049905000001000200030004000500060007000800090010001100120013001400150016001700180019002000210022002300240025002600270028002900300031003200330034003500360037003800390040004100420043004400450046004700480049005000510052005300540055005600570058005900600061006200630064006500660067006800690070007100720073007400750076007700780
[... several thousand digits of the toasted value elided ...]04990500' rand2[double precision]:4576
- COMMIT
- BEGIN
- table public.toasttable: UPDATE: id[integer]:1 toasted_col1[text]:'123456789101112[... the numbers 13 through 1998 concatenated, elided ...]19992000' rand1[double precision]:79 toasted_col2[text]:null rand2[double precision]:1578
- COMMIT
-(103 rows)
-
+ERROR:  invalid memory alloc request size 94222912472104
 INSERT INTO toasttable(toasted_col1) SELECT string_agg(g.i::text, '') FROM generate_series(1, 2000) g(i);
 -- update of second column, first column unchanged
 UPDATE toasttable
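
FWIW, the failing statements can be replayed outside the regression suite
with something like the sketch below. The CREATE TABLE is reconstructed
from the decoded tuples above rather than copied from the test, so treat
it as an approximation (defaults etc. may differ):

  CREATE TABLE toasttable(
      id serial PRIMARY KEY,
      toasted_col1 text,
      rand1 double precision,
      toasted_col2 text,
      rand2 double precision
  );
  SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
  -- build a value large enough to be toasted (digits of 1..2000 concatenated)
  INSERT INTO toasttable(toasted_col1)
      SELECT string_agg(g.i::text, '') FROM generate_series(1, 2000) g(i);
  SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      'include-xids', '0', 'skip-empty-xacts', '1');
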
@@ -700,22 +515,10 @@
 -- make sure we decode correctly even if the toast table is gone
 DROP TABLE toasttable;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-[... "data" column header, padded to the full column width, elided ...]
-[... separator line of dashes, elided ...]
- BEGIN
- table public.toasttable: INSERT: id[integer]:3 toasted_col1[text]:'123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683684685686687688689690691692693694695696697698699700701702703704705706707708709710711712713714715716717718719720721722723724725726727728729730731732733734735736737738739740741742743744745746747748749750751752753754755756757758759760761762763764765766767768769770771772773774775776777778779780781782783784785786787788789790791792793794795796797798799800801802803804805806807808809810811812813814815816817818819820821822823824825826827828829830831832833834835836837838839840841842843844845846847848849850851852853854855856857858859860861862863864865866867868869870871872873874875876877878879880881882883884885886887888889890891892893894895896897898899900901902903904905906907908909910911912913914915916917918919920921922923924925926927928929930931932933934935936937938939940941942943944945946947948949950951952953954955956957958959960961962963964965966967968969970971972973974975976977978979980981982983984985986987988989990991992993994995996997998999100010011002100310041005100610071008100910101011101210131014101510161017101810191020102110221023102410251026102710281029103010311032103310341035103610371038103910401041104210431044104510461047104810491050105110521053105410551056105710581059106010611062106310641065106610671068106910701071107210731074107510761077107810791080108110821083108410851086108710881089109010911092109310941095109610971098109911001101110211031104110511061107110811091110111111121113111411151116111711181119112011211122112311241125112611271128112911301131113211331134113511361137113811391140114111421143114411451146114711481
14911501151115211531154115511561157115811591160116111621163116411651166116711681169117011711172117311741175117611771178117911801181118211831184118511861187118811891190119111921193119411951196119711981199120012011202120312041205120612071208120912101211121212131214121512161217121812191220122112221223122412251226122712281229123012311232123312341235123612371238123912401241124212431244124512461247124812491250125112521253125412551256125712581259126012611262126312641265126612671268126912701271127212731274127512761277127812791280128112821283128412851286128712881289129012911292129312941295129612971298129913001301130213031304130513061307130813091310131113121313131413151316131713181319132013211322132313241325132613271328132913301331133213331334133513361337133813391340134113421343134413451346134713481349135013511352135313541355135613571358135913601361136213631364136513661367136813691370137113721373137413751376137713781379138013811382138313841385138613871388138913901391139213931394139513961397139813991400140114021403140414051406140714081409141014111412141314141415141614171418141914201421142214231424142514261427142814291430143114321433143414351436143714381439144014411442144314441445144614471448144914501451145214531454145514561457145814591460146114621463146414651466146714681469147014711472147314741475147614771478147914801481148214831484148514861487148814891490149114921493149414951496149714981499150015011502150315041505150615071508150915101511151215131514151515161517151815191520152115221523152415251526152715281529153015311532153315341535153615371538153915401541154215431544154515461547154815491550155115521553155415551556155715581559156015611562156315641565156615671568156915701571157215731574157515761577157815791580158115821583158415851586158715881589159015911592159315941595159615971598159916001601160216031604160516061607160816091610161116121613161416151616161716181619162016211622162316241625162616271628162916301631163216331634163516361637163816391640164116421643164416451646164716481649165016511652165316541655165616571658165916601661166216631664166516661667166816691670167116721673167416751676167716781679168016811682168316841685168616871688168916901691169216931694169516961697169816991700170117021703170417051706170717081709171017111712171317141715171617171718171917201721172217231724172517261727172817291730173117321733173417351736173717381739174017411742174317441745174617471748174917501751175217531754175517561757175817591760176117621763176417651766176717681769177017711772177317741775177617771778177917801781178217831784178517861787178817891790179117921793179417951796179717981799180018011802180318041805180618071808180918101811181218131814181518161817181818191820182118221823182418251826182718281829183018311832183318341835183618371838183918401841184218431844184518461847184818491850185118521853185418551856185718581859186018611862186318641865186618671868186918701871187218731874187518761877187818791880188118821883188418851886188718881889189018911892189318941895189618971898189919001901190219031904190519061907190819091910191119121913191419151916191719181919192019211922192319241925192619271928192919301931193219331934193519361937193819391940194119421943194419451946194719481949195019511952195319541955195619571958195919601961196219631964196519661967196819691970197119721973197419751976197719781979198019811982198319841985198619871988198919901991199219931994199519961997199819992000' rand1[double precision]:6075 toasted_col2[text]:null rand2[double precision]:7574
- COMMIT
- BEGIN
- table public.toasttable: UPDATE: id[integer]:1 toasted_col1[text]:unchanged-toast-datum rand1[double precision]:79 toasted_col2[text]:'1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495969798991001011021031041051061071081091101111121131141151161171181191201211221231241251261271281291301311321331341351361371381391401411421431441451461471481491501511521531541551561571581591601611621631641651661671681691701711721731741751761771781791801811821831841851861871881891901911921931941951961971981992002012022032042052062072082092102112122132142152162172182192202212222232242252262272282292302312322332342352362372382392402412422432442452462472482492502512522532542552562572582592602612622632642652662672682692702712722732742752762772782792802812822832842852862872882892902912922932942952962972982993003013023033043053063073083093103113123133143153163173183193203213223233243253263273283293303313323333343353363373383393403413423433443453463473483493503513523533543553563573583593603613623633643653663673683693703713723733743753763773783793803813823833843853863873883893903913923933943953963973983994004014024034044054064074084094104114124134144154164174184194204214224234244254264274284294304314324334344354364374384394404414424434444454464474484494504514524534544554564574584594604614624634644654664674684694704714724734744754764774784794804814824834844854864874884894904914924934944954964974984995005015025035045055065075085095105115125135145155165175185195205215225235245255265275285295305315325335345355365375385395405415425435445455465475485495505515525535545555565575585595605615625635645655665675685695705715725735745755765775785795805815825835845855865875885895905915925935945955965975985996006016026036046056066076086096106116126136146156166176186196206216226236246256266276286296306316326336346356366376386396406416426436446456466476486496506516526536546556566576586596606616626636646656666676686696706716726736746756766776786796806816826836846856866876886896906916926936946956966976986997007017027037047057067077087097107117127137147157167177187197207217227237247257267277287297307317327337347357367377387397407417427437447457467477487497507517527537547557567577587597607617627637647657667677687697707717727737747757767777787797807817827837847857867877887897907917927937947957967977987998008018028038048058068078088098108118128138148158168178188198208218228238248258268278288298308318328338348358368378388398408418428438448458468478488498508518528538548558568578588598608618628638648658668678688698708718728738748758768778788798808818828838848858868878888898908918928938948958968978988999009019029039049059069079089099109119129139149159169179189199209219229239249259269279289299309319329339349359369379389399409419429439449459469479489499509519529539549559569579589599609619629639649659669679689699709719729739749759769779789799809819829839849859869879889899909919929939949959969979989991000100110021003100410051006100710081009101010111012101310141015101610171018101910201021102210231024102510261027102810291030103110321033103410351036103710381039104010411042104310441045104610471048104910501051105210531054105510561057105810591060106110621063106410651066106710681069107010711072107310741075107610771078107910801081108210831084108510861087108810891090109110921093109410951096109710981099110011011102110311041105110611071108110911101111111211131114111511161117111811191120112111221123112411251126112711281129113011311
1321133113411351136113711381139114011411142114311441145114611471148114911501151115211531154115511561157115811591160116111621163116411651166116711681169117011711172117311741175117611771178117911801181118211831184118511861187118811891190119111921193119411951196119711981199120012011202120312041205120612071208120912101211121212131214121512161217121812191220122112221223122412251226122712281229123012311232123312341235123612371238123912401241124212431244124512461247124812491250125112521253125412551256125712581259126012611262126312641265126612671268126912701271127212731274127512761277127812791280128112821283128412851286128712881289129012911292129312941295129612971298129913001301130213031304130513061307130813091310131113121313131413151316131713181319132013211322132313241325132613271328132913301331133213331334133513361337133813391340134113421343134413451346134713481349135013511352135313541355135613571358135913601361136213631364136513661367136813691370137113721373137413751376137713781379138013811382138313841385138613871388138913901391139213931394139513961397139813991400140114021403140414051406140714081409141014111412141314141415141614171418141914201421142214231424142514261427142814291430143114321433143414351436143714381439144014411442144314441445144614471448144914501451145214531454145514561457145814591460146114621463146414651466146714681469147014711472147314741475147614771478147914801481148214831484148514861487148814891490149114921493149414951496149714981499150015011502150315041505150615071508150915101511151215131514151515161517151815191520152115221523152415251526152715281529153015311532153315341535153615371538153915401541154215431544154515461547154815491550155115521553155415551556155715581559156015611562156315641565156615671568156915701571157215731574157515761577157815791580158115821583158415851586158715881589159015911592159315941595159615971598159916001601160216031604160516061607160816091610161116121613161416151616161716181619162016211622162316241625162616271628162916301631163216331634163516361637163816391640164116421643164416451646164716481649165016511652165316541655165616571658165916601661166216631664166516661667166816691670167116721673167416751676167716781679168016811682168316841685168616871688168916901691169216931694169516961697169816991700170117021703170417051706170717081709171017111712171317141715171617171718171917201721172217231724172517261727172817291730173117321733173417351736173717381739174017411742174317441745174617471748174917501751175217531754175517561757175817591760176117621763176417651766176717681769177017711772177317741775177617771778177917801781178217831784178517861787178817891790179117921793179417951796179717981799180018011802180318041805180618071808180918101811181218131814181518161817181818191820182118221823182418251826182718281829183018311832183318341835183618371838183918401841184218431844184518461847184818491850185118521853185418551856185718581859186018611862186318641865186618671868186918701871187218731874187518761877187818791880188118821883188418851886188718881889189018911892189318941895189618971898189919001901190219031904190519061907190819091910191119121913191419151916191719181919192019211922192319241925192619271928192919301931193219331934193519361937193819391940194119421943194419451946194719481949195019511952195319541955195619571958195919601961196219631964196519661967196819691970197119721973197419751976197719781979198019811982198319841985198619871988198919901991199219931994199519961997199819992000' rand2[double precision]:1578
- COMMIT
-(6 rows)
-
+ERROR:  invalid memory alloc request size 94222912472104
 -- done, free logical replication slot
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
- data 
-------
-(0 rows)
-
+ERROR:  invalid memory alloc request size 94222912472104
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
 --------------------------
#282Mahendra Singh Thalor
mahi6run@gmail.com
In reply to: Mahendra Singh Thalor (#281)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, 29 Apr 2020 at 11:15, Mahendra Singh Thalor <mahi6run@gmail.com> wrote:

On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote:

On 2020-04-23 05:24, Dilip Kumar wrote:

On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:

The 'ddl' one is apparently not quite fixed - I get this in '(cd
contrib; make check)' (in both assert-enabled and non-assert-enabled
builds)

Can you send me the contrib/test_decoding/regression.diffs file?

Attached.

So from regression.diffs, it appears that it is failing in memory
allocation (+ERROR: invalid memory alloc request size
94119198201896). My colleague tried to reproduce this in a different
environment but has had no success so far. One more thing that
surprises me is that after
v15-0011-Provide-new-api-to-get-the-streaming-changes.patch it should
never take the streaming path at all. However, we cannot ignore the
fact that some of the changes might impact the non-streaming path as
well. Is it possible for you to somehow stop or break the code and
send the stack trace? One idea: from the log we can see where the
error is raised, i.e. MemoryContextAlloc or palloc or some other
similar function. Once we know that, we can convert that error to an
assert and find the call stack.

--

Thanks Erik for reporting this issue.

I am able to reproduce this issue (+ERROR: invalid memory alloc
request size) on top of the v16 patch set. I applied all 12 patches
of the v16 series and then ran "make check -i" from the
"contrib/test_decoding" folder. Below is the stack trace of the error:

#0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70,
size=94605581787992) at mcxt.c:806
#1 0x0000560b130f0ad5 in ReorderBufferRestoreChange
(rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at
reorderbuffer.c:3680
#2 0x0000560b130f0662 in ReorderBufferRestoreChanges
(rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10,
segno=0x560b1418ad20) at reorderbuffer.c:3564
#3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90,
txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186
#4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90,
txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8,
command_id=0, streaming=false)
at reorderbuffer.c:1785
#5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90,
xid=508, commit_lsn=25986584, end_lsn=25989088,
commit_time=641449268431600, origin_id=0, origin_lsn=0)
at reorderbuffer.c:2315
#6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80,
buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654
#7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80,
buf=0x7ffef18b19b0) at decode.c:261
#8 0x0000560b130cf99a in LogicalDecodingProcessRecord
(ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130
#9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts
(fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false)
at logicalfuncs.c:285
#10 0x0000560b130dbe71 in pg_logical_slot_get_changes
(fcinfo=0x560b1417ee50) at logicalfuncs.c:354
#11 0x0000560b12e294d4 in ExecMakeTableFunctionResult
(setexpr=0x560b14177838, econtext=0x560b14177748,
argContext=0x560b1417ed30, expectedDesc=0x560b141814a0,
randomAccess=false) at execSRF.c:234
#12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at
nodeFunctionscan.c:94
#13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630,
accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
<FunctionRecheck>) at execScan.c:133
#14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630,
accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
<FunctionRecheck>) at execScan.c:199
#15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at
nodeFunctionscan.c:270
#16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at
execProcnode.c:450
#17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at
../../../src/include/executor/executor.h:245
#18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40)
at nodeAgg.c:566
#19 0x0000560b12e4398f in agg_fill_hash_table
(aggstate=0x560b14176f40) at nodeAgg.c:2518
#20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139
#21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at
execProcnode.c:450
#22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at
../../../src/include/executor/executor.h:245
#23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108
#24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at
execProcnode.c:450
#25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at
../../../src/include/executor/executor.h:245
#26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0,
planstate=0x560b14176d28, use_parallel_mode=false,
operation=CMD_SELECT, sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x560b1419d188,
execute_once=true) at execMain.c:1646
#27 0x0000560b12e11a19 in standard_ExecutorRun
(queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0,
execute_once=true) at execMain.c:364
#28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:308
#29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860,
forward=true, count=0, dest=0x560b1419d188) at pquery.c:912
#30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x560b1419d188, altdest=0x560b1419d188,
qc=0x7ffef18b2350) at pquery.c:756
#31 0x0000560b131e550b in exec_simple_query (
query_string=0x560b14076720 "/ display results, but hide most of the
output /\nSELECT count(*), min(data), max(data)\nFROM
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at
postgres.c:1239
#32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0,
dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830
"mahendrathalor") at postgres.c:4315
#33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510
#34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at
postmaster.c:4202
#35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727
#36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010)
at postmaster.c:1400
#37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210

I have an Ubuntu setup; I think this reproduces on Ubuntu only. I am
looking into this issue with Dilip.

This error is due to an invalid size:

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index eed9a5048b..487c1b4252 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3678,7 +3678,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				change->data.inval.invalidations =
 						MemoryContextAlloc(rb->context,
-										   change->data.msg.message_size);
+										   inval_size);
 				/* read the message */
 				memcpy(change->data.inval.invalidations, data, inval_size);
 				data += inval_size;

The above change fixes the error. Thanks Dilip for helping.
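
To spell out why that one-word change matters: change->data is a
union, and for a REORDER_BUFFER_CHANGE_INVALIDATION change the msg arm
is never initialized. A minimal sketch of the relevant layout
(paraphrasing ReorderBufferChange from reorderbuffer.h, with the inval
arm added by this patch series; abbreviated, other arms omitted):

	typedef struct ReorderBufferChange
	{
		/* ... lsn, action, etc. ... */
		union
		{
			struct
			{
				const char *prefix;
				Size		message_size;	/* on 64-bit, overlaps inval.invalidations */
				char	   *message;
			}			msg;
			struct
			{
				uint32		ninvalidations;
				SharedInvalidationMessage *invalidations;
			}			inval;
			/* ... other arms ... */
		}			data;
	} ReorderBufferChange;

Reading change->data.msg.message_size for an invalidation change thus
picks up whatever bytes the inval arm left in that slot, most likely
the stale invalidations pointer that was serialized to disk along with
the struct, which would explain why the bogus allocation sizes
(94119198201896, 94222912472104, ...) look like process addresses.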

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

#283Dilip Kumar
dilipbalaut@gmail.com
In reply to: Mahendra Singh Thalor (#282)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Apr 29, 2020 at 12:37 PM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:

On Wed, 29 Apr 2020 at 11:15, Mahendra Singh Thalor <mahi6run@gmail.com> wrote:

On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut@gmail.com> wrote:

[quoted stack trace and proposed fix trimmed; see #282]

The above change fixes the error. Thanks Dilip for helping.

Thanks, Mahendra, for reproducing this and helping to fix it. I will
include this change in my next patch set.

#284Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#279)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

[latest patches]

v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
..
@@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
 	bool		valid;

 	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+				 !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+

I think comments and code don't match. In the comment, we are saying
that via output plugins access to user catalog tables or regular
system catalog tables won't be allowed via heap_* APIs but code
doesn't seem to reflect it. I feel only
TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the
original discussion about this point [1] (Refer "I think it'd also be
good to add assertions to codepaths not going through systable_*
asserting that ...").

Right. So I think we can just add an assert in these functions:
Assert(!TransactionIdIsValid(CheckXidAlive))?

Isn't it better to block the scan to user catalog tables or regular
system catalog tables for tableam scan APIs rather than at the heap
level? There might be some APIs like heap_getnext where such a check
might still be required but I guess it is still better to block at
tableam level.

[1] - /messages/by-id/20180726200241.aje4dv4jsv25v4k2@alap3.anarazel.de

Okay, let me analyze this part. In some places we would have to keep
the check at the heap level (like heap_getnext) and in other places at
the tableam level, so it seems a bit inconsistent. Also, I think the
number of checks is going to increase, because some of the heap
functions, like heap_hot_search_buffer, are called from multiple
tableam calls, so we would need to put the check in every place.

Another point is that I feel some of the checks we have today might
not be required; heap_finish_speculative, for example, is not fetching
any tuple for us, so why do we need to care about that function?

While testing these changes, I have noticed that the systable_* APIs
internally call the tableam APIs, so if we just add
Assert(!TransactionIdIsValid(CheckXidAlive)) it will always fire,
whether we put the assert in the heap APIs or in the tableam APIs,
because systable_* always accesses the heap through the tableam APIs.

Refer to the call stack below:
#0 table_index_fetch_tuple (scan=0x2392558, tid=0x2392270,
snapshot=0x2392178, slot=0x2391f60, call_again=0x2392276,
all_dead=0x7fff4b6cc89e)
at ../../../../src/include/access/tableam.h:1035
#1 0x00000000005100b6 in index_fetch_heap (scan=0x2392210,
slot=0x2391f60) at indexam.c:577
#2 0x00000000005101ea in index_getnext_slot (scan=0x2392210,
direction=ForwardScanDirection, slot=0x2391f60) at indexam.c:637
#3 0x000000000050e8f9 in systable_getnext (sysscan=0x2391f08) at genam.c:474
#4 0x0000000000aa44a2 in RelidByRelfilenode (reltablespace=0,
relfilenode=16593) at relfilenodemap.c:213
#5 0x00000000008a64da in ReorderBufferProcessTXN (rb=0x23734b0,
txn=0x2398e28, commit_lsn=23953600, snapshot_now=0x237b168,
command_id=0, streaming=false)
at reorderbuffer.c:1823
#6 0x00000000008a7201 in ReorderBufferCommit (rb=0x23734b0, xid=518,
commit_lsn=23953600, end_lsn=23953648, commit_time=641466886013448,
origin_id=0, origin_lsn=0)
at reorderbuffer.c:2315
#7 0x00000000008985b1 in DecodeCommit (ctx=0x22e16a0,
buf=0x7fff4b6cce30, parsed=0x7fff4b6ccca0, xid=518) at decode.c:654
#8 0x0000000000897a76 in DecodeXactOp (ctx=0x22e16a0,
buf=0x7fff4b6cce30) at decode.c:261
#9 0x0000000000897739 in LogicalDecodingProcessRecord (ctx=0x22e16a0,
record=0x22e19a0) at decode.c:130

So basically, the problem is that we cannot distinguish whether the
tableam/heap routine was called directly or via systable_*. Now I
understand that the current code was actually raising the error for
user tables, not system tables, on the assumption that a system table
reaches this function only via systable_*, and only a user table can
get here directly. So if the relation is not a system table, i.e. we
reached here directly, we error out. But if the check does not apply
to system tables, I am not sure what the purpose of throwing that
error is.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#285Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#284)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

[earlier discussion trimmed; see #284]

Putting some more thought into this, I am wondering why we really want
any such check at all: we always get the relation descriptor from the
reorder buffer code, not from the pgoutput plugin. And our main
concern with a concurrent abort is that we must not pick up a wrong
catalog entry while decoding a tuple. So if we always fetch our
relation entry via RelationIdGetRelation, why should we care how the
output plugin accesses system/user relations?
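
As a rough sketch of that flow (paraphrasing the per-change relation
lookup in ReorderBufferProcessTXN, per the call stack quoted earlier;
abbreviated and not a literal excerpt):

	/* inside ReorderBufferProcessTXN(), for each data change */
	reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
								change->data.tp.relnode.relNode);
	relation = RelationIdGetRelation(reloid);	/* historic lookup */
	/* ... sanity checks on reloid/relation ... */
	rb->apply_change(rb, txn, relation, change);	/* hand to plugin */
	RelationClose(relation);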

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#286Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#285)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Apr 29, 2020 at 3:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

[earlier discussion trimmed; see #285]

Putting some more thought into this, I am wondering why we really want
any such check at all: we always get the relation descriptor from the
reorder buffer code, not from the pgoutput plugin.

But can't they access other catalogs like pg_publication*? I think
the basic thing we want to ensure here is that all historic accesses
always use the systable_* APIs to access catalogs. We can ensure that
by having Asserts (or elog(ERROR, ...)) in the heap/tableam APIs.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#287Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#286)
12 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Apr 29, 2020 at 3:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

[earlier discussion trimmed; see #285]

Putting some more thought into this, I am wondering why we really want
any such check at all: we always get the relation descriptor from the
reorder buffer code, not from the pgoutput plugin.

But can't they access other catalogs like pg_publication*? I think
the basic thing we want to ensure here is that all historic accesses
always use the systable_* APIs to access catalogs. We can ensure that
by having Asserts (or elog(ERROR, ...)) in the heap/tableam APIs.

Yeah, it can. So I have changed it now: along with CheckXidAlive, I
have kept one more flag, which is set whenever CheckXidAlive is set
and we pass through systable_beginscan. So when a tableam API is
accessed, we check that if CheckXidAlive is set then the other flag
must also be set; otherwise we throw an error.
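
A minimal standalone sketch of that scheme (all names here are
illustrative stand-ins, not the actual patch code; in the real patch
the xid variable is CheckXidAlive and the sanctioned path is the
systable_* layer):

	#include <stdio.h>
	#include <stdlib.h>
	#include <stdbool.h>

	typedef unsigned int TransactionId;
	#define InvalidTransactionId		((TransactionId) 0)
	#define TransactionIdIsValid(xid)	((xid) != InvalidTransactionId)

	/* xid being decoded, if any (stand-in for CheckXidAlive) */
	static TransactionId CheckXidAlive = InvalidTransactionId;

	/* set while inside a systable_* scan: the "one more flag" */
	static bool in_systable_scan = false;

	/* stand-in for a tableam scan entry point */
	static void
	tableam_scan(void)
	{
		/* historic access must come through the systable_* layer */
		if (TransactionIdIsValid(CheckXidAlive) && !in_systable_scan)
		{
			fprintf(stderr, "ERROR:  unexpected table access during logical decoding\n");
			exit(1);
		}
		printf("scan ok\n");
	}

	/* stand-in for systable_beginscan/systable_getnext */
	static void
	systable_scan(void)
	{
		bool		saved = in_systable_scan;

		if (TransactionIdIsValid(CheckXidAlive))
			in_systable_scan = true;	/* mark the sanctioned path */
		tableam_scan();
		in_systable_scan = saved;
	}

	int
	main(void)
	{
		CheckXidAlive = 508;	/* pretend we are decoding in-progress xid 508 */
		systable_scan();		/* fine: came in via systable_* */
		tableam_scan();			/* errors out: direct tableam access */
		return 0;
	}

The point is that the low-level routine no longer needs to guess
whether its caller is systable_* or an output plugin; the flag records
it explicitly.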

Apart from this, I have also fixed one defect raised by my colleague
Neha Sharma: the incomplete-toast-tuple flag was not reset when the
main table tuple was inserted through a speculative insert, and
because the flag was never reset, the data was not streamed even when
we later got the speculative confirm. This patch set also includes
the fix for the issue raised by Erik.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v17-0002-Issue-individual-invalidations-with-wal_level-lo.patchapplication/octet-stream; name=v17-0002-Issue-individual-invalidations-with-wal_level-lo.patchDownload
From c80a5d6d75a3af148e8a35dae0168db9ec00a4be Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v17 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager.
See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulated all the invalidations in memory
and wrote them only once, at commit time, which may have reduced the
performance impact by amortizing the overhead and deduplicating the
invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c          |  40 +++++++++
 src/backend/access/transam/xact.c               |   7 ++
 src/backend/replication/logical/decode.c        |  16 ++++
 src/backend/replication/logical/reorderbuffer.c | 104 +++++++++++++++++++++---
 src/backend/utils/cache/inval.c                 |  49 +++++++++++
 src/include/access/xact.h                       |  13 ++-
 src/include/replication/reorderbuffer.h         |  11 +++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index fbc5942..17c06f7 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c2604bb..8e6b1a6 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581..69c1f45 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9..b889edf 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2204,6 +2218,34 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
- * Setup the invalidation of the toplevel transaction.
+ * Queue a change to execute the given invalidation messages at the right
+ * point while replaying the transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2591,6 +2632,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3004,6 +3068,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	oldsnap;
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..cba5b6c 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end
+ *	to support decoding of in-progress transactions.  Until now it was enough
+ *	to log invalidations only at commit, because we only decoded a transaction
+ *	once it had committed.  We only need to log catalog cache and relcache
+ *	invalidations; there cannot be any active MVCC scan in logical decoding,
+ *	so we need not log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages = NULL;
+	int			nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+										 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b38..b822c5e 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..af35287 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * messages */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void		ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+										 int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1
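
As a side note for reviewers, here is roughly what the new
XLOG_XACT_INVALIDATIONS record buys us. With wal_level=logical, a
transaction mixing DDL and DML now logs its invalidations at each
command end, so a decoder can execute them while the transaction is
still in progress instead of learning about them only from the commit
record. A hypothetical session (table name made up):

    BEGIN;
    -- catcache/relcache invalidations for test_tab are WAL-logged here,
    -- at command end, not only in the eventual commit record
    ALTER TABLE test_tab ADD COLUMN c int;
    -- an in-progress decoder sees the new column for this insert,
    -- because it already executed the invalidations queued above
    INSERT INTO test_tab VALUES (1, 'x', 2);
    COMMIT;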

Attachment: v17-0008-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From bc052dfc801a182ee116990c12a2529cc718d958 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v17 08/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71..086d0c7 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e..ad3ed13 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7..0c9c6b3 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1
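
For anyone trying this out by hand, the only user-visible change the
tests rely on is the new subscription option. A minimal sketch, assuming
the subscription-level option added earlier in this series (connection
string made up):

    CREATE SUBSCRIPTION tap_sub
      CONNECTION 'host=publisher dbname=postgres application_name=tap_sub'
      PUBLICATION tap_pub
      WITH (streaming = on);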

Attachment: v17-0007-Track-statistics-for-streaming.patch (application/octet-stream)
From 03b43d8ea2d5484e0b99ce97c02651e73e9487b3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 2 Apr 2020 13:19:29 +0530
Subject: [PATCH v17 07/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                    | 25 +++++++++++++++++++
 src/backend/catalog/system_views.sql            |  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 13 ++++++++++
 src/backend/replication/walsender.c             | 32 +++++++++++++++++++++----
 src/include/catalog/pg_proc.dat                 |  6 ++---
 src/include/replication/reorderbuffer.h         | 13 ++++++----
 src/include/replication/walsender_private.h     |  5 ++++
 src/test/regress/expected/rules.out             |  7 ++++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6562cc4..d8bf587 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2063,6 +2063,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber
+      after the memory used by logical decoding exceeds
+      <literal>logical_decoding_work_mem</literal>.  Streaming only works with
+      toplevel transactions (subtransactions can't be streamed independently),
+      so the counter does not get incremented for subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the
+      subscriber.  Transactions may get streamed repeatedly, and this counter
+      gets incremented on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d406ea8..65d650d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index efee067..a518fff 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3289,6 +3293,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't double-count transactions that have already been streamed once. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index eacea12..d0028d9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1333,7 +1333,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1354,7 +1354,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or were
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2396,6 +2397,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3237,7 +3241,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3295,6 +3299,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3320,6 +3327,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3422,6 +3432,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3670,11 +3685,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4bce3ad..9fb1ffe 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6d65986..603f325 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -517,15 +517,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of toplevel transactions streamed */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec..b997d17 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to the subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac31840..68e2deb 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
1.8.3.1
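
With the view extended like this, comparing how much a walsender has
spilled to disk vs. streamed to the subscriber becomes a simple query,
e.g.:

    SELECT application_name,
           spill_txns, spill_count, pg_size_pretty(spill_bytes) AS spilled,
           stream_txns, stream_count, pg_size_pretty(stream_bytes) AS streamed
      FROM pg_stat_replication;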

Attachment: v17-0009-Add-TAP-test-for-streaming-vs.-DDL.patch (application/octet-stream)
From 2b774b2bb420efa368f9bc6bc896c11ea055ae7b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v17 09/12] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of a large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1
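
The scenario the TAP test exercises is easy to reproduce by hand, too.
With a small decoding memory limit on the publisher, any transaction
whose decoded changes exceed the limit should show up in the streaming
counters added by the statistics patch:

    -- in postgresql.conf on the publisher:
    --   logical_decoding_work_mem = 64kB
    BEGIN;
    INSERT INTO test_tab SELECT i, md5(i::text)
      FROM generate_series(3, 10000) s(i);
    COMMIT;
    -- then, on the publisher:
    SELECT stream_txns, stream_count, stream_bytes FROM pg_stat_replication;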

Attachment: v17-0010-Bugfix-handling-of-incomplete-toast-tuple.patch (application/octet-stream)
From c9238a0550133a21f370d934cb036b5c293562ea Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 28 Feb 2020 11:07:46 +0530
Subject: [PATCH v17 10/12] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c                |   3 +
 src/backend/replication/logical/decode.c        |  17 ++-
 src/backend/replication/logical/reorderbuffer.c | 186 +++++++++++++++---------
 src/include/access/heapam_xlog.h                |   1 +
 src/include/replication/reorderbuffer.h         |  24 ++-
 5 files changed, 151 insertions(+), 80 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d854c45..ee29f05 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1953,6 +1953,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45..c841687 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index a518fff..375e996 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -178,6 +178,11 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define ChangeIsInsertOrUpdate(action) \
+			(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+			((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+			((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -654,11 +659,14 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
-	ReorderBufferTXN *txn;
+	ReorderBufferTXN *txn, *toptxn;
+	bool	can_stream = false;
 
-	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	toptxn = txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
 
 	change->lsn = lsn;
 	change->txn = txn;
@@ -668,9 +676,49 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert, set the corresponding bit.  Otherwise, if
+	 * the toast-insert bit is set and this is an insert/update, clear the
+	 * bit, as the tuple is now complete.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 ChangeIsInsertOrUpdate(change->action))
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+		can_stream = true;
+	}
+
+	/*
+	 * If this is a speculative insert, set the corresponding bit.  Otherwise,
+	 * if the speculative-insert bit is set and this is a spec-confirm record,
+	 * clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+		can_stream = true;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/*
+	 * If streaming is enabled and we had serialized this transaction only
+	 * because it had an incomplete tuple, then now that the tuple is
+	 * complete we can stream it.
+	 */
+	if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+		&& !rbtxn_has_toast_insert(toptxn) && !rbtxn_has_spec_insert(toptxn))
+	{
+		ReorderBufferStreamTXN(rb, toptxn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
+	}
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -700,7 +748,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -2463,7 +2511,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2512,7 +2560,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2535,6 +2583,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2549,8 +2598,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
-	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	/* if streaming is supported, also track the size in the toplevel txn */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2558,12 +2612,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2624,7 +2686,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2811,15 +2873,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and doesn't have incomplete
+		 * data, remember it.
+		 */
+		if (((!largest) || (txn->total_size > largest->total_size)) &&
+			((txn->size > 0) && !rbtxn_has_toast_insert(txn) &&
+			 !rbtxn_has_spec_insert(txn)))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2837,66 +2900,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/* Loop until we are back under the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* the picked transaction must be a non-empty toplevel one */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
-		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
-		 */
-		txn = ReorderBufferLargestTXN(rb);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferSerializeTXN(rb, txn);
+		/*
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
+		 */
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 603f325..ba2ab71 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* Does this transaction have a toast insert without the main-table insert? */
+#define rbtxn_has_toast_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)
+
+/*
+ * Does this transaction have a speculative insert without the matching
+ * speculative confirm?
+ */
+#define rbtxn_has_spec_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -355,6 +364,9 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -545,7 +557,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool incomplete_data);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

v17-0006-Add-support-for-streaming-to-built-in-replicatio.patch
From 4c2d785d072db413017f93bf49215a50b97e2f67 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 16 Apr 2020 01:55:22 -0700
Subject: [PATCH v17 06/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to carry additional bits of information (e.g. the
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions, by spilling the data to disk and then
replaying it on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere to
send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    4 +-
 doc/src/sgml/ref/create_subscription.sgml          |   12 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   45 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    3 +
 src/backend/replication/logical/launcher.c         |    1 -
 src/backend/replication/logical/logical.c          |    4 +-
 src/backend/replication/logical/proto.c            |  140 ++-
 src/backend/replication/logical/worker.c           | 1035 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  318 +++++-
 src/backend/replication/slotfuncs.c                |    6 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 22 files changed, 2045 insertions(+), 43 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
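
To make the three bullets above concrete, here is a purely illustrative, condensed view of how the new messages flow through the apply worker. The real dispatch lives in apply_dispatch() in worker.c below, and data messages are routed through their regular handlers, which redirect to the spool file via handle_streamed_transaction():

static void
apply_streamed_message_sketch(char action, StringInfo s)
{
	switch (action)
	{
		case 'S':			/* open (or append to) the spool file */
			apply_handle_stream_start(s);
			break;
		case 'E':			/* persist subxact offsets, close the file */
			apply_handle_stream_stop(s);
			break;
		case 'A':			/* truncate the file at the subxact's offset */
			apply_handle_stream_abort(s);
			break;
		case 'c':			/* replay the spooled changes, then commit */
			apply_handle_stream_commit(s);
			break;
		default:			/* data message: spool it for later replay */
			stream_write_change(action, s);
			break;
	}
}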

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8bead..95b7c24 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c24..3349cc4 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731..f28482f 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 7f15667..65b6b76 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming_given)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
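
A condensed restatement of the pattern above, as a hypothetical helper: the (streaming, streaming_given) pair acts as a tri-state, so the catalog column is only replaced when the user actually specified the option:

static void
substream_update_sketch(bool streaming, bool streaming_given,
						Datum *values, bool *replaces)
{
	/* leave the catalog column alone unless streaming was specified */
	if (streaming_given)
	{
		values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
		replaces[Anum_pg_subscription_substream - 1] = true;
	}
}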
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e..4ef4fd4 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4133,6 +4133,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
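
With this in place, the walreceiver's start-streaming command ends up looking roughly as follows (a sketch with made-up slot, LSN and publication values):

static void
build_startstreaming_cmd_sketch(StringInfo cmd)
{
	/* slot name, LSN and publication below are made-up values */
	appendStringInfoString(cmd,
						   "START_REPLICATION SLOT \"mysub\" LOGICAL 0/1A2B3C4"
						   " (proto_version '2', streaming 'on',"
						   " publication_names '\"mypub\"')");
}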
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e..8156a42 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 497d8a9..dfc681d 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1148,7 +1148,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1193,7 +1193,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = txn->final_lsn;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..5242ac0 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (we're committing a streamed transaction, so it must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID (we're aborting a streamed transaction, so it must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
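
Putting the new helpers together, one streamed chunk is framed as below. This sketch writes everything into a single buffer for readability; in reality each message is a separate protocol message, prepared and written through the output plugin API:

static void
stream_one_chunk_sketch(StringInfo out, ReorderBufferTXN *txn,
						Relation rel, HeapTuple newtuple, bool first_segment)
{
	/* 'S': carries the toplevel XID and a first-segment marker */
	logicalrep_write_stream_start(out, txn->xid, first_segment);

	/* in-stream data messages carry the (sub)transaction XID up front */
	logicalrep_write_insert(out, txn->xid, rel, newtuple);

	/* 'E': no payload, just marks the chunk boundary */
	logicalrep_write_stream_stop(out);

	/* once the transaction commits upstream, a final 'c' message: */
	/* logicalrep_write_stream_commit(out, txn, commit_lsn); */
}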
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a12..a58442e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * also requires dealing with aborts of both the toplevel transaction and
+ * of individual subtransactions. This is achieved by tracking the file
+ * offset of each subtransaction's first change, which is then used to
+ * truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory, and the filenames
+ * include both the XID of the toplevel transaction and the OID of the
+ * subscription, so that different workers processing a remote transaction
+ * with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
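
The offset-tracking scheme described above boils down to the following hypothetical, condensed helper (the real logic is split across apply_handle_stream_abort() and the subxact_info_* functions below):

static void
truncate_aborted_subxact_sketch(const char *changes_path,
								SubXactInfo *subxacts, uint32 *nsubxacts,
								TransactionId subxid)
{
	uint32		i;

	/* scan from the tail, recent subxacts are the likely abort targets */
	for (i = *nsubxacts; i > 0; i--)
	{
		if (subxacts[i - 1].xid == subxid)
		{
			/* drop the subxact's changes and everything queued after it */
			if (truncate(changes_path, subxacts[i - 1].offset) != 0)
				elog(ERROR, "could not truncate \"%s\": %m", changes_path);
			*nsubxacts = i - 1;
			return;
		}
	}

	/* not found: the subxact never wrote a change, nothing to do */
}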
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;						/* XID of the subxact */
+	off_t           offset;					/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -553,6 +658,321 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, restore the subxact info that was
+	 * serialized at the previous stream_stop.
+	 *
+	 * XXX Note that the cleanup of stale files is performed by
+	 * stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
+	 * just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive an abort
+		 * for a toplevel transaction whose changes we never received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting one of the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+			return;
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	/* XXX Should this be allocated in another memory context? */
+
+	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	ensure_transaction();
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update origin state so we can restart streaming from correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+
+	pfree(buffer);
+	pfree(s2.data);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -565,6 +985,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1003,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1042,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1160,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1305,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1678,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1819,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1478,6 +1932,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1493,6 +1963,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2414,567 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so we can simply ignore it (the subxact is already known).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions), so we simply loop
+	 * through the array and find the index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: the length (not including
+ * the length field itself), the action code (identifying the message type),
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3140,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
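
For reference, the two on-disk formats used above, shown as structs purely for illustration; the code writes and reads the individual fields with separate write()/read() calls rather than these structs:

/* one record in the .changes spool file (see stream_write_change) */
typedef struct SpooledChangeSketch
{
	int		len;			/* action byte + payload, excluding len itself */
	char	action;			/* original message type: 'I', 'U', 'D', ... */
	char	payload[FLEXIBLE_ARRAY_MEMBER]; /* message minus subxact XID */
} SpooledChangeSketch;

/* the whole .subxacts file (see subxact_info_write/subxact_info_read) */
typedef struct SubXactFileSketch
{
	uint32		checksum;	/* CRC32C over nsubxacts and the array */
	uint32		nsubxacts;	/* number of entries that follow */
	SubXactInfo	subxacts[FLEXIBLE_ARRAY_MEMBER];	/* (xid, offset) pairs */
} SubXactFileSketch;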
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 77b85fc..811706a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,54 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream side is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent. So streamed transactions are
+ * tracked separately, by remembering for which streamed toplevel
+ * transactions the schema has already been sent (see streamed_txns below).
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +119,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +145,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +208,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +237,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +260,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +281,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version,
+		 * and only when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +370,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for this change. We don't
+	 * care whether it's the toplevel transaction or a subtransaction (we
+	 * have already sent the toplevel XID at the start of the streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those are applied only later (if at all), and
+	 * possibly in an order that we don't know at this point, so the
+	 * regular schema_sent flag is not enough for them.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change, and
+		 * such a change may occur after streaming has already started, so we
+		 * have to track new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +427,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +469,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +490,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +522,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +542,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +566,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +586,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +611,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +643,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -588,6 +724,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
+ * Open a block of streamed changes for the given transaction.
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * Close a block of streamed changes for the given transaction.
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -624,6 +845,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions, so a list is fine.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -750,12 +999,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		/* Only touch entries whose schema was sent in this transaction. */
+		if (list_member_int(entry->streamed_txns, xid))
+		{
+			if (is_commit)
+				entry->schema_sent = true;
+
+			/* Remove the xid from the schema sent list. */
+			entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+		}
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -790,7 +1072,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index f776de3..9121420 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -156,6 +156,12 @@ create_logical_replication_slot(char *name, char *plugin,
 									NULL);
 
 	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
+	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0e93322..eacea12 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1004,6 +1004,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9..3b3e1fd 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -981,7 +981,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561..89158ed 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f1aa6e9..70d39f8 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
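
For reference, the new 'streaming' member simply becomes one more option in
the START_REPLICATION command issued by the apply worker. Roughly (a sketch
only - the exact option quoting is handled by libpqwalreceiver, and the slot
and publication names are made up):

    START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
        (proto_version '2', publication_names '"tap_pub"', streaming 'on')
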
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

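To exercise the streaming path by hand, the TAP tests above boil down to
roughly this (a sketch: it assumes a publisher with
logical_decoding_work_mem = 64kB, a matching test_tab on the subscriber,
and - depending on the patch version - streaming may need to be enabled
explicitly on the subscription):

    -- publisher
    CREATE PUBLICATION tap_pub FOR TABLE test_tab;

    -- subscriber (possibly WITH (streaming = on))
    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=... dbname=postgres application_name=tap_sub'
        PUBLICATION tap_pub;

    -- publisher: large enough to exceed the memory limit, so the changes
    -- are streamed to the subscriber while the transaction is still open
    BEGIN;
    INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
    COMMIT;
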
Attachment: v17-0012-Add-streaming-option-in-pg_dump.patch (application/octet-stream)
From 047408ec0612e2686bc78bed1c94440e7256850c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v17 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 5db4f57..11db7b7 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4210,6 +4210,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4244,8 +4245,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4258,6 +4259,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4274,6 +4276,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4351,6 +4354,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBufferStr(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 61c909e..5c5b072 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
1.8.3.1

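With this, a subscription with substream set would be dumped along these
lines (a sketch; pg_dump's usual connect = false / slot_name handling for
subscriptions is assumed):

    CREATE SUBSCRIPTION tap_sub CONNECTION 'host=... dbname=postgres'
        PUBLICATION tap_pub
        WITH (connect = false, slot_name = 'tap_sub', streaming = on);
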
Attachment: v17-0011-Provide-new-api-to-get-the-streaming-changes.patch (application/octet-stream)
From 98913f7d6f706a4b441a235039d554806074b09d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 22 Apr 2020 16:33:07 +0530
Subject: [PATCH v17 11/12] Provide new api to get the streaming changes

---
 src/backend/catalog/system_views.sql           |  8 ++++++++
 src/backend/replication/logical/logicalfuncs.c | 23 ++++++++++++++++++-----
 src/include/catalog/pg_proc.dat                |  9 +++++++++
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 65d650d..d9ab14b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1242,6 +1242,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index f5384f1..7561141 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -237,6 +238,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 									LogicalOutputPrepareWrite,
 									LogicalOutputWrite, NULL);
 
+		/* If the caller has not asked for streaming changes, disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -347,7 +351,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -356,7 +369,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -365,7 +378,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -374,7 +387,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9fb1ffe..3dfc5c1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10117,6 +10117,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
1.8.3.1

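The new function makes it possible to observe streaming from plain SQL, for
example with the test_decoding plugin (a sketch; the slot and table names
are made up, and the open transaction must exceed logical_decoding_work_mem):

    SELECT 'init'
      FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');

    -- session 1: keep a large transaction open
    BEGIN;
    INSERT INTO stream_test
    SELECT repeat(md5(i::text), 10) FROM generate_series(1, 10000) s(i);

    -- session 2: the in-progress changes come back as streamed blocks
    -- ("opening a streamed block ...", "streaming change ...") instead of
    -- being returned only after COMMIT
    SELECT data
      FROM pg_logical_slot_get_streaming_changes('regression_slot', NULL, NULL);
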
Attachment: v17-0003-Extend-the-output-plugin-API-with-stream-methods.patch (application/octet-stream)
From 2266383371cecf76031057254c0b55d27118c042 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v17 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes for large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 +++++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  57 +++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..64f651f 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe..65244b1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
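+
+   <para>
+    A plugin that supports streaming registers these callbacks in its
+    <function>_PG_output_plugin_init</function> function, along the following
+    lines (a minimal sketch; the <literal>my_*</literal> functions are
+    hypothetical implementations with the matching callback signatures):
+<programlisting>
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+    /* regular callbacks (begin_cb, change_cb, commit_cb, ...) go here */
+
+    /* required streaming callbacks */
+    cb->stream_start_cb = my_stream_start;
+    cb->stream_stop_cb = my_stream_stop;
+    cb->stream_change_cb = my_stream_change;
+    cb->stream_commit_cb = my_stream_commit;
+    cb->stream_abort_cb = my_stream_abort;
+
+    /* optional streaming callbacks */
+    cb->stream_message_cb = my_stream_message;
+    cb->stream_truncate_cb = my_stream_truncate;
+}
+</programlisting>
+   </para>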
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course:
+    blocks of multiple streamed transactions may be interleaved, and some of
+    the transactions may get aborted instead of committed.
+   </para>
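+
+   <para>
+    For instance, blocks of two concurrently streamed transactions might be
+    interleaved like this (a hypothetical sequence; the callbacks distinguish
+    the transactions by their <literal>txn</literal> argument):
+<programlisting>
+stream_start_cb(...);   &lt;-- block of changes for transaction #1
+  stream_change_cb(...);
+stream_stop_cb(...);
+
+stream_start_cb(...);   &lt;-- block of changes for transaction #2
+  stream_change_cb(...);
+stream_stop_cb(...);
+
+stream_start_cb(...);   &lt;-- another block for transaction #1
+  stream_change_cb(...);
+stream_stop_cb(...);
+
+stream_abort_cb(...);   &lt;-- abort of transaction #2
+stream_commit_cb(...);  &lt;-- commit of transaction #1
+</programlisting>
+   </para>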
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point the largest top-level transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and streamed.
+   </para>
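+
+   <para>
+    For example, to let up to 64MB of decoded changes accumulate before the
+    largest transaction is streamed (or spilled to disk), one might configure:
+<programlisting>
+SET logical_decoding_work_mem = '64MB';
+</programlisting>
+   </para>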
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253..497d8a9 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the start/stop/change/commit/abort
+	 * callbacks. The message and truncate callbacks are optional, similar
+	 * to regular output plugins. However, we consider streaming enabled
+	 * when at least one of the callbacks is defined, so that we can easily
+	 * detect missing (required) callbacks.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -862,6 +910,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = txn->first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f..f24e246 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..0d0a94a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287..e102840 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -393,6 +439,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

Attachment: v17-0001-Immediately-WAL-log-assignments.patch
From a42f30286e2d4856850c65349903c28ada4f9445 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v17 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that information might not appear in the WAL until
commit, due to the caching of subxid assignments
(PGPROC_MAX_CACHED_SUBXIDS), preventing features that require
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is still
required to avoid overflow in the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 39 ++++++++++++++-------------
 src/include/access/xact.h                |  3 +++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 +++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3984dd3..c2604bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it must not already be marked as assigned */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 4259309..3c49954 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976..7b5257f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1189,6 +1189,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1227,6 +1228,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..122c581 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel xid is valid, we need to assign the subxact to the
+	 * toplevel xact. We need to do this for all records, hence we do it before
+	 * the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f60ed2d..6d439d0 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -229,6 +229,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196..756f6df 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -150,6 +150,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -285,6 +287,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

Attachment: v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
From ea6f204261c61d02cd9d956adff79053d4437a33 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v17 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such an sqlerrcode,
the decoding logic aborts the ongoing decoding and returns gracefully.
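
For illustration, the decoding code can catch this error roughly as
follows (a simplified sketch of the approach, not the exact code of this
patch series; "ccxt" stands for the caller's memory context saved before
entering PG_TRY and is an assumption of this sketch):

    PG_TRY();
    {
        /* ... decode and apply/stream changes of the in-progress xact ... */
    }
    PG_CATCH();
    {
        ErrorData  *errdata;

        /* switch back to the caller's context before copying the error */
        MemoryContextSwitchTo(ccxt);
        errdata = CopyErrorData();

        if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
        {
            /* concurrent abort detected: discard state and stop decoding */
            FlushErrorState();
            FreeErrorData(errdata);
        }
        else
        {
            FreeErrorData(errdata);
            PG_RE_THROW();
        }
    }
    PG_END_TRY();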
---
 doc/src/sgml/logicaldecoding.sgml  | 10 ++++---
 src/backend/access/heap/heapam.c   |  8 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 ++++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++++-
 src/include/access/tableam.h       | 41 +++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 131 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 65244b1..979844c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,9 +432,13 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables in
+     the output plugins has to be done via the <literal>systable_*</literal> scan
+     APIs only. User tables should not be accessed in the output plugins anyway.
+     Access via the <literal>heap_*</literal> scan APIs will error out. Additionally,
+     any actions leading to transaction ID assignment are prohibited. That, among
+     others, includes writing to tables, performing DDL changes, and calling
+     <literal>pg_current_xact_id()</literal>.
     </para>
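+
+    <para>
+     For example, a plugin could read a (user) catalog table using the
+     <literal>systable_*</literal> APIs roughly as follows (a simplified
+     sketch with error handling omitted; <literal>relid</literal> is assumed
+     to be the OID of the table):
+<programlisting>
+Relation    rel = table_open(relid, AccessShareLock);
+SysScanDesc scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
+HeapTuple   tuple;
+
+while ((tuple = systable_getnext(scan)) != NULL)
+{
+    /* process the tuple */
+}
+
+systable_endscan(scan);
+table_close(rel, AccessShareLock);
+</programlisting>
+    </para>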
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0d4ed60..d854c45 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,14 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to this routine when CheckXidAlive is a
+	 * valid transaction id; such calls should only come through systable_*.
+	 * CheckXidAlive is set during logical decoding of a transaction.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..77cedd9 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set, record that this scan was started through
+	 * systable_beginscan.  See the detailed comments in snapmgr.c where
+	 * these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		sysbegin_called = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * If CheckXidAlive is valid, check whether it aborted, and error out if so.
+ * We can't use TransactionIdDidAbort directly, because after a crash such a
+ * transaction might not have been marked as aborted.  See the detailed
+ * comments in snapmgr.c where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the sysbegin_called flag at the end of the systable scan.  See
+	 * the detailed comments in snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		sysbegin_called = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733..e02faa5 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -231,6 +231,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to this routine when CheckXidAlive is a
+	 * valid transaction id; such calls should only come through systable_*.
+	 * CheckXidAlive is set during logical decoding of a transaction.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c5..40af75c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,6 +154,18 @@ static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
 /*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.  If CheckXidAlive is set,
+ * the sysbegin_called flag is set when systable_beginscan is called.  This
+ * is to ensure that output plugins never access the tableam or heap APIs
+ * directly, because the check for a concurrent abort is done only in the
+ * systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool sysbegin_called = false;
+
+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2043,7 +2055,6 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 	tuplecid_data = tuplecids;
 }
 
-
 /*
  * Make catalog snapshots behave normally again.
  */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 94903dd..4daff77 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to this routine when CheckXidAlive is a
+	 * valid transaction id.  CheckXidAlive is set during logical decoding of
+	 * a transaction.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to this routine when CheckXidAlive is a
+	 * valid transaction id; such calls should only come through systable_*.
+	 * CheckXidAlive is set during logical decoding of a transaction.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to this routine when CheckXidAlive is a
+	 * valid transaction id; such calls should only come through systable_*.
+	 * CheckXidAlive is set during logical decoding of a transaction.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to this routine when CheckXidAlive is a
+	 * valid transaction id; such calls should only come through systable_*.
+	 * CheckXidAlive is set during logical decoding of a transaction.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to this routine when CheckXidAlive is a
+	 * valid transaction id, this should only come through systable_* call.
+	 * CheckXidAlive is set during logical decoding of a transactions.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13c..ae1cbe4 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	sysbegin_called;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
1.8.3.1

Attachment: v17-0005-Implement-streaming-mode-in-ReorderBuffer.patch
From 75fa5435080d56df3fb1924db70a8e848b2ff6f6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 May 2020 19:56:35 +0530
Subject: [PATCH v17 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c     |  38 +-
 src/backend/replication/logical/reorderbuffer.c | 712 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  36 ++
 3 files changed, 699 insertions(+), 87 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..160b167 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b889edf..efee067 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -773,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -865,6 +910,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -988,7 +1036,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1024,6 +1072,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1089,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1316,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1341,8 +1404,93 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * Subtransactions, in contrast, are only marked as streamed when they
+	 * actually contain changes.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding a transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1491,59 +1636,76 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has catalog updates, we might decode tuples using the
+ * wrong catalog version.  So to detect a concurrent abort we set
+ * CheckXidAlive to the xid of the (sub)transaction that this change
+ * belongs to.  Then, during catalog scans we can check the status of that
+ * xid, and if it has aborted we report a specific error that we can ignore.
+ * We might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the abort we will
+ * stream an abort message to truncate the changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Set up CheckXidAlive if the transaction is not committed yet. We don't
+	 * check whether the xid aborted; that will happen during catalog access.
+	 * Also reset the sysbegin_called flag.
 	 */
-	if (txn->base_snapshot == NULL)
+	if (!TransactionIdDidCommit(xid))
 	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
+		CheckXidAlive = xid;
+		sysbegin_called = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true then data will be sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
 
-	/* build data to be able to lookup the CommandIds of catalog tuples */
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
@@ -1564,14 +1726,17 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of the transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1579,6 +1744,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1655,7 +1831,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1695,7 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,7 +1992,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1858,14 +2053,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes: call the stream_stop callback for a
+		 * streaming transaction, and the commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last LSN of the stream as the final_lsn before calling
+			 * stream_stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise, free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+
+			/* Avoid copying if it's already copied. */
+			if (snapshot_now->copied)
+				txn->snapshot_now = snapshot_now;
+			else
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2111,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,18 +2146,131 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				FlushErrorState();
+				FreeErrorData(errdata);
+				errdata = NULL;
+
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+
+				/* Avoid copying if it's already copied. */
+				if (snapshot_now->copied)
+					txn->snapshot_now = snapshot_now;
+				else
+					txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+															  txn, command_id);
+				/*
+				 * Set the last LSN of the stream as the final_lsn before
+				 * calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+				ReorderBufferToastReset(rb, txn);
+				if (specinsert != NULL)
+				{
+					ReorderBufferReturnChange(rb, specinsert);
+					specinsert = NULL;
+				}
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
 /*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -1946,6 +2294,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2370,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2512,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we update the toplevel transaction's
+ * counters instead - subtransactions can't be streamed individually
+ * anyway, and we only ever pick toplevel transactions for eviction,
+ * so theirs are the only counters that matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2530,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2542,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2592,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2399,6 +2788,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (with streaming, the memory accounting
+ * for subtransactions is not updated, so their size stays 0). But we can simply
+ * iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
  *
@@ -2418,15 +2839,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3198,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (it might have been streamed just before the commit, in which case
+ * the commit would attempt to stream it again with nothing to send)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this is the first time this transaction is streamed */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have an invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because some new
+		 * sub-transactions might have appeared after the last streaming run, so we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e102840..6d65986 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -225,6 +244,16 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	final_lsn;
 
 	/*
+	 * Have we sent any changes for this transaction to the output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
+	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
@@ -255,6 +284,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1

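For orientation, the stream callbacks added by this patch fire in roughly
this order for one streamed transaction (a sketch assembled from
ReorderBufferStreamTXN, ReorderBufferProcessTXN and
ReorderBufferStreamCommit above; illustrative only, not a spec):

	stream_start(rb, txn);                      /* open one chunk of changes */
	stream_change(rb, txn, relation, change);   /* repeated per change */
	stream_message(rb, txn, ...);               /* messages/truncates, too */
	stream_stop(rb, txn);                       /* close the chunk */
	/* ... more start/stop chunks as more WAL is decoded ... */
	stream_commit(rb, txn, commit_lsn);         /* or stream_abort(rb, txn, lsn) */
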
#288Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#287)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

But can't they access other catalogs like pg_publication*? I think
the basic thing we want to ensure here is that all historic accesses
always use systable* APIs to access catalogs. We can ensure that via
having Asserts (or elog(ERROR, ..)) in heap/tableam APIs.

Yeah, it can. So I have changed it now: along with CheckXidLive, I
have kept one more flag, so whenever CheckXidLive is set and we pass
through systable_beginscan we will set that flag. Then while accessing
the tableam API we check that if CheckXidLive is set the other flag
must also be set; otherwise we throw an error.

Okay, I have reviewed these changes and below are my comments:

Review of  v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
--------------------------------------------------------------------
1.
+ /*
+ * If CheckXidAlive is set then set a flag that this call is passed through
+ * systable_beginscan.  See detailed  comments at snapmgr.c where these
+ * variables are declared.
+ */
+ if (TransactionIdIsValid(CheckXidAlive))
+ sysbegin_called = true;

a. How about calling this variable bsysscan or sysscan instead of
sysbegin_called?
b. There is an extra space between detailed and comments. A similar
change is required at the other places where this comment is used.
c. How about writing the first line as "If CheckXidAlive is set then
set a flag to indicate that system table scan is in-progress."

2.
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables in
+     the output plugins has to be done via the <literal>systable_*</literal> scan
+     APIs only. The user tables should not be accessed in the output plugins anyways.
+     Access via the <literal>heap_*</literal> scan APIs will error out.

The line "The user tables should not be accesed in the output plugins
anyways." seems a bit of out of place. I don't think this is required
here. If you read the previous paragraph in the same document it is
written: "Read only access to relations is permitted as long as only
relations are accessed that either have been created by
<command>initdb</command> in the <literal>pg_catalog</literal> schema,
or have been marked as user provided catalog tables using ...". I
think that is sufficient to convey the information that the newly
added line by you is trying to convey.

3.
+ /*
+ * We don't expect direct calls to this routine when CheckXidAlive is a
+ * valid transaction id, this should only come through systable_* call.
+ * CheckXidAlive is set during logical decoding of a transactions.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
+ elog(ERROR, "unexpected heap_getnext call during logical decoding");

How about changing this comment to "We don't expect direct calls to
heap_getnext with valid CheckXidAlive for catalog or regular tables.
See detailed comments at snapmgr.c where these variables are
declared."? Change the similar comment used in other places in the
patch.

For this specific API, we can also say "Normally we have such a check
at the tableam API level, but this is called from many places so we need
to ensure it here."

4.
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
+ * out.  We can't directly use TransactionIdDidAbort as after crash such
+ * transaction might not have been marked as aborted.  See detailed  comments
+ * at snapmgr.c where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort()

Can we change the comment to "Error out, if CheckXidAlive is aborted.
We can't directly use TransactionIdDidAbort as after crash such
transaction might not have been marked as aborted."

After this add one empty line and then we can say something like:
"This is a special API to check if CheckXidAlive is aborted in system
table scan APIs. See detailed comments at snapmgr.c where the
variable is declared."

5. Shouldn't we add a check in table_scan_sample_next_block and
table_scan_sample_next_tuple APIs as well?

6.
/*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.  If CheckXidAlive is set
+ * then we will set sysbegin_called flag when we call systable_beginscan.  This
+ * is to ensure that from the pgoutput plugin we should never directly access
+ * the tableam or heap apis because we are checking for the concurrent abort
+ * only in systable_* apis.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool sysbegin_called = false;

Can we change the above comment to "CheckXidAlive is an xid value
pointing to a possibly ongoing (sub)transaction. Currently, it is
used in logical decoding. It's possible that such transactions can
get aborted while the decoding is ongoing in which case we skip
decoding that particular transaction. To ensure that, we check whether
CheckXidAlive has aborted after fetching a tuple from the system
tables. We also ensure that during logical decoding we never directly
access the tableam or heap APIs because we are checking for the
concurrent aborts only in systable_* APIs."
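
Put together, the interlock that points 1, 3, 4 and 6 are describing
amounts to something like this (a sketch only, using the suggested name
bsysscan; the snippets are adapted from the hunks quoted above):

+ /* systable_beginscan: note that catalog access during decoding of
+  * xid CheckXidAlive is going through the systable_* APIs */
+ if (TransactionIdIsValid(CheckXidAlive))
+     bsysscan = true;

+ /* tableam/heap APIs, e.g. heap_getnext: reject direct access that
+  * bypassed systable_beginscan during logical decoding */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+     elog(ERROR, "unexpected heap_getnext call during logical decoding");

+ /* systable_getnext and friends: after fetching a tuple, error out
+  * with ERRCODE_TRANSACTION_ROLLBACK if CheckXidAlive has aborted */
+ HandleConcurrentAbort();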

Apart from this, I have also fixed one defect raised by my colleague
Neha Sharma: the incomplete-toast-tuple flag was not reset when the
main table tuple was inserted through a speculative insert, so the
data was not streamed even when we later got the speculative confirm,
because the flag was never reset. This patch also includes the fix for
the issue raised by Erik.

It would be better if you could mention which patches contain the
changes, as that will make the fix easier to review.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#289Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#288)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

5. Shouldn't we add a check in table_scan_sample_next_block and
table_scan_sample_next_tuple APIs as well?

I am not sure that we need to do that, because generally we want to
avoid fetching a wrong system table tuple that could be used for
decoding a tuple or for taking some decision. But I don't think that
table_scan_sample falls under that category.

Apart from this, I have also fixed one defect raised by my colleague
Neha Sharma: the incomplete-toast-tuple flag was not reset when the
main table tuple was inserted through a speculative insert, so the
data was not streamed even when we later got the speculative confirm,
because the flag was never reset. This patch also includes the fix for
the issue raised by Erik.

It would be better if you could mention which patches contain the
changes, as that will make the fix easier to review.

Fix1: v17-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
Fix2: patch: v17-0002-Issue-individual-invalidations-with-wal_level-lo.patch

I will work on other comments and send the updated patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#290Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#289)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

5. Shouldn't we add a check in table_scan_sample_next_block and
table_scan_sample_next_tuple APIs as well?

I am not sure that we need to do that, because generally we want to
avoid fetching a wrong system table tuple that could be used for
decoding a tuple or for taking some decision. But I don't think that
table_scan_sample falls under that category.

Hmm, I am asking for a check similar to what you have in
table_scan_bitmap_next_block(); can't we have that one? BTW, I
noticed the spurious line removal below in the patch we are talking
about.

+/*
* These are updated by GetSnapshotData. We initialize them this way
* for the convenience of TransactionIdIsInProgress: even in bootstrap
* mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2043,7 +2055,6 @@ SetupHistoricSnapshot(Snapshot
historic_snapshot, HTAB *tuplecids)
tuplecid_data = tuplecids;
}

-

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#291Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#290)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 5, 2020 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

5. Shouldn't we add a check in table_scan_sample_next_block and
table_scan_sample_next_tuple APIs as well?

I am not sure that we need to do that, because generally we want to
avoid fetching a wrong system table tuple that could be used for
decoding a tuple or for taking some decision. But I don't think that
table_scan_sample falls under that category.

Hmm, I am asking for a check similar to what you have in
table_scan_bitmap_next_block(); can't we have that one?

Yeah, we can put that in and there is no harm in it, but my point is
that table_scan_bitmap_next_block and the other functions where I have
put the check are used for fetching tuples that can be used for
decoding a tuple or taking some decision, whereas IMHO
table_scan_sample_next_tuple is only used for analyzing the table. So
do we really need to do that? Am I missing something here?

BTW, I noticed the spurious line removal below in the patch we are
talking about.

+/*
* These are updated by GetSnapshotData. We initialize them this way
* for the convenience of TransactionIdIsInProgress: even in bootstrap
* mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2043,7 +2055,6 @@ SetupHistoricSnapshot(Snapshot
historic_snapshot, HTAB *tuplecids)
tuplecid_data = tuplecids;
}

-

Okay, I will take care of this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#292Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#291)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 5, 2020 at 10:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 5, 2020 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

5. Shouldn't we add a check in table_scan_sample_next_block and
table_scan_sample_next_tuple APIs as well?

I am not sure that we need to do that, because generally we want to
avoid fetching a wrong system table tuple that could be used for
decoding a tuple or for taking some decision. But I don't think that
table_scan_sample falls under that category.

Hmm, I am asking a check similar to what you have in function
table_scan_bitmap_next_block(), can't we have that one?

Yeah, we can put that in and there is no harm in it, but my point is
that table_scan_bitmap_next_block and the other functions where I have
put the check are used for fetching tuples that can be used for
decoding a tuple or taking some decision, whereas IMHO
table_scan_sample_next_tuple is only used for analyzing the table.

These will be used in a TABLESAMPLE scan. Try something like "select c1
from t1 TABLESAMPLE BERNOULLI(30);". So I guess these APIs can also
be used to fetch tuples.
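
In other words, the sampling path can also return user tuples, so the
same guard seems warranted there too. Something like this sketch (the
wrapper signature is from tableam.h; the guard mirrors the one added to
the other tableam APIs, and the error text is illustrative):

static inline bool
table_scan_sample_next_tuple(TableScanDesc scan,
                             struct SampleScanState *scanstate,
                             TupleTableSlot *slot)
{
    /*
     * We don't expect direct calls with valid CheckXidAlive for catalog
     * or regular tables.  See detailed comments at snapmgr.c where these
     * variables are declared.
     */
    if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
        elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");

    return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
                                                           slot);
}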

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#293Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#288)
12 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

But can't they access other catalogs like pg_publication*? I think
the basic thing we want to ensure here is that all historic accesses
always use systable* APIs to access catalogs. We can ensure that via
having Asserts (or elog(ERROR, ..)) in heap/tableam APIs.

Yeah, it can. So I have changed it now: along with CheckXidLive, I
have kept one more flag, so whenever CheckXidLive is set and we pass
through systable_beginscan we will set that flag. Then while accessing
the tableam API we check that if CheckXidLive is set the other flag
must also be set; otherwise we throw an error.

Okay, I have reviewed these changes and below are my comments:

Review of  v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
--------------------------------------------------------------------
1.
+ /*
+ * If CheckXidAlive is set then set a flag that this call is passed through
+ * systable_beginscan.  See detailed  comments at snapmgr.c where these
+ * variables are declared.
+ */
+ if (TransactionIdIsValid(CheckXidAlive))
+ sysbegin_called = true;

a. How about calling this variable bsysscan or sysscan instead of
sysbegin_called?

Done

b. There is an extra space between detailed and comments. A similar
change is required at the other places where this comment is used.

Done

c. How about writing the first line as "If CheckXidAlive is set then
set a flag to indicate that system table scan is in-progress."

2.
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables in
+     the output plugins has to be done via the <literal>systable_*</literal> scan
+     APIs only. The user tables should not be accessed in the output plugins anyways.
+     Access via the <literal>heap_*</literal> scan APIs will error out.

The line "The user tables should not be accesed in the output plugins
anyways." seems a bit of out of place. I don't think this is required
here. If you read the previous paragraph in the same document it is
written: "Read only access to relations is permitted as long as only
relations are accessed that either have been created by
<command>initdb</command> in the <literal>pg_catalog</literal> schema,
or have been marked as user provided catalog tables using ...". I
think that is sufficient to convey the information that the newly
added line by you is trying to convey.

Right.

3.
+ /*
+ * We don't expect direct calls to this routine when CheckXidAlive is a
+ * valid transaction id, this should only come through systable_* call.
+ * CheckXidAlive is set during logical decoding of a transactions.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
+ elog(ERROR, "unexpected heap_getnext call during logical decoding");

How about changing this comment to "We don't expect direct calls to
heap_getnext with valid CheckXidAlive for catalog or regular tables.
See detailed comments at snapmgr.c where these variables are
declared."? Change the similar comment used in other places in the
patch.

For this specific API, we can also say "Normally we have such a check
at the tableam API level, but this is called from many places so we need
to ensure it here."

Done

4.
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
+ * out.  We can't directly use TransactionIdDidAbort as after crash such
+ * transaction might not have been marked as aborted.  See detailed  comments
+ * at snapmgr.c where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort()

Can we change the comment to "Error out, if CheckXidAlive is aborted.
We can't directly use TransactionIdDidAbort as after crash such
transaction might not have been marked as aborted."

After this add one empty line and then we can say something like:
"This is a special API to check if CheckXidAlive is aborted in system
table scan APIs. See detailed comments at snapmgr.c where the
variable is declared."

5. Shouldn't we add a check in table_scan_sample_next_block and
table_scan_sample_next_tuple APIs as well?

Done

6.
/*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.  If CheckXidAlive is set
+ * then we will set sysbegin_called flag when we call systable_beginscan.  This
+ * is to ensure that from the pgoutput plugin we should never directly access
+ * the tableam or heap apis because we are checking for the concurrent abort
+ * only in systable_* apis.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool sysbegin_called = false;

Can we change the above comment to "CheckXidAlive is an xid value
pointing to a possibly ongoing (sub)transaction. Currently, it is
used in logical decoding. It's possible that such transactions can
get aborted while the decoding is ongoing in which case we skip
decoding that particular transaction. To ensure that, we check whether
CheckXidAlive has aborted after fetching a tuple from the system
tables. We also ensure that during logical decoding we never directly
access the tableam or heap APIs because we are checking for the
concurrent aborts only in systable_* APIs."

Done

I have also fixed one issue in the patch
v18-0010-Bugfix-handling-of-incomplete-toast-tuple.patch.

Basically, the check in ReorderBufferLargestTopTXN for selecting the
largest top transaction was incorrect, so I have fixed that.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v18-0001-Immediately-WAL-log-assignments.patch (application/octet-stream)
From 95cb7a2f0ee195bb1ce74f66ba413d88b38cdb69 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v18 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is still
required to avoid overflow of the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 39 ++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3984dd3e1a..c2604bb514 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have an XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and the assignment must not have been logged yet */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 4259309dba..3c49954b57 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976474..7b5257fe81 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1189,6 +1189,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1227,6 +1228,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..122c581d0f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel xid is valid, we need to assign the subxact to the
+	 * toplevel xact. We need to do this for all records, hence we do it before
+	 * the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04babc2..8645b3816c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0a12afb59e..3289ad753a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196e18..756f6df8cf 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -150,6 +150,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -285,6 +287,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0
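
To see how the pieces in this patch fit together: a WAL-insert call site can
ask for the toplevel XID to be attached to a subtransaction's first record,
which DecodeXLogRecord() then exposes via XLogRecGetTopXid(). The sketch below
is illustrative only (the actual call sites are not part of this excerpt); it
assumes the IsSubTransactionAssignmentPending()/MarkSubTransactionAssigned()
helpers and the XLOG_INCLUDE_XID flag added above, plus the pre-existing
XLogSetRecordFlags():

#include "postgres.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"

/* Hypothetical helper, for illustration only. */
static void
MaybeIncludeToplevelXid(void)
{
	if (IsSubTransactionAssignmentPending())
	{
		/* attach the toplevel XID to the record we are about to insert */
		XLogSetRecordFlags(XLOG_INCLUDE_XID);

		/* remember that this subxact's assignment has been logged */
		MarkSubTransactionAssigned();
	}
}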

v18-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch (application/octet-stream)
From f26cc42abff8183c9240d6191b56d164cddf93da Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v18 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such an sqlerrcode,
the decoding logic aborts the ongoing decoding and returns
gracefully.
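
On the decoding side, the error-filtering pattern looks roughly like the
sketch below. It is condensed from the ReorderBuffer changes later in this
series; the function name is illustrative and the decoding work itself is
elided:

#include "postgres.h"

static void
DecodeGuarded(void)
{
	MemoryContext ccxt = CurrentMemoryContext;

	PG_TRY();
	{
		/* ... decode changes; catalog lookups go through systable_* ... */
	}
	PG_CATCH();
	{
		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
		ErrorData  *errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/* concurrent abort: stop decoding this xact gracefully */
			FlushErrorState();
			FreeErrorData(errdata);
		}
		else
		{
			MemoryContextSwitchTo(ecxt);
			PG_RE_THROW();
		}
	}
	PG_END_TRY();
}
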
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 65244b1019..909a2139b6 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,9 +432,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0d4ed602d7..8371ec6e81 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * the tableam API level, but heap_getnext is called from many places, so
+	 * we need to check it here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that a system table
+	 * scan is in progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted.  We can't directly use
+ * TransactionIdDidAbort, because after a crash such a transaction might not
+ * have been marked as aborted.  See detailed comments at snapmgr.c where
+ * the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..892d8db7ab 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..ad1d567172 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such a transaction gets aborted while the decoding is still ongoing,
+ * in which case we skip decoding that particular transaction.  To ensure
+ * that, we check whether CheckXidAlive has aborted after fetching a tuple
+ * from system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for such
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 94903dd8de..fe4811e2db 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0
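
As a usage note for the patch above: decoding-time catalog lookups must go
through the systable_* APIs so that the CheckXidAlive/bsysscan machinery can
detect a concurrent abort. A minimal sketch (the function and its sequential
scan of pg_class are illustrative only):

#include "postgres.h"
#include "access/genam.h"
#include "access/htup.h"
#include "access/table.h"
#include "catalog/pg_class.h"

/* illustrative only: scan a catalog via the systable_* APIs */
static void
scan_catalog_safely(void)
{
	Relation	rel = table_open(RelationRelationId, AccessShareLock);
	SysScanDesc scan;
	HeapTuple	tup;

	/* NULL snapshot: systable_beginscan uses a catalog snapshot */
	scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);

	while ((tup = systable_getnext(scan)) != NULL)
	{
		/* each fetch re-checks CheckXidAlive for a concurrent abort */
	}

	systable_endscan(scan);
	table_close(rel, AccessShareLock);
}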

v18-0002-Issue-individual-invalidations-with-wal_level-lo.patch (application/octet-stream)
From 384d8bcfe984df9620310c091fadc535c2f47024 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v18 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of the commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in
memory and writes them only once, at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.
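
To make the new record's layout concrete, replaying the payload of one
XLOG_XACT_INVALIDATIONS record boils down to the sketch below (the function
name is illustrative; in the patch the messages are actually queued as a
REORDER_BUFFER_CHANGE_INVALIDATION change and executed during replay):

#include "postgres.h"
#include "access/xact.h"
#include "storage/sinval.h"
#include "utils/inval.h"

/* illustrative only: execute the messages carried by one record */
static void
ApplyLoggedInvalidations(char *rec_data)
{
	xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec_data;
	int			i;

	/* the messages follow the fixed-size header as a flexible array */
	for (i = 0; i < xlrec->nmsgs; i++)
		LocalExecuteInvalidationMessage(&xlrec->msgs[i]);
}
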
---
 src/backend/access/rmgrdesc/xactdesc.c        |  40 +++++++
 src/backend/access/transam/xact.c             |   7 ++
 src/backend/replication/logical/decode.c      |  16 +++
 .../replication/logical/reorderbuffer.c       | 104 +++++++++++++++---
 src/backend/utils/cache/inval.c               |  49 +++++++++
 src/include/access/xact.h                     |  13 ++-
 src/include/replication/reorderbuffer.h       |  11 ++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index fbc5942578..17c06f7062 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c2604bb514..8e6b1a6ebc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581d0f..69c1f45ef6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9509..b889edf461 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2202,6 +2216,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Queue invalidation messages as a change in the specified transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2589,6 +2630,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3002,6 +3066,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..cba5b6c64b 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end
+ *	to support the decoding of in-progress transactions.  Until now it was
+ *	enough to log invalidations only at commit, because we only decoded the
+ *	transaction at commit time.  We only need to log the catalog cache and
+ *	relcache invalidations; there cannot be any active MVCC scan in logical
+ *	decoding, so we don't need to log the snapshot invalidation.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b3816c..b822c5e4b2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+} xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..af35287896 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			uint32		ninvalidations;		/* number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * messages */
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void		ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+										 int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.23.0

v18-0005-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From 0b378ffc8bf286a4e9788d135c1061eaaa19d539 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 May 2020 19:56:35 +0530
Subject: [PATCH v18 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds a ReorderBufferTXN pointer in two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets aborted).
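
The net effect on the downstream side is a sequence of stream callbacks per
chunk of in-memory changes, roughly as sketched below. This assumes the
stream_* callbacks this patch series adds to ReorderBuffer;
next_change_in_memory and relation_for are placeholders, not real functions,
and error handling plus snapshot bookkeeping are omitted:

#include "postgres.h"
#include "replication/reorderbuffer.h"
#include "utils/relcache.h"

/* placeholders, for illustration only */
extern ReorderBufferChange *next_change_in_memory(ReorderBufferTXN *txn);
extern Relation relation_for(ReorderBufferChange *change);

static void
stream_one_chunk(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	ReorderBufferChange *change;

	/* open a block of streamed changes for this xact */
	rb->stream_start(rb, txn);

	while ((change = next_change_in_memory(txn)) != NULL)
		rb->stream_change(rb, txn, relation_for(change), change);

	/* close the block; more chunks may follow later */
	rb->stream_stop(rb, txn);

	/*
	 * Once the commit record is decoded, the remaining changes are streamed
	 * the same way and then rb->stream_commit(rb, txn, txn->final_lsn) wraps
	 * the transaction up; a concurrent abort instead leads to stream_abort.
	 */
}
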
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 712 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  36 +
 3 files changed, 699 insertions(+), 87 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b889edf461..bc5821b2bf 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -772,6 +785,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -865,6 +910,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -988,7 +1036,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1024,6 +1072,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1089,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1315,6 +1369,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1340,9 +1403,94 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they were originally started inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1491,59 +1636,76 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In that case, if the
+ * (sub)transaction has made catalog changes, we might decode the tuples using
+ * the wrong catalog version.  So, to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction that the change being
+ * decoded belongs to.  During a catalog scan we can then check the status of
+ * that xid, and if it has aborted we report a specific error that we can
+ * ignore.  We might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine: when we decode the abort, we will
+ * stream an abort message to truncate the changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Set up CheckXidAlive if the transaction is not committed yet.  We don't
+	 * check whether the xid aborted; that will happen during catalog access.
+	 * Also reset the bsysscan flag.
 	 */
-	if (txn->base_snapshot == NULL)
+	if (!TransactionIdDidCommit(xid))
 	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send the data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true, the data is sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
 
-	/* build data to be able to lookup the CommandIds of catalog tuples */
+	/*
+	 * build data to be able to look up the CommandIds of catalog tuples
+	 */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
@@ -1564,14 +1726,17 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1579,6 +1744,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1655,7 +1831,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1695,7 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,7 +1992,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1858,14 +2053,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes; call the stream_stop callback for a
+		 * streaming transaction, the commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the LSN of the last change in the stream as final_lsn
+			 * before calling stream_stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+
+			/* Avoid copying if it's already copied. */
+			if (snapshot_now->copied)
+				txn->snapshot_now = snapshot_now;
+			else
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2111,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,17 +2146,130 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not a concurrent abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				FlushErrorState();
+				FreeErrorData(errdata);
+				errdata = NULL;
+
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+
+				/* Avoid copying if it's already copied. */
+				if (snapshot_now->copied)
+					txn->snapshot_now = snapshot_now;
+				else
+					txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+															  txn, command_id);
+				/*
+				 * Set the last LSN of this stream as the final lsn before
+				 * calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+				ReorderBufferToastReset(rb, txn);
+				if (specinsert != NULL)
+				{
+					ReorderBufferReturnChange(rb, specinsert);
+					specinsert = NULL;
+				}
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions must previously have been processed by
+ * ReorderBufferCommitChild(), even if they were already assigned to the
+ * toplevel transaction with ReorderBufferAssignChild().
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1946,6 +2294,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2370,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2512,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2530,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2542,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2398,6 +2787,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming is enabled, so their size is
+ * always 0). But we can simply iterate over the limited number of toplevel
+ * transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
@@ -2418,15 +2839,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3198,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which attempts to
+ * stream it again before the commit)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't
+		 * beat the LSN condition in the previous branch (so no need to
+		 * walk through the subxacts again). In fact, we must not do that,
+		 * as we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e102840486..6d65986a82 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so at most one
+ * of those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -224,6 +243,16 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Have we sent any changes for this transaction to the output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -254,6 +283,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0
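
A quick aside on the eviction logic in 0002, since it is split across
several hunks above: when the memory limit is reached, we either stream
the largest toplevel transaction (if the output plugin supports
streaming) or serialize the largest (sub)transaction to disk. Here is a
condensed, self-contained sketch of that decision - the types and names
are simplified stand-ins of mine, not the actual reorderbuffer structs:

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct Txn
{
	size_t		size;			/* memory used by this txn's changes */
	bool		is_subxact;		/* true for subtransactions */
} Txn;

typedef struct Buffer
{
	size_t		size;			/* total memory used by all txns */
	size_t		limit;			/* logical_decoding_work_mem, in bytes */
	bool		can_stream;		/* plugin supports streaming? */
	Txn		   *txns;
	int			ntxns;
} Buffer;

/* Pick the largest transaction; with streaming only toplevel ones qualify. */
static Txn *
largest_txn(Buffer *rb, bool toplevel_only)
{
	Txn		   *largest = NULL;

	for (int i = 0; i < rb->ntxns; i++)
	{
		if (toplevel_only && rb->txns[i].is_subxact)
			continue;
		if (!largest || rb->txns[i].size > largest->size)
			largest = &rb->txns[i];
	}
	return largest;
}

static void
check_memory_limit(Buffer *rb)
{
	if (rb->size < rb->limit)
		return;

	/* Evict by streaming if supported, otherwise spill to disk. */
	Txn		   *txn = largest_txn(rb, rb->can_stream);

	assert(txn && txn->size > 0 && txn->size <= rb->size);
	printf(rb->can_stream ? "stream txn (%zu bytes)\n"
						  : "serialize txn (%zu bytes)\n", txn->size);

	/* streaming/serializing releases the transaction's memory */
	rb->size -= txn->size;
	txn->size = 0;
}

int
main(void)
{
	Txn			txns[] = {{1024, false}, {4096, false}, {512, true}};
	Buffer		rb = {5632, 4096, true, txns, 3};

	check_memory_limit(&rb);	/* streams the 4096-byte toplevel txn */
	return 0;
}

Note that in the actual patch the subtransaction sizes are rolled up into
the toplevel counters when streaming is enabled, so
ReorderBufferLargestTopTXN never sees non-zero subxact sizes; the
toplevel-only filter above just makes that explicit.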

v18-0003-Extend-the-output-plugin-API-with-stream-methods.patch
From 284b454ee65f9edd9af7a9a30a39e97794adfc6e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v18 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes for large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 +++++
 src/include/replication/reorderbuffer.h   |  57 ++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe620..65244b1019 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are five required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and
+    <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting.  At that point the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253583..497d8a9c36 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the start/stop/change/commit/abort
+	 * callbacks. The message and truncate callbacks are optional, similarly
+	 * to regular output plugins. We however consider streaming enabled when
+	 * at least one of the callbacks is defined, so that missing required
+	 * callbacks can be easily identified.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional, so
+	 * when they are missing the wrappers simply do nothing instead of
+	 * failing with ERROR. We must still set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there would crash (we don't
+	 * want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -862,6 +910,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f1da..f24e2468ac 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287896..e102840486 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -392,6 +438,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0
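
For anyone who wants to experiment with the 0003 API: the smallest plugin
that opts into streaming just fills in the new OutputPluginCallbacks
fields. The sketch below mirrors the test_decoding registration above,
but the names and empty bodies are hypothetical placeholders of mine - a
real plugin would emit its protocol data in each callback:

#include "postgres.h"

#include "fmgr.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);

/* required base callbacks */
static void min_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn) {}
static void min_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
					   XLogRecPtr commit_lsn) {}
static void min_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
					   Relation relation, ReorderBufferChange *change) {}

/* the five required stream callbacks (message/truncate remain optional) */
static void min_stream_start(LogicalDecodingContext *ctx,
							 ReorderBufferTXN *txn) {}
static void min_stream_stop(LogicalDecodingContext *ctx,
							ReorderBufferTXN *txn) {}
static void min_stream_change(LogicalDecodingContext *ctx,
							  ReorderBufferTXN *txn, Relation relation,
							  ReorderBufferChange *change) {}
static void min_stream_commit(LogicalDecodingContext *ctx,
							  ReorderBufferTXN *txn, XLogRecPtr commit_lsn) {}
static void min_stream_abort(LogicalDecodingContext *ctx,
							 ReorderBufferTXN *txn, XLogRecPtr abort_lsn) {}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	cb->begin_cb = min_begin;
	cb->change_cb = min_change;
	cb->commit_cb = min_commit;

	/*
	 * Setting any stream callback makes StartupDecodingContext() treat the
	 * context as streaming-capable; the wrappers then ERROR out if one of
	 * the five required stream callbacks was left unset.
	 */
	cb->stream_start_cb = min_stream_start;
	cb->stream_stop_cb = min_stream_stop;
	cb->stream_change_cb = min_stream_change;
	cb->stream_commit_cb = min_stream_commit;
	cb->stream_abort_cb = min_stream_abort;
}

(stream_message_cb and stream_truncate_cb can be added the same way when
needed.)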

v18-0006-Add-support-for-streaming-to-built-in-replicatio.patch
From 05c62aadd8dbd6aa9bf4c0999a2af1e6866a0fac Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 16 Apr 2020 01:55:22 -0700
Subject: [PATCH v18 06/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We however must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so there is nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   12 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/launcher.c    |    1 -
 src/backend/replication/logical/logical.c     |    4 +-
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1033 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 ++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 22 files changed, 2043 insertions(+), 43 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8beadbaa..95b7c24ef9 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..3349cc4bfc 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 7f156673f7..65b6b76164 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3f8105c6eb..df3f64c7ba 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4133,6 +4133,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
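
To illustrate, with streaming enabled the walreceiver ends up issuing a
START_REPLICATION command roughly like what the sketch below builds. This
is for illustration only (not part of the patch) - "mysub" and "mypub" are
placeholders, and proto_version 2 assumes the streaming protocol version
introduced elsewhere in the patch:

    #include <stdio.h>

    int
    main(void)
    {
        char        cmd[1024];

        /* roughly the command text libpqrcv_startstreaming assembles */
        snprintf(cmd, sizeof(cmd),
                 "START_REPLICATION SLOT \"%s\" LOGICAL %X/%X"
                 " (proto_version '%u', streaming 'on',"
                 " publication_names '\"%s\"')",
                 "mysub", 0, 0, 2u, "mypub");
        puts(cmd);
        return 0;
    }
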
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e987..8156a42ace 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 497d8a9c36..dfc681df43 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1148,7 +1148,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1193,7 +1193,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = txn->final_lsn;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..5242ac0efe 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
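+/*
+ * Write STREAM START to the output stream.
+ */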
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
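+/*
+ * Read STREAM START from the stream.
+ */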
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
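+/*
+ * Write STREAM STOP to the output stream.
+ */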
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
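+/*
+ * Write STREAM COMMIT to the output stream.
+ */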
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (this is a streamed transaction, so it must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
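+/*
+ * Read STREAM COMMIT from the stream.
+ */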
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
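+/*
+ * Write STREAM ABORT to the output stream.
+ */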
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* toplevel XID and subxact XID (both must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
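+/*
+ * Read STREAM ABORT from the stream.
+ */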
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
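
The new stream messages follow the layout of the existing ones: an action
byte followed by fixed-width integers in network byte order. Below is a
minimal standalone decoder for STREAM START, matching
logicalrep_write_stream_start above. It's a sketch for illustration only
(not part of the patch); the helper names are mine, and it assumes the
int32 is sent in network byte order, as pq_sendint32 does:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;

    /* read a big-endian int32, as written by pq_sendint32 */
    static uint32_t
    read_be32(const unsigned char *p)
    {
        return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
               ((uint32_t) p[2] << 8) | (uint32_t) p[3];
    }

    /* msg points at the action byte: 'S', int32 xid, byte first_segment */
    static TransactionId
    decode_stream_start(const unsigned char *msg, bool *first_segment)
    {
        TransactionId xid = read_be32(msg + 1);

        *first_segment = (msg[5] == 1);
        return xid;
    }

    int
    main(void)
    {
        /* 'S', xid 1234 (big-endian), first_segment = 1 */
        const unsigned char msg[] = {'S', 0x00, 0x00, 0x04, 0xD2, 0x01};
        bool        first;

        printf("xid %u, first %d\n", decode_stream_start(msg, &first),
               (int) first);
        return 0;
    }
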
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..026cd48bd0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied all at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions also
+ * requires dealing with aborts, both of the toplevel transaction and of its
+ * subtransactions. This is achieved by tracking the offset of each
+ * subtransaction's first change, which is then used to truncate the file
+ * with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.
+ * This is necessary so that different workers processing a remote transaction
+ * with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
 #include "utils/datum.h"
+#include "utils/dynahash.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing a streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of XIDs of toplevel transactions with changes serialized to files,
+ * so that we can clean them up at worker exit.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the change to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +657,319 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the subxact info serialized
+	 * at the end of the previous chunk.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive an abort
+		 * for a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting changes for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+			return;
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the handle apply_dispatch methods are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update origin state so we can restart streaming from correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +983,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1001,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1040,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1158,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1303,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1676,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1817,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1929,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleaning up files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1961,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2412,567 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include a CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we free the memory allocated for the subxact info. There might be
+	 * one exceptional transaction with many subxacts, and we don't want to
+	 * keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so we can simply ignore it (this change only comes
+	 * later in the file anyway).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry in the array to the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length field itself), action code (identifying the message type) and
+ * message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3138,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
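
To make the spool-file format concrete, here is a minimal standalone reader
for the per-transaction changes file, following the record layout written
by stream_write_change and replayed by apply_handle_stream_commit. It's a
sketch for illustration only (walk_changes_file is my own name, not part
of the patch), with error handling mostly omitted:

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Each record is an int length (action byte + payload, not counting
     * the length field itself), then the action character, then the
     * message body with the subxact XID already stripped.
     */
    static void
    walk_changes_file(const char *path)
    {
        FILE       *f = fopen(path, "rb");
        int         len;

        if (f == NULL)
            return;

        while (fread(&len, sizeof(len), 1, f) == 1)
        {
            char       *buf = malloc(len);

            if (buf == NULL || fread(buf, 1, len, f) != (size_t) len)
            {
                free(buf);
                break;          /* truncated record (or out of memory) */
            }

            /* buf[0] is the action ('I', 'U', 'D', 'T', 'R' or 'Y') */
            printf("action %c, payload %d bytes\n", buf[0], len - 1);
            free(buf);
        }

        fclose(f);
    }

    int
    main(int argc, char **argv)
    {
        if (argc > 1)
            walk_changes_file(argv[1]);
        return 0;
    }
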
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 77b85fc655..811706a34f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,54 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may be different
+ * from the order the transactions are sent in. So streamed transactions
+ * are tracked separately, using the streamed_txns list of XIDs in each
+ * entry.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +119,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +145,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +208,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +237,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +260,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +281,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with a sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +370,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * if it's a top-level transaction or not (we have already sent that XID
+	 * at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may not be applied until later (and the
+	 * regular transactions won't see their effects until then), and may be
+	 * applied in an order that we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send schema after each catalog change and it may
+		 * occur when streaming already started, so we have to track new catalog
+		 * changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +427,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +469,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +490,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +522,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +542,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +566,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +586,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +611,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +643,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +723,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're now streaming a chunk of the transaction */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +844,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions, so a
+ * simple linear search of the list is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -750,11 +999,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
+/*
+ * Clean up the lists of streamed transactions and update the schema_sent
+ * flags.
+ *
+ * When a streamed transaction commits or aborts, we remove the toplevel
+ * XID from the streamed_txns list of each cache entry. If the transaction
+ * aborted, the subscriber will simply throw away the schema records we
+ * streamed, so we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will have updated its
+ * relation cache using the schema we streamed, so mark schema_sent
+ * accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -790,7 +1072,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index ae751e94e7..c02c1b620f 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -155,6 +155,12 @@ create_logical_replication_slot(char *name, char *plugin,
 									read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 8b55bbfcb2..65bb628438 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1009,6 +1009,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..3b3e1fde6f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -981,7 +981,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f1aa6e9977..70d39f880d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
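
For reference, with the new WalRcvStreamOptions field the apply worker can
ask for streaming at connect time. Assuming the option is passed through to
the walsender simply as a "streaming" option (the libpqwalreceiver part of
the patch is not shown here), the START_REPLICATION command would look
roughly like this:

    START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
        (proto_version '2', publication_names '"tap_pub"', streaming 'on')
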
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check only committed subtransaction changes were replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back DDL and DML are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

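To make the message flow concrete: with the logicalrep_write_* functions
above, a large transaction goes over the wire as one or more streamed
blocks, each demarcated by a start/stop pair, followed by a commit or
abort. Schematically (changes inside a block carry the xid of the
(sub)transaction they belong to):

    stream_start(xid, first_segment = true)
        relation/type messages and insert/update/delete changes
    stream_stop()
    stream_start(xid, first_segment = false)
        ... more changes ...
    stream_stop()
    stream_commit(xid, commit_lsn)        -- or stream_abort(xid, subxid)
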
v18-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
From fbf3b7f90e7b8587ddb62d317c2acf130432ea1e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v18 09/12] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

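Note that this test is the first one to enable streaming explicitly through
the new subscription option. Doing the same by hand looks like this
(connection string abbreviated):

    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=... dbname=postgres'
        PUBLICATION tap_pub
        WITH (streaming = true);

    -- the new pg_subscription.substream column records the setting
    SELECT subname, substream FROM pg_subscription;
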
v18-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
From 60ad2568488194c1e9aa9a6b5f3524a46bfb9141 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 28 Feb 2020 11:07:46 +0530
Subject: [PATCH v18 10/12] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/executor/execReplication.c        |   2 -
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 186 +++++++++++-------
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  24 ++-
 6 files changed, 151 insertions(+), 82 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8371ec6e81..8028820d0e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 1418746eb8..b8461966f9 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -57,8 +57,6 @@ build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
 	int2vector *indkey = &idxrel->rd_index->indkey;
 	bool		hasnulls = false;
 
-	Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
-
 	indclassDatum = SysCacheGetAttr(INDEXRELID, idxrel->rd_indextuple,
 									Anum_pg_index_indclass, &isnull);
 	Assert(!isnull);
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45ef6..c841687c66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d80ad04363..e7e1aece80 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -178,6 +178,11 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define ChangeIsInsertOrUpdate(action) \
+			(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+			((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+			((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -654,11 +659,14 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
-	ReorderBufferTXN *txn;
+	ReorderBufferTXN *txn, *toptxn;
+	bool	can_stream = false;
 
-	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	toptxn = txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
 
 	change->lsn = lsn;
 	change->txn = txn;
@@ -668,9 +676,49 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert then set the corresponding bit. Otherwise,
+	 * if the toast insert bit is already set and this is an insert/update,
+	 * then clear the bit.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) &&
+			ChangeIsInsertOrUpdate(change->action))
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+		can_stream = true;
+	}
+
+	/*
+	 * If this is a speculative insert then set the corresponding bit.
+	 * Otherwise, if we have speculative insert bit set and this is spec
+	 * confirm record then clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+		can_stream = true;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/*
+	 * If streaming is enabled, and we had to serialize this transaction
+	 * because it contained an incomplete tuple, then stream it now that the
+	 * tuple is complete.
+	 */
+	if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+		&& !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
+	{
+		ReorderBufferStreamTXN(rb, toptxn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
+	}
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -700,7 +748,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1865,8 +1913,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
+							ReorderBufferToastAppendChunk(rb, txn, relation,
+														  change);
 					}
 
 			change_done:
@@ -2463,7 +2511,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2512,7 +2560,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2535,6 +2583,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2549,8 +2598,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2558,12 +2612,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2624,7 +2686,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2811,15 +2873,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and doesn't have incomplete
+		 * data, remember it.
+		 */
+		if (((!largest) || (txn->total_size > largest->total_size)) &&
+			((txn->total_size > 0) && !(rbtxn_has_toast_insert(txn)) &&
+			 !(rbtxn_has_spec_insert(txn))))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2837,66 +2900,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/* Loop until we reach under the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
-		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
-		 */
-		txn = ReorderBufferLargestTXN(rb);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferSerializeTXN(rb, txn);
+		/*
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
+		 */
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 603f325663..ba2ab7185c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* Does the transaction have a toast insert without the main table insert? */
+#define rbtxn_has_toast_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)
+
+/*
+ * Does the transaction have a speculative insert without the corresponding
+ * speculative confirm record?
+ */
+#define rbtxn_has_spec_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -355,6 +364,9 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -545,7 +557,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0

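To illustrate the problem this patch fixes: a row with a large toasted
value is decoded as a series of toast-chunk inserts followed by the insert
into the main table. With a small logical_decoding_work_mem the memory
limit can be reached while the tuple is still incomplete, and streaming at
that point would send an unusable half-tuple downstream. A sketch of a
statement that triggers this (table and sizes are only examples):

    -- roughly 256kB of not-very-compressible data in a single row; with
    -- logical_decoding_work_mem = 64kB the toast chunks alone exceed the
    -- limit before the main-table insert is decoded
    INSERT INTO test_tab
    SELECT 10000, (SELECT string_agg(md5(i::text), '')
                   FROM generate_series(1, 8000) i);

The new RBTXN_HAS_TOAST_INSERT / RBTXN_HAS_SPEC_INSERT flags keep such a
transaction from being picked for streaming until the tuple is complete.
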
v18-0007-Track-statistics-for-streaming.patch
From 04070297f7f3534707b97458ca238df6a6f47629 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 2 Apr 2020 13:19:29 +0530
Subject: [PATCH v18 07/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 25 +++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 13 ++++++++
 src/backend/replication/walsender.c           | 32 ++++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 252300db14..fb9c8e59af 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2063,6 +2063,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber
+      after the memory used by logical decoding exceeds
+      <literal>logical_decoding_work_mem</literal>. Streaming only works
+      with toplevel transactions (subtransactions can't be streamed
+      independently), so the counter is not incremented for subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the
+      subscriber. A transaction may get streamed repeatedly, and this
+      counter is incremented on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the
+      subscriber, in bytes.</entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
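
With these columns in place, checking whether (and how much) a walsender
has streamed becomes a simple query, e.g.:

    SELECT application_name, stream_txns, stream_count,
           pg_size_pretty(stream_bytes) AS stream_bytes
      FROM pg_stat_replication;
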
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2bd5f5ea14..8f34ce8deb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bc5821b2bf..d80ad04363 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3289,6 +3293,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Count the transaction only the first time it is streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 65bb628438..058ee2b8b2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1345,7 +1345,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1366,7 +1366,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or were
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2415,6 +2416,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3256,7 +3260,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3314,6 +3318,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3339,6 +3346,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3441,6 +3451,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stats for streaming of over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3689,11 +3704,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4bce3ad8de..9fb1ffe2c8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6d65986a82..603f325663 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -517,15 +517,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 8876025aaa..0c4952a1fa 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0
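
To eyeball the new counters, a quick query against pg_stat_replication on
the publisher while a walsender is working on a large transaction should be
enough (just a sketch, using the columns added above):

    SELECT application_name,
           spill_txns, spill_count, spill_bytes,
           stream_txns, stream_count, stream_bytes
      FROM pg_stat_replication;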

v18-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
From 26ed40559494a9b613b43784e31c634ce029a29f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v18 08/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71ca3..086d0c7f02 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e7fe..ad3ed13ffc 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7332..0c9c6b3dd4 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0
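
In SQL terms the change above is mechanical: every test subscription is now
created with the new option turned on, along these lines (names and
connection string as used in the tests):

    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=... dbname=postgres'
        PUBLICATION tap_pub
        WITH (streaming = on);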

v18-0011-Provide-new-api-to-get-the-streaming-changes.patch
From ad56cfdda7c66e8a12c9c3f912f31e79ef5ecccf Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v18 11/12] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 4 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8f34ce8deb..dd488cb2f8 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1242,6 +1242,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index fded8e8290..debb91b457 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -250,6 +251,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes then disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -360,7 +364,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the changes as text, with in-progress transactions
+ * streamed as well, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -369,7 +382,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -378,7 +391,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -387,7 +400,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9fb1ffe2c8..3dfc5c10fc 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10117,6 +10117,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0
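
The new function takes the same arguments as pg_logical_slot_get_changes,
so it can be exercised directly from SQL. A sketch, assuming test_decoding
has grown the stream callbacks from the earlier patches in this series,
logical_work_mem is set low enough to trigger streaming, and the slot and
table names are made up:

    SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');

    -- session 1: leave a large transaction open
    BEGIN;
    INSERT INTO stream_test SELECT repeat('x', 100) FROM generate_series(1, 100000);

    -- session 2: consume the streamed, not-yet-committed changes
    SELECT lsn, xid, data
      FROM pg_logical_slot_get_streaming_changes('regression_slot', NULL, NULL);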

v18-0012-Add-streaming-option-in-pg_dump.patch
From 7342b64082907b18fd6f3ddb6b5a4290733183be Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v18 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 5db4f5761d..11db7b79d7 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4210,6 +4210,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4244,8 +4245,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4258,6 +4259,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4274,6 +4276,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4351,6 +4354,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBufferStr(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 61c909e06d..5c5b072a99 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char	   *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0
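
With this change a subscription that has streaming enabled round-trips
through pg_dump; the emitted command should look roughly like the following
(a sketch, assuming the usual connect = false handling for dumped
subscriptions; the names are illustrative):

    CREATE SUBSCRIPTION sub1
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION pub1
        WITH (connect = false, slot_name = 'sub1', streaming = on);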

#294Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#293)
12 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 5, 2020 at 4:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

But can't they access other catalogs like pg_publication*? I think
the basic thing we want to ensure here is that all historic accesses
always use systable* APIs to access catalogs. We can ensure that by
having Asserts (or elog(ERROR, ...)) in heap/tableam APIs.

Yeah, it can. So I have changed it now: along with CheckXidLive, I
have kept one more flag, so whenever CheckXidLive is set and we pass
through systable_beginscan we set that flag. Then, while accessing
the tableam APIs, we check that if CheckXidLive is set the other flag
is also set, and otherwise we throw an error.

Okay, I have reviewed these changes and below are my comments:

Review of  v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
--------------------------------------------------------------------
1.
+ /*
+ * If CheckXidAlive is set then set a flag that this call is passed through
+ * systable_beginscan.  See detailed  comments at snapmgr.c where these
+ * variables are declared.
+ */
+ if (TransactionIdIsValid(CheckXidAlive))
+ sysbegin_called = true;

a. How about calling this variable bsysscan or sysscan instead of
sysbegin_called?

Done

b. There is an extra space between detailed and comments. A similar
change is required at other place where this comment is used.

Done

c. How about writing the first line as "If CheckXidAlive is set then
set a flag to indicate that system table scan is in-progress."

2.
-     Any actions leading to transaction ID assignment are prohibited. That,
-     among others, includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. The user tables should not be accessed in the output
+     plugins anyways. Access via the <literal>heap_*</literal> scan APIs will
+     error out.

The line "The user tables should not be accesed in the output plugins
anyways." seems a bit of out of place. I don't think this is required
here. If you read the previous paragraph in the same document it is
written: "Read only access to relations is permitted as long as only
relations are accessed that either have been created by
<command>initdb</command> in the <literal>pg_catalog</literal> schema,
or have been marked as user provided catalog tables using ...". I
think that is sufficient to convey the information that the newly
added line by you is trying to convey.

Right.

3.
+ /*
+ * We don't expect direct calls to this routine when CheckXidAlive is a
+ * valid transaction id; this should only come through a systable_* call.
+ * CheckXidAlive is set during logical decoding of a transaction.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
+ elog(ERROR, "unexpected heap_getnext call during logical decoding");

How about changing this comment to "We don't expect direct calls to
heap_getnext with valid CheckXidAlive for catalog or regular tables.
See detailed comments at snapmgr.c where these variables are
declared."? Change the similar comment used in other places in the
patch.

For this specific API, we can also say "Normally we have such a check
at the tableam API level, but this function is called from many places,
so we need to ensure it here."

Done

4.
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
+ * out.  We can't directly use TransactionIdDidAbort as after crash such
+ * transaction might not have been marked as aborted.  See detailed  comments
+ * at snapmgr.c where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort()

Can we change the comment to "Error out, if CheckXidAlive is aborted.
We can't directly use TransactionIdDidAbort as after crash such
transaction might not have been marked as aborted."

After this add one empty line and then we can say something like:
"This is a special API to check if CheckXidAlive is aborted in system
table scan APIs. See detailed comments at snapmgr.c where the
variable is declared."

5. Shouldn't we add a check in table_scan_sample_next_block and
table_scan_sample_next_tuple APIs as well?

Done

6.
/*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.  If CheckXidAlive is set
+ * then we will set sysbegin_called flag when we call systable_beginscan.  This
+ * is to ensure that from the pgoutput plugin we should never directly access
+ * the tableam or heap apis because we are checking for the concurrent abort
+ * only in systable_* apis.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool sysbegin_called = false;

Can we change the above comment to "CheckXidAlive is an xid value
pointing to a possibly ongoing (sub)transaction. Currently, it is
used in logical decoding. It's possible that such transactions can
get aborted while the decoding is ongoing in which case we skip
decoding that particular transaction. To ensure that we check whether
the CheckXidAlive is aborted after fetching the tuple from system
tables. We also ensure that during logical decoding we never directly
access the tableam or heap APIs because we are checking for the
concurrent aborts only in systable_* APIs."

Done
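
To make the failure mode concrete: the concurrent abort we are guarding
against arises along these lines (a sketch; the table name is made up):

    -- session 1: an in-progress transaction with a catalog change, large
    -- enough to get streamed
    BEGIN;
    CREATE TABLE t_abort (a int);
    INSERT INTO t_abort SELECT generate_series(1, 100000);

    -- session 2 starts streaming the transaction, doing catalog lookups
    -- under CheckXidAlive

    -- session 1 rolls back while session 2 is still decoding
    ROLLBACK;

    -- the systable_* scans in session 2 detect the aborted xid and raise
    -- ERRCODE_TRANSACTION_ROLLBACK, which the streaming code catches and
    -- treats as "stop decoding this transaction"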

I have also fixed one issue in the patch
v18-0010-Bugfix-handling-of-incomplete-toast-tuple.patch.

Basically, the check in ReorderBufferLargestTopTXN for selecting the
largest top transaction was incorrect, so I have fixed that.

There was also one unrelated bug fix in the v18-0010 patch, reported by
Neha Sharma off-list, so I am sending the updated version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v19-0005-Implement-streaming-mode-in-ReorderBuffer.patch
From 0b378ffc8bf286a4e9788d135c1061eaaa19d539 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 May 2020 19:56:35 +0530
Subject: [PATCH v19 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 712 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  36 +
 3 files changed, 699 insertions(+), 87 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b889edf461..bc5821b2bf 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -772,6 +785,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -865,6 +910,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -988,7 +1036,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1024,6 +1072,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1089,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1315,6 +1369,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1340,9 +1403,94 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1491,59 +1636,76 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has made catalog updates, we might decode a tuple using
+ * the wrong catalog version.  So, to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction that the current change
+ * belongs to.  During catalog scans we then check the status of this xid, and
+ * if it is aborted we report a specific error that we can ignore.  We
+ * might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the abort we will
+ * stream the abort message to truncate the changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Setup CheckXidAlive if the transaction is not committed yet. We don't
+	 * check if the xid is aborted; that will happen during catalog access.
+	 * Also reset the bsysscan flag.
 	 */
-	if (txn->base_snapshot == NULL)
+	if (!TransactionIdDidCommit(xid))
 	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin.  If streaming is true then the data will be sent using the
+ * stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
 
-	/* build data to be able to lookup the CommandIds of catalog tuples */
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
@@ -1564,14 +1726,17 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction("stream");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1579,6 +1744,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1655,7 +1831,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1695,7 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+									change->data.msg.prefix,
+									change->data.msg.message_size,
+									change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,7 +1992,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1858,14 +2053,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the LSN of the last change in this stream as the final_lsn
+			 * before calling stream_stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+
+			/* Avoid copying if it's already copied. */
+			if (snapshot_now->copied)
+				txn->snapshot_now = snapshot_now;
+			else
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2111,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,17 +2146,130 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				FlushErrorState();
+				FreeErrorData(errdata);
+				errdata = NULL;
+
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+
+				/* Avoid copying if it's already copied. */
+				if (snapshot_now->copied)
+					txn->snapshot_now = snapshot_now;
+				else
+					txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+															  txn, command_id);
+				/*
+				 * Set the LSN of the last change as the final_lsn before
+				 * calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+				ReorderBufferToastReset(rb, txn);
+				if (specinsert != NULL)
+				{
+					ReorderBufferReturnChange(rb, specinsert);
+					specinsert = NULL;
+				}
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1946,6 +2294,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2370,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2512,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we update the toplevel transaction counters
+ * instead - we can't stream subtransactions individually anyway, and we
+ * only pick toplevel transactions for eviction, so only the toplevel
+ * counters matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2530,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2542,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2592,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2398,6 +2787,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (with streaming we don't update the
+ * memory accounting of subtransactions, so their size is always 0). Here we
+ * can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
@@ -2418,15 +2839,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3198,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (it might have been streamed just before the commit, and the commit
+ * would then attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run, and we
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because we might have
+		 * gotten new subtransactions after the last streaming run, and we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e102840486..6d65986a82 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -224,6 +243,16 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Have we sent any changes for this transaction to the output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -254,6 +283,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0

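To make the eviction logic above easier to follow, here is a simplified, standalone C model of the accounting and victim selection (the names TxnModel, account_change and pick_txn_to_evict are invented for this sketch and are not part of the patch): with a streaming-capable plugin, change sizes are credited to the toplevel transaction and only toplevel transactions are candidates for eviction; otherwise any (sub)transaction may be picked and serialized to disk.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct TxnModel
{
	size_t		size;		/* accounted bytes of decoded changes */
	bool		is_subxact;	/* true for subtransactions */
	struct TxnModel *toptxn;	/* toplevel txn, NULL if toplevel */
} TxnModel;

/* account a change, crediting the toplevel txn when streaming is possible */
static void
account_change(TxnModel *txn, size_t sz, bool can_stream, size_t *rb_size)
{
	if (can_stream && txn->toptxn != NULL)
		txn = txn->toptxn;	/* subxacts can't be streamed individually */
	txn->size += sz;
	*rb_size += sz;
}

/* pick the largest eviction candidate: toplevel-only when streaming */
static TxnModel *
pick_txn_to_evict(TxnModel *txns, int ntxns, bool can_stream)
{
	TxnModel   *largest = NULL;

	for (int i = 0; i < ntxns; i++)
	{
		if (can_stream && txns[i].is_subxact)
			continue;	/* only toplevel txns are streamed */
		if (largest == NULL || txns[i].size > largest->size)
			largest = &txns[i];
	}
	return largest;
}

int
main(void)
{
	size_t		rb_size = 0;
	TxnModel	txns[2] = {{0, false, NULL}, {0, true, &txns[0]}};

	account_change(&txns[1], 100, true, &rb_size);	/* credited to toplevel */
	account_change(&txns[0], 50, true, &rb_size);

	TxnModel   *victim = pick_txn_to_evict(txns, 2, true);

	printf("evict txn with %zu bytes (rb total %zu)\n",
		   victim->size, rb_size);
	return 0;
}

Running this picks the toplevel transaction with 150 accounted bytes, even though one change was queued in its subtransaction - the same effect the patch achieves by redirecting the counters in ReorderBufferChangeMemoryUpdate.
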
Attachment: v19-0001-Immediately-WAL-log-assignments.patch (application/octet-stream)
From 95cb7a2f0ee195bb1ce74f66ba413d88b38cdb69 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v19 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features that require
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as it is still
required to avoid overflow in the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 39 ++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3984dd3e1a..c2604bb514 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and the assignment must not have been logged yet */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 4259309dba..3c49954b57 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976474..7b5257fe81 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1189,6 +1189,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1227,6 +1228,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..122c581d0f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel xid is valid, we need to assign the subxact to the
+	 * toplevel xact. We need to do this for all records, hence we do it before
+	 * the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04babc2..8645b3816c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0a12afb59e..3289ad753a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196e18..756f6df8cf 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -150,6 +150,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -285,6 +287,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0

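As a rough, standalone illustration of the decoding-side effect of this patch (process_record and assigned_top are invented names for this sketch, not PostgreSQL APIs): any WAL record of a subtransaction may now carry the toplevel xid, and the subxact-to-toplevel association is recorded the first time such a record is seen, instead of waiting for a commit-time XLOG_XACT_ASSIGNMENT record.

#include <stdio.h>

typedef unsigned int Xid;

/* toy assignment table: index = subxid, value = toplevel xid (0 = unknown) */
static Xid	assigned_top[1024];

/*
 * Model of the new behaviour: every record may carry the toplevel xid of
 * its (sub)transaction, and we record the association as soon as we see
 * it, so incremental decoding never has to wait for commit.
 */
static void
process_record(Xid record_xid, Xid toplevel_xid)
{
	if (toplevel_xid != 0 && assigned_top[record_xid] == 0)
	{
		assigned_top[record_xid] = toplevel_xid;
		printf("assigned subxact %u to toplevel %u\n",
			   record_xid, toplevel_xid);
	}
	/* ... decode the record itself ... */
}

int
main(void)
{
	/* first record of subxact 11 inside toplevel 10 carries the topxid */
	process_record(11, 10);
	/* subsequent records of subxact 11 need not repeat it */
	process_record(11, 0);
	return 0;
}
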
Attachment: v19-0002-Issue-individual-invalidations-with-wal_level-lo.patch (application/octet-stream)
From 384d8bcfe984df9620310c091fadc535c2f47024 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v19 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in
memory and writes them out only once, at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c        |  40 +++++++
 src/backend/access/transam/xact.c             |   7 ++
 src/backend/replication/logical/decode.c      |  16 +++
 .../replication/logical/reorderbuffer.c       | 104 +++++++++++++++---
 src/backend/utils/cache/inval.c               |  49 +++++++++
 src/include/access/xact.h                     |  13 ++-
 src/include/replication/reorderbuffer.h       |  11 ++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index fbc5942578..17c06f7062 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c2604bb514..8e6b1a6ebc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581d0f..69c1f45ef6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9509..b889edf461 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2202,6 +2216,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Queue the invalidation messages as a change in the given transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2589,6 +2630,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3002,6 +3066,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..cba5b6c64b 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of in-progress transactions.  Until now it was
+ *	enough to log invalidations only at commit, because we only decoded the
+ *	transaction at commit time.  We only need to log the catalog cache and
+ *	relcache invalidations; there cannot be any active MVCC scan in logical
+ *	decoding, so we don't need to log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs  = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b3816c..b822c5e4b2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..af35287896 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.23.0

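A simplified standalone sketch of what command-level invalidation logging enables during replay (all names here are invented for the example): invalidation messages become ordinary changes interleaved in the change stream, so they can be executed at exactly the point in the transaction where the catalog change happened, rather than being known only at commit.

#include <stdio.h>

/* toy change stream mixing data changes and invalidation changes */
typedef enum
{
	CHANGE_INSERT,
	CHANGE_INVALIDATION
} ChangeKind;

typedef struct
{
	ChangeKind	kind;
	int			payload;	/* row id, or invalidation message id */
} Change;

static void
execute_invalidation(int msg_id)
{
	printf("invalidate cache entry %d\n", msg_id);
}

static void
replay(const Change *changes, int n)
{
	for (int i = 0; i < n; i++)
	{
		if (changes[i].kind == CHANGE_INVALIDATION)
			execute_invalidation(changes[i].payload);	/* keep caches sane */
		else
			printf("apply insert of row %d\n", changes[i].payload);
	}
}

int
main(void)
{
	/* DDL between the two inserts queued an invalidation at command end */
	Change		stream[] = {
		{CHANGE_INSERT, 1},
		{CHANGE_INVALIDATION, 42},
		{CHANGE_INSERT, 2},
	};

	replay(stream, 3);
	return 0;
}
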
Attachment: v19-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch (application/octet-stream)
From f26cc42abff8183c9240d6191b56d164cddf93da Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v19 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such an sqlerrcode,
the decoding logic aborts the ongoing decoding and returns
gracefully.
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 65244b1019..909a2139b6 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,9 +432,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0d4ed602d7..8371ec6e81 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * tableam level API but this is called from many places so we need to
+	 * ensure it here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has been aborted. We can't directly use
+ * TransactionIdDidAbort, as after a crash such a transaction might not be
+ * marked as aborted.  See detailed comments at snapmgr.c where the variable
+ * is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..892d8db7ab 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive  for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..ad1d567172 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such a transaction gets aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To detect that,
+ * we check whether CheckXidAlive has aborted after fetching a tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for concurrent
+ * aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 94903dd8de..fe4811e2db 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive  for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive  for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive  for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0
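
For illustration, here is a minimal sketch (not part of the patch) of the
concurrent-abort check that the tableam guards above rely on: the systable_*
layer sets bsysscan around catalog scans and, after each fetched tuple,
verifies that CheckXidAlive has not aborted in the meantime. The helper name
and the error wording below are assumptions for illustration only.

#include "postgres.h"
#include "access/transam.h"		/* TransactionIdDidCommit */
#include "storage/procarray.h"	/* TransactionIdIsInProgress */
#include "utils/snapmgr.h"		/* CheckXidAlive */

/*
 * Hypothetical helper: error out if the transaction being decoded has
 * aborted concurrently, since catalog data read on its behalf may then
 * be inconsistent.
 */
static inline void
CheckForConcurrentAbort(void)
{
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}

A systable_* fetch routine would call such a helper right after reading each
tuple, which is why direct tableam calls (which lack the check) must not
happen while CheckXidAlive is set.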

v19-0003-Extend-the-output-plugin-API-with-stream-methods.patch
From 284b454ee65f9edd9af7a9a30a39e97794adfc6e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v19 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes for large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 +++++
 src/include/replication/reorderbuffer.h   |  57 ++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe620..65244b1019 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are five required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and streamed.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253583..497d8a9c36 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the change/commit/abort/start/stop
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. However, we enable streaming when at least one
+	 * of the methods is defined, so that missing required methods can be
+	 * easily identified and reported.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -862,6 +910,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = txn->first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f1da..f24e2468ac 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287896..e102840486 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -392,6 +438,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0
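
To make the registration pattern concrete, here is a compressed, hypothetical
plugin skeleton distilled from the test_decoding changes above (the my_* names
are placeholders and the callback bodies are elided; only the registration in
_PG_output_plugin_init matters here):

#include "postgres.h"
#include "fmgr.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);

static void my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn) {}
static void my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn) {}
static void my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							 Relation relation, ReorderBufferChange *change) {}
static void my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							XLogRecPtr abort_lsn) {}
static void my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							 XLogRecPtr commit_lsn) {}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* ... regular callbacks (begin_cb, change_cb, commit_cb, ...) ... */

	/*
	 * Registering any stream_* callback enables streaming; the five below
	 * are then required, while stream_message_cb and stream_truncate_cb
	 * remain optional.
	 */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_change_cb = my_stream_change;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
}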

v19-0007-Track-statistics-for-streaming.patch
From 04070297f7f3534707b97458ca238df6a6f47629 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 2 Apr 2020 13:19:29 +0530
Subject: [PATCH v19 07/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 25 +++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 13 ++++++++
 src/backend/replication/walsender.c           | 32 ++++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 252300db14..fb9c8e59af 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2063,6 +2063,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber after
+      the memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+      Streaming only works with toplevel transactions (subtransactions can't
+      be streamed independently), so the counter does not get incremented for
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the subscriber.
+      Transactions may get streamed repeatedly, and this counter gets incremented
+      on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2bd5f5ea14..8f34ce8deb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bc5821b2bf..d80ad04363 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3289,6 +3293,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't double-count a transaction that was already streamed before. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 65bb628438..058ee2b8b2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1345,7 +1345,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1366,7 +1366,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or streamed to
+	 * subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2415,6 +2416,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3256,7 +3260,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3314,6 +3318,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3339,6 +3346,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3441,6 +3451,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3689,11 +3704,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4bce3ad8de..9fb1ffe2c8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6d65986a82..603f325663 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -517,15 +517,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of toplevel transactions streamed */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 8876025aaa..0c4952a1fa 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0
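
With this in place, the effect of streaming can be observed on the publisher
by polling pg_stat_replication while a large transaction is being decoded; a
simple query (column names as added by this patch) might look like:

-- Observe spill vs. stream activity per walsender (run on the publisher).
SELECT application_name,
       spill_txns,  spill_count,  pg_size_pretty(spill_bytes)  AS spilled,
       stream_txns, stream_count, pg_size_pretty(stream_bytes) AS streamed
FROM pg_stat_replication;

A transaction larger than logical_decoding_work_mem should bump the stream_*
columns on a subscription created WITH (streaming = on), and the spill_*
columns otherwise.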

v19-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
From 26ed40559494a9b613b43784e31c634ce029a29f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v19 08/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71ca3..086d0c7f02 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e7fe..ad3ed13ffc 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7332..0c9c6b3dd4 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

Attachment: v19-0009-Add-TAP-test-for-streaming-vs.-DDL.patch (application/octet-stream)
From fbf3b7f90e7b8587ddb62d317c2acf130432ea1e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v19 09/12] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

Attachment: v19-0010-Bugfix-handling-of-incomplete-toast-tuple.patch (application/octet-stream)
From 802033af0b7bad540fafec19cb74278624321703 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 28 Feb 2020 11:07:46 +0530
Subject: [PATCH v19 10/12] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 186 +++++++++++-------
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  24 ++-
 5 files changed, 151 insertions(+), 80 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8371ec6e81..8028820d0e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45ef6..c841687c66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d80ad04363..e7e1aece80 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -178,6 +178,11 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define ChangeIsInsertOrUpdate(action) \
+			(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+			((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+			((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -654,11 +659,14 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
-	ReorderBufferTXN *txn;
+	ReorderBufferTXN *txn, *toptxn;
+	bool	can_stream = false;
 
-	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	toptxn = txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
 
 	change->lsn = lsn;
 	change->txn = txn;
@@ -668,9 +676,49 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert, set the corresponding bit.  Otherwise, if
+	 * the toast insert bit is set and this is an insert/update, clear the
+	 * bit.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) &&
+			ChangeIsInsertOrUpdate(change->action))
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+		can_stream = true;
+	}
+
+	/*
+	 * If this is a speculative insert, set the corresponding bit.  Otherwise,
+	 * if the speculative insert bit is set and this is a spec confirm record,
+	 * clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+		can_stream = true;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/*
+	 * If streaming is enabled and we previously serialized this transaction
+	 * because it had an incomplete tuple, then now that we have received the
+	 * complete tuple we can stream it.
+	 */
+	if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+		&& !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
+	{
+		ReorderBufferStreamTXN(rb, toptxn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
+	}
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -700,7 +748,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1865,8 +1913,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
+						ReorderBufferToastAppendChunk(rb, txn, relation,
+													  change);
 					}
 
 			change_done:
@@ -2463,7 +2511,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2512,7 +2560,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2535,6 +2583,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2549,8 +2598,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2558,12 +2612,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2624,7 +2686,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2811,15 +2873,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and doesn't have incomplete
+		 * data, remember it.
+		 */
+		if (((!largest) || (txn->total_size > largest->total_size)) &&
+			((txn->total_size > 0) && !(rbtxn_has_toast_insert(txn)) &&
+			!(rbtxn_has_spec_insert(txn))))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2837,66 +2900,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/* Loop until we are below the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
-		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
-		 */
-		txn = ReorderBufferLargestTXN(rb);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferSerializeTXN(rb, txn);
+		/*
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
+		 */
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 603f325663..ba2ab7185c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* Do this transaction's changes have a toast insert without the main-table insert? */
+#define rbtxn_has_toast_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)
+
+/*
+ * Do this transaction's changes have a speculative insert without the
+ * speculative confirm?
+ */
+#define rbtxn_has_spec_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -355,6 +364,9 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -545,7 +557,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0

Attachment: v19-0006-Add-support-for-streaming-to-built-in-replicatio.patch (application/octet-stream)
From 05c62aadd8dbd6aa9bf4c0999a2af1e6866a0fac Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 16 Apr 2020 01:55:22 -0700
Subject: [PATCH v19 06/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover we
don't have a replication connection open, so we have nowhere to send
the data anyway.
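
For example, the user-facing syntax looks like this (a sketch based on
the TAP tests and docs in this patch; connection string elided):

    CREATE SUBSCRIPTION mysub CONNECTION '...'
        PUBLICATION mypub WITH (streaming = on);

    -- the option can also be changed on an existing subscription
    ALTER SUBSCRIPTION mysub SET (streaming = off);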
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   12 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/launcher.c    |    1 -
 src/backend/replication/logical/logical.c     |    4 +-
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1033 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 ++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 22 files changed, 2043 insertions(+), 43 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8beadbaa..95b7c24ef9 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..3349cc4bfc 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 7f156673f7..65b6b76164 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3f8105c6eb..df3f64c7ba 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4133,6 +4133,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e987..8156a42ace 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 497d8a9c36..dfc681df43 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1148,7 +1148,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1193,7 +1193,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = txn->final_lsn;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..5242ac0efe 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
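+/*
+ * Write STREAM START to the output stream.
+ */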
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
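+/*
+ * Read STREAM START from the stream.
+ */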
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
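+/*
+ * Write STREAM STOP to the output stream.
+ */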
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
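+/*
+ * Write STREAM COMMIT to the output stream.
+ */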
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (we're committing a streamed transaction, so it must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
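+/*
+ * Read STREAM COMMIT from the stream.
+ */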
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
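+/*
+ * Write STREAM ABORT to the output stream.
+ */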
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction IDs (we're aborting a streamed transaction, so they must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
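+/*
+ * Read STREAM ABORT from the stream.
+ */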
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..026cd48bd0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions has
+ * to deal with aborts of both the toplevel transaction and subtransactions.
+ * This is achieved by tracking offsets for subtransactions, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
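+ * As an illustration only (a sketch of the message flow implied by the
+ * protocol changes in this patch, not normative documentation), a large
+ * transaction arrives as a series of streamed segments:
+ *
+ *   'S' STREAM START ... decoded changes ... 'E' STREAM STOP   (repeated)
+ *   'c' STREAM COMMIT (replay the spooled changes), or 'A' STREAM ABORT
+ *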
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing a streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;						/* XID of the subxact */
+	off_t           offset;					/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +657,319 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, restore the serialized subxact info.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM ABORT message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're most
+		 * likely aborting one of the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+			return;
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +983,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1001,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1040,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1158,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1303,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1676,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1817,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1929,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1961,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2412,567 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main
+ * file. The file is always overwritten as a whole, and includes a CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
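+ *
+ * As a sketch, the resulting on-disk layout is (field widths follow the
+ * declarations of the variables written below):
+ *
+ *   checksum    uint32         CRC32C of nsubxacts and the subxacts array
+ *   nsubxacts   (as declared)  number of entries
+ *   subxacts    SubXactInfo[]  {xid, offset} pairs, nsubxacts entries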
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we do free the memory allocated for the subxact info. There might
+	 * be one exceptional transaction with many subxacts, and we don't want
+	 * to keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen
+	 * in the last call, so we can simply ignore it (the subxact is already
+	 * tracked and this change comes later in the file).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts.
+	 * We intentionally scan the array from the tail, because we're most
+	 * likely adding a change for one of the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
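+	 *
+	 * A hypothetical sketch of that search (plain integer comparison,
+	 * ignoring XID wraparound, which should be acceptable within a single
+	 * toplevel transaction):
+	 *
+	 *    int64  lo = 0, hi = nsubxacts - 1;
+	 *
+	 *    while (lo <= hi)
+	 *    {
+	 *        int64  mid = lo + (hi - lo) / 2;
+	 *
+	 *        if (subxacts[mid].xid == xid)
+	 *            return;                    // already tracked
+	 *        else if (subxacts[mid].xid < xid)
+	 *            lo = mid + 1;
+	 *        else
+	 *            hi = mid - 1;
+	 *    }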
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry in the array to this position. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts),
+	 * so a linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (not counting the
+ * length field itself), an action code (identifying the message type) and
+ * the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
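+ *
+ * So each on-disk record looks like this (a sketch; the read side in
+ * apply_handle_stream_commit expects the same framing):
+ *
+ *   len       int     size of action + contents, excluding len itself
+ *   action    char    message type ('I', 'U', 'D', 'R', 'Y', 'T')
+ *   contents  char[]  the logical replication message, minus the XID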
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3138,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 77b85fc655..811706a34f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,54 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order in which the transactions are sent. So streamed transactions
+ * are handled separately, by tracking the XIDs of streamed transactions
+ * for which the schema has already been sent (the streamed_txns list).
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +119,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +145,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +208,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +237,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +260,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +281,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version,
+		 * and when the output plugin supports it.
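+		 *
+		 * For illustration, a walsender command requesting streaming might
+		 * look like this (slot and publication names are made up):
+		 *
+		 *   START_REPLICATION SLOT "sub" LOGICAL 0/0
+		 *     (proto_version '2', publication_names '"pub"', streaming 'on')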
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +370,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's the top-level transaction or a subxact (we have already
+	 * sent the toplevel XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those are applied only later (and the regular
+	 * transactions won't see their effects until then), and possibly in
+	 * an order that we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to re-send the schema after each catalog change,
+		 * and such a change may occur while streaming is already in progress,
+		 * so we have to track new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +427,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +469,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +490,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +522,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +542,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +566,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +586,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +611,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +643,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +723,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +844,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions, so a
+ * simple linear search of the list is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -750,11 +999,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
+/*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -790,7 +1072,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index ae751e94e7..c02c1b620f 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -155,6 +155,12 @@ create_logical_replication_slot(char *name, char *plugin,
 									read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 8b55bbfcb2..65bb628438 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1009,6 +1009,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..3b3e1fde6f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -981,7 +981,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f1aa6e9977..70d39f880d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check data after subtransaction rollback was replicated correctly');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v19-0011-Provide-new-api-to-get-the-streaming-changes.patch
From 20ebc9bcac059cc9cf7af35d1eabb0f5e79bb6b2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v19 11/12] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 4 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8f34ce8deb..dd488cb2f8 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1242,6 +1242,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index fded8e8290..debb91b457 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -250,6 +251,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes, disable streaming. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -360,7 +364,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -369,7 +382,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -378,7 +391,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -387,7 +400,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9fb1ffe2c8..3dfc5c10fc 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10117,6 +10117,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0

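For illustration, a minimal usage sketch of the new function (slot and
table names are invented here, and it assumes the test_decoding stream
callbacks added elsewhere in this series):

-- Create a slot using test_decoding, which implements the stream callbacks.
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');

-- A committed transaction large enough to exceed logical_decoding_work_mem.
CREATE TABLE stream_test (a int, b text);
INSERT INTO stream_test SELECT i, md5(i::text) FROM generate_series(1, 100000) s(i);

-- Unlike pg_logical_slot_get_changes(), this leaves ctx->streaming enabled,
-- so the output contains the streamed blocks ("opening a streamed block for
-- transaction ...", "streaming change for TXN ...") rather than all the
-- changes at once at commit.
SELECT data FROM pg_logical_slot_get_streaming_changes('regression_slot', NULL, NULL);
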
v19-0012-Add-streaming-option-in-pg_dump.patch
From 3019fb4cab0803aba215468e9417be63f9c7e8c3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v19 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 5db4f5761d..11db7b79d7 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4210,6 +4210,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4244,8 +4245,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4258,6 +4259,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4274,6 +4276,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4351,6 +4354,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 61c909e06d..5c5b072a99 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0

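For reference, a sketch of the statement pg_dump would then emit for a
subscription whose substream flag is set (connection string and names
invented; the surrounding options follow the existing subscription dump
logic):

CREATE SUBSCRIPTION tap_sub CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION tap_pub WITH (connect = false, slot_name = 'tap_sub', streaming = on);
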
#295Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#294)
12 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have fixed one more issue in the 0010 patch. The issue was that once
a transaction had been serialized due to incomplete toast changes, the
serialized store was not cleaned up after streaming, so the same tuples
were being streamed multiple times.
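
A minimal sketch of the scenario (table name and sizes are illustrative,
not part of the patch):

-- With a small logical_decoding_work_mem in the decoding session, the large
-- TOASTed value below is queued as a series of toast-chunk changes followed
-- by the main-tuple change. The memory limit can be hit while that change
-- set is still incomplete, so the remainder is serialized to disk; the fix
-- discards that serialized store once the transaction has been streamed,
-- instead of streaming it again.
CREATE TABLE toasted_tab (a int, b text);
BEGIN;
INSERT INTO toasted_tab VALUES (1, repeat('x', 1000000));
COMMIT;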

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v20-0002-Issue-individual-invalidations-with-wal_level-lo.patch
From 384d8bcfe984df9620310c091fadc535c2f47024 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v20 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type, XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of the commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulated all the invalidations in memory
and wrote them out only once, at commit time, which reduces the
performance impact by amortizing the overhead and deduplicating the
invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c        |  40 +++++++
 src/backend/access/transam/xact.c             |   7 ++
 src/backend/replication/logical/decode.c      |  16 +++
 .../replication/logical/reorderbuffer.c       | 104 +++++++++++++++---
 src/backend/utils/cache/inval.c               |  49 +++++++++
 src/include/access/xact.h                     |  13 ++-
 src/include/replication/reorderbuffer.h       |  11 ++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index fbc5942578..17c06f7062 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c2604bb514..8e6b1a6ebc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * XXX We ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581d0f..69c1f45ef6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9509..b889edf461 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2202,6 +2216,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Queue invalidation messages as a change in the given transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2589,6 +2630,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3002,6 +3066,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..cba5b6c64b 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support decoding of in-progress transactions.  Previously it was enough
+ *	to log invalidations only at commit, because we only decoded the
+ *	transaction at commit time.  We only need to log the catalog cache and
+ *	relcache invalidations; there cannot be any active MVCC scan in logical
+ *	decoding, so we don't need to log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations()
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs  = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b3816c..b822c5e4b2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..af35287896 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * messages */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.23.0

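To illustrate why the per-command invalidations matter (a sketch; the table
name is invented, and the streaming behavior itself comes from later patches
in the series):

BEGIN;
-- large enough to exceed logical_decoding_work_mem, so this part may
-- already be streamed out before commit
INSERT INTO t SELECT i, md5(i::text) FROM generate_series(1, 50000) s(i);
-- the invalidations are now WAL-logged at command end as
-- XLOG_XACT_INVALIDATIONS, decoded into REORDER_BUFFER_CHANGE_INVALIDATION
-- changes, and executed during replay
ALTER TABLE t ADD COLUMN c int;
-- hence this part is decoded with the new column already visible
INSERT INTO t SELECT i, md5(i::text), -i FROM generate_series(50001, 50100) s(i);
COMMIT;
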
v20-0001-Immediately-WAL-log-assignments.patch
From 95cb7a2f0ee195bb1ce74f66ba413d88b38cdb69 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v20 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as it is still
required to avoid subxid overflow in the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 39 ++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3984dd3e1a..c2604bb514 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment has not yet been written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 4259309dba..3c49954b57 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976474..7b5257fe81 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1189,6 +1189,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1227,6 +1228,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..122c581d0f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel xid is valid, we need to assign the subxact to the
+	 * toplevel xact. We need to do this for all records, hence we do it before
+	 * the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04babc2..8645b3816c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0a12afb59e..3289ad753a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196e18..756f6df8cf 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -150,6 +150,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -285,6 +287,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0

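A sketch of the effect (table name invented): with wal_level = logical, the
first WAL record of each subtransaction now carries the toplevel XID, so the
decoder can assign the subxact to its top-level transaction immediately:

BEGIN;
INSERT INTO t VALUES (1);   -- toplevel xact writes WAL with its own XID
SAVEPOINT s1;
INSERT INTO t VALUES (2);   -- the subxact's first record also includes the
                            -- toplevel XID (XLR_BLOCK_ID_TOPLEVEL_XID), so
                            -- decoding calls ReorderBufferAssignChild here
COMMIT;
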
v20-0003-Extend-the-output-plugin-API-with-stream-methods.patch
From 284b454ee65f9edd9af7a9a30a39e97794adfc6e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v20 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes for large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 209 +++++++++++++
 src/backend/replication/logical/logical.c | 361 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 +++++
 src/include/replication/reorderbuffer.h   |  57 ++++
 6 files changed, 801 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe620..65244b1019 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,91 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and
+    <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to the spill-to-disk behavior, streaming is triggered when the
+    total amount of changes decoded from the WAL (for all in-progress
+    transactions) exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting. At that point, the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.
+   </para>
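+
+   <para>
+    For example, to allow up to 256MB of decoded changes to accumulate in
+    memory before streaming (or spilling to disk) is triggered, one might
+    run:
+<programlisting>
+SET logical_decoding_work_mem = '256MB';
+</programlisting>
+   </para>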
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253583..497d8a9c36 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,21 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +204,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the start/stop/change/commit/abort
+	 * callbacks. The message and truncate callbacks are optional, similar
+	 * to regular output plugins. We enable streaming when at least one of
+	 * the streaming callbacks is defined, however, so that missing required
+	 * callbacks can be detected easily.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional, so we
+	 * do not fail with ERROR when they are missing; the wrappers simply do
+	 * nothing in that case. We must still set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there would crash (and we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -862,6 +910,319 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	/* state.report_location = apply_lsn; */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 3b7ca7f1da..f24e2468ac 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287896..e102840486 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,52 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -392,6 +438,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0
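
To illustrate the plugin side of the API above, here is a minimal sketch of
the two block-demarcation callbacks (the my_* names are hypothetical, and a
real plugin would emit its own wire format rather than plain text):

#include "postgres.h"

#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

/*
 * Hypothetical example callbacks: emit a textual marker at the start and
 * end of each streamed block of changes, tagged with the toplevel XID.
 */
static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "opening a streamed block for transaction %u",
					 txn->xid);
	OutputPluginWrite(ctx, true);
}

static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "closing a streamed block for transaction %u",
					 txn->xid);
	OutputPluginWrite(ctx, true);
}

The remaining required callbacks (stream_change_cb, stream_commit_cb and
stream_abort_cb) follow the same prepare/write pattern, with the change,
commit or abort payload written in between.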

Attachment: v20-0005-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From 0b378ffc8bf286a4e9788d135c1061eaaa19d539 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 May 2020 19:56:35 +0530
Subject: [PATCH v20 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in the WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds a ReorderBufferTXN pointer in two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets aborted).
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 712 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  36 +
 3 files changed, 699 insertions(+), 87 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet, so the cmin is
+		 * definitely in the future and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet, so the cmax is
+		 * definitely in the future and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b889edf461..bc5821b2bf 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -236,6 +236,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +245,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +381,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -772,6 +785,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -865,6 +910,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -988,7 +1036,7 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  */
 
 /*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)
@@ -1024,6 +1072,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1089,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1315,6 +1369,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1340,9 +1403,94 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * any changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
- * HeapTupleSatisfiesHistoricMVCC.
+ * heapam_visibility.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples whose CIDs we have not yet decoded. Think e.g. about an
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build the hash table regardless,
+ * so that ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
-		return;
-
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
 
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
@@ -1491,59 +1636,76 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has made catalog changes, we might decode tuples using
+ * the wrong catalog version.  So, to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction to which the current
+ * change belongs.  During catalog scans we can then check the status of
+ * that xid, and if it has aborted we report a specific error that we can
+ * ignore.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine, because when we decode the
+ * abort we will stream an abort message to truncate the changes in the
+ * subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Set up CheckXidAlive if the xid is not yet committed. We don't check
+	 * whether the xid aborted; that will happen during catalog access. Also
+	 * reset the bsysscan flag.
 	 */
-	if (txn->base_snapshot == NULL)
+	if (!TransactionIdDidCommit(xid))
 	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send the data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true, the data will be sent using the
+ * stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
 
-	/* build data to be able to lookup the CommandIds of catalog tuples */
+	/*
+	 * build data to be able to lookup the CommandIds of catalog tuples
+	 */
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
@@ -1564,14 +1726,17 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* start streaming this chunk of transaction */
+		if (streaming)
+			rb->stream_start(rb, txn);
+		else
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1579,6 +1744,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 * use as a normal record. It'll be cleaned up at the end
 					 * of INSERT processing.
 					 */
-					if (specinsert == NULL)
-						elog(ERROR, "invalid ordering of speculative insertion changes");
 					Assert(specinsert->data.tp.oldtuple == NULL);
 					change = specinsert;
 					change->action = REORDER_BUFFER_CHANGE_INSERT;
@@ -1655,7 +1831,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						if (streaming)
+						{
+							rb->stream_change(rb, txn, relation, change);
+
+							/* Remember that we have sent some data for this xid. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_change(rb, txn, relation, change);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						 * freed/reused while restoring spooled data from
 						 * disk.
 						 */
-						Assert(change->data.tp.newtuple != NULL);
-
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
@@ -1695,7 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						if (streaming)
+						{
+							rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+							/* Remember that we have sent some data. */
+							change->txn->any_data_sent = true;
+						}
+						else
+							rb->apply_truncate(rb, txn, nrelations, relations, change);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+									change->data.msg.prefix,
+									change->data.msg.message_size,
+									change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,7 +1992,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1858,14 +2053,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the LSN of the last change in this stream as the final_lsn
+			 * before calling stream_stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if transaction is streaming
+		 * otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+
+			/* Avoid copying if it's already copied. */
+			if (snapshot_now->copied)
+				txn->snapshot_now = snapshot_now;
+			else
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2111,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,17 +2146,130 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				FlushErrorState();
+				FreeErrorData(errdata);
+				errdata = NULL;
+
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+
+				/* Avoid copying if it's already copied. */
+				if (snapshot_now->copied)
+					txn->snapshot_now = snapshot_now;
+				else
+					txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+															  txn, command_id);
+				/*
+				 * Set the LSN of the last change in this stream as the
+				 * final_lsn before calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+				ReorderBufferToastReset(rb, txn);
+				if (specinsert != NULL)
+				{
+					ReorderBufferReturnChange(rb, specinsert);
+					specinsert = NULL;
+				}
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1946,6 +2294,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2370,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2512,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2530,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2542,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2592,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2398,6 +2787,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because with streaming we don't update
+ * the memory accounting for subtransactions, so their size is always 0). But
+ * here we can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
@@ -2418,15 +2839,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory, either by streaming (if supported) or by spilling it to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3198,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes to stream
+ * (it might have been streamed just before the commit, in which case the
+ * commit would attempt to stream it again with nothing to send)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is being streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because after the last
+		 * streaming run we might have gotten some new subtransactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e102840486..6d65986a82 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -224,6 +243,16 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Have we sent any changes for this transaction to the output plugin?
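+	 * (Used to decide whether we need to invoke the stream_abort callback
+	 * when the transaction aborts.)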
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for toplevel transactions).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -254,6 +283,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0

Attachment: v20-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
From f26cc42abff8183c9240d6191b56d164cddf93da Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v20 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such a sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
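
For illustration, a caller decoding an in-progress transaction can
react to the new sqlerrcode roughly as follows. This is only a sketch
(ProcessDecodedChanges is a hypothetical helper, and memory context
switching is elided), not code from this patch:

    PG_TRY();
    {
        /*
         * systable_* scans done in here will ereport(ERROR) with
         * ERRCODE_TRANSACTION_ROLLBACK if CheckXidAlive has aborted.
         */
        ProcessDecodedChanges(txn);
    }
    PG_CATCH();
    {
        ErrorData  *errdata = CopyErrorData();

        if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
        {
            /* concurrent abort - stop decoding this xact gracefully */
            FlushErrorState();
            FreeErrorData(errdata);
        }
        else
            PG_RE_THROW();
    }
    PG_END_TRY();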
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 65244b1019..909a2139b6 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,9 +432,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
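+     For example, a lookup from an output plugin might look like this
+     (declarations and scan keys omitted for brevity):
+<programlisting>
+scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
+while (HeapTupleIsValid(tuple = systable_getnext(scan)))
+{
+    /* process tuple; a concurrent abort is reported as an ERROR */
+}
+systable_endscan(scan);
+</programlisting>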
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0d4ed602d7..8371ec6e81 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with a valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * the tableam level API, but heap_getnext is called from many places, so
+	 * we need to ensure it here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - handle a concurrent abort of the transaction
+ * identified by CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted. We can't directly use
+ * TransactionIdDidAbort, because after a crash such a transaction might not
+ * have been marked as aborted.  See detailed comments at snapmgr.c where
+ * the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..892d8db7ab 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..ad1d567172 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such a transaction gets aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To detect that,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for such
+ * concurrent aborts only in the systable_* APIs.
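+ *
+ * In short: the decoding logic sets CheckXidAlive before replaying changes
+ * of an in-progress (sub)transaction, systable_beginscan/systable_endscan
+ * set and reset bsysscan around each catalog scan, and the systable_*
+ * fetch routines re-check the transaction's status after each tuple.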
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 94903dd8de..fe4811e2db 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0

Attachment: v20-0007-Track-statistics-for-streaming.patch
From 04070297f7f3534707b97458ca238df6a6f47629 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 2 Apr 2020 13:19:29 +0530
Subject: [PATCH v20 07/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 25 +++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 13 ++++++++
 src/backend/replication/walsender.c           | 32 ++++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 252300db14..fb9c8e59af 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2063,6 +2063,31 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       may get spilled repeatedly, and this counter gets incremented on every
       such invocation.</entry>
     </row>
+    <row>
+     <entry><structfield>stream_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of in-progress transactions streamed to the subscriber
+      after the memory used by logical decoding exceeds
+      <literal>logical_decoding_work_mem</literal>.
+      Streaming only works with toplevel transactions (subtransactions can't
+      be streamed independently), so the counter does not get incremented for
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times in-progress transactions were streamed to the
+      subscriber.  Transactions may get streamed repeatedly, and this counter
+      gets incremented on every such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>stream_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded in-progress transaction data streamed to the
+      subscriber.
+      </entry>
+    </row>
+
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2bd5f5ea14..8f34ce8deb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bc5821b2bf..d80ad04363 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -331,6 +331,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3289,6 +3293,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't double-count transactions that have already been streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 65bb628438..058ee2b8b2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1345,7 +1345,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1366,7 +1366,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that were spilled to disk or
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2415,6 +2416,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3256,7 +3260,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3314,6 +3318,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3339,6 +3346,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3441,6 +3451,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3689,11 +3704,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4bce3ad8de..9fb1ffe2c8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6d65986a82..603f325663 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -517,15 +517,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 8876025aaa..0c4952a1fa 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0

Attachment: v20-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
From 26ed40559494a9b613b43784e31c634ce029a29f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v20 08/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71ca3..086d0c7f02 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 34ab11e7fe..ad3ed13ffc 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 81520a7332..0c9c6b3dd4 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

Attachment: v20-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
From fbf3b7f90e7b8587ddb62d317c2acf130432ea1e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v20 09/12] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

Attachment: v20-0006-Add-support-for-streaming-to-built-in-replicatio.patch
From 05c62aadd8dbd6aa9bf4c0999a2af1e6866a0fac Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 16 Apr 2020 01:55:22 -0700
Subject: [PATCH v20 06/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so there is nowhere
to send the data anyway.
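
For orientation, the pgoutput side of this mostly boils down to
registering the additional stream callbacks added earlier in this
series. A minimal sketch (the pgoutput_stream_* names here are
illustrative; the exact callback list in pgoutput.c may differ):

    void
    _PG_output_plugin_init(OutputPluginCallbacks *cb)
    {
        cb->startup_cb = pgoutput_startup;
        cb->begin_cb = pgoutput_begin_txn;
        cb->change_cb = pgoutput_change;
        cb->commit_cb = pgoutput_commit_txn;
        cb->shutdown_cb = pgoutput_shutdown;

        /* streaming of in-progress transactions */
        cb->stream_start_cb = pgoutput_stream_start;
        cb->stream_stop_cb = pgoutput_stream_stop;
        cb->stream_abort_cb = pgoutput_stream_abort;
        cb->stream_commit_cb = pgoutput_stream_commit;
        cb->stream_change_cb = pgoutput_stream_change;
    }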
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   12 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/launcher.c    |    1 -
 src/backend/replication/logical/logical.c     |    4 +-
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1033 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 ++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 22 files changed, 2043 insertions(+), 43 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8beadbaa..95b7c24ef9 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..3349cc4bfc 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 7f156673f7..65b6b76164 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3f8105c6eb..df3f64c7ba 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4133,6 +4133,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfoString(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
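
(For illustration only: with the streaming flag set, the START_REPLICATION
command built here comes out roughly as

    START_REPLICATION SLOT "sub1" LOGICAL 0/0
        (proto_version '...', streaming 'on', publication_names '"pub1"')

with made-up slot/publication names; the proto_version must be at least
LOGICALREP_PROTO_STREAM_VERSION_NUM, which is enforced in pgoutput below.)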
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e987..8156a42ace 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 497d8a9c36..dfc681df43 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1148,7 +1148,7 @@ stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_start";
-	/* state.report_location = apply_lsn; */
+	state.report_location = InvalidXLogRecPtr;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
@@ -1193,7 +1193,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_stop";
-	/* state.report_location = apply_lsn; */
+	state.report_location = txn->final_lsn;
 	errcallback.callback = output_plugin_error_callback;
 	errcallback.arg = (void *) &state;
 	errcallback.previous = error_context_stack;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..5242ac0efe 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (we're committing a streamed xact, so it must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction IDs (we're aborting a streamed xact, so both must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
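
(To make the framing concrete: inside a streaming block, each data
message - 'R', 'Y', 'I', 'U', 'D', 'T' - carries the XID of its
(sub)transaction right after the action byte, and the rest of the
message is unchanged. A minimal receiver-side sketch, with a
hypothetical helper name, assuming the usual pqformat API:

	/* sketch only - would live next to the logicalrep_read_* routines */
	static void
	sketch_read_streamed_dml(StringInfo s, bool in_streamed_transaction)
	{
		TransactionId xid = InvalidTransactionId;

		/* the action byte ('I', 'U', ...) has already been consumed */
		if (in_streamed_transaction)
			xid = pq_getmsgint(s, 4);	/* subxact the change belongs to */

		(void) xid;				/* would feed the subxact bookkeeping */

		/*
		 * From here on the message is identical to the non-streamed form,
		 * so it can be handed to logicalrep_read_insert() and friends.
		 */
	}

The real consumer side is handle_streamed_transaction() in worker.c below.)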
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..026cd48bd0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * also requires dealing with aborts of both the toplevel transaction and of
+ * subtransactions. This is achieved by tracking offsets of subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of the subscription.
+ * This is necessary so that different workers processing a remote transaction
+ * with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because apply_handle_stream_commit() calls apply_dispatch() */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +657,319 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the existing subxact info.
+	 *
+	 * XXX Note that the cleanup of stale files is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+			return;
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +983,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1001,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1040,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1158,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1303,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1676,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1817,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1929,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1961,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2412,567 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we free the memory allocated for the subxact info. There might be
+	 * one exceptional transaction with many subxacts, and we don't want to
+	 * keep the memory allocated forever.
+	 *
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as in the last call,
+	 * so just ignore it (its first-change offset is already recorded).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into its place. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: the length (not counting
+ * the length field itself), the action code (identifying the message type) and the message
+ * contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3138,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
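
(A note on the two per-transaction spool files used above, as written by
stream_write_change() and subxact_info_write() and read back by
apply_handle_stream_commit() and subxact_info_read():

    logical-<subid>-<xid>.changes
        a sequence of records [int32 len][char action][payload minus XID],
        where len counts the action byte plus the payload, but not itself

    logical-<subid>-<xid>.subxacts
        [uint32 crc32c][uint32 nsubxacts][SubXactInfo array], rewritten
        as a whole at each stream_stop or subxact abort

Both live in the default tablespace's temporary-files directory; per the
comments above, no fsync is needed, since after a crash the files are
simply recreated from scratch.)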
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 77b85fc655..811706a34f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,54 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order in which the transactions are sent. So streamed transactions
+ * are tracked separately, in the streamed_txns list of this entry.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +119,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +145,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +208,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +237,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +260,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +281,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient protocol version, and only when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +370,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent the
+	 * toplevel XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those are only applied at commit time (and regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change, and one
+		 * may occur when streaming has already started, so we have to track new catalog
+		 * changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +427,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +469,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +490,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +522,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +542,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +566,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +586,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +611,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +643,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +723,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +844,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -750,11 +999,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
+/*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -790,7 +1072,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index ae751e94e7..c02c1b620f 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -155,6 +155,12 @@ create_logical_replication_slot(char *name, char *plugin,
 									read_local_xlog_page, NULL, NULL,
 									NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 8b55bbfcb2..65bb628438 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1009,6 +1009,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..3b3e1fde6f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -981,7 +981,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f1aa6e9977..70d39f880d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check replicated data after aborted subtransactions');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check replicated data after DDL and aborted subtransactions');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

Attachment: v20-0011-Provide-new-api-to-get-the-streaming-changes.patch (application/octet-stream)
From 05bc4ad08edde37125199e0648f6d3baf5988ad7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v20 11/12] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 4 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8f34ce8deb..dd488cb2f8 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1242,6 +1242,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index fded8e8290..debb91b457 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -250,6 +251,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes then disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -360,7 +364,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -369,7 +382,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -378,7 +391,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -387,7 +400,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9fb1ffe2c8..3dfc5c10fc 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10117,6 +10117,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0
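
As a quick way to exercise the new function, here is a minimal (untested)
sketch using the test_decoding plugin; the slot and table names are made up:

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
CREATE TABLE stream_test(data text);

-- a transaction large enough to exceed logical_decoding_work_mem
BEGIN;
INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 1000) g(i);
COMMIT;

-- unlike pg_logical_slot_get_changes(), this lets the decoder stream the
-- large transaction instead of spilling it to disk
SELECT data FROM pg_logical_slot_get_streaming_changes('regression_slot', NULL, NULL,
                                                       'include-xids', '0');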

Attachment: v20-0010-Bugfix-handling-of-incomplete-toast-tuple.patch (application/octet-stream)
From e04f9eb82abddb7d1fbb07cb352f90770cb9d1c4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 28 Feb 2020 11:07:46 +0530
Subject: [PATCH v20 10/12] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 193 +++++++++++-------
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  24 ++-
 5 files changed, 158 insertions(+), 80 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8371ec6e81..8028820d0e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45ef6..c841687c66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d80ad04363..c7c2aaf0c1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -178,6 +178,11 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define ChangeIsInsertOrUpdate(action) \
+			(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+			((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+			((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -654,11 +659,14 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
-	ReorderBufferTXN *txn;
+	ReorderBufferTXN *txn, *toptxn;
+	bool	can_stream = false;
 
-	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	toptxn = txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
 
 	change->lsn = lsn;
 	change->txn = txn;
@@ -668,9 +676,49 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert then set the corresponding bit.  Otherwise,
+	 * if the toast insert bit is already set and this is an insert/update,
+	 * then clear the bit (the tuple is now complete).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) &&
+			ChangeIsInsertOrUpdate(change->action))
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+		can_stream = true;
+	}
+
+	/*
+	 * If this is a speculative insert then set the corresponding bit.
+	 * Otherwise, if the speculative insert bit is already set and this is a
+	 * spec confirm record, then clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+		can_stream = true;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/*
+	 * If streaming is enabled and we had serialized this transaction because
+	 * it contained an incomplete tuple, we can stream it now that the tuple
+	 * is complete.
+	 */
+	if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+		&& !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
+	{
+		ReorderBufferStreamTXN(rb, toptxn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
+	}
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -700,7 +748,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1476,6 +1524,13 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
+	/* remove entries spilled to disk */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
 	/* also reset the number of entries in the transaction */
 	txn->nentries_mem = 0;
 	txn->nentries = 0;
@@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
+							ReorderBufferToastAppendChunk(rb, txn, relation,
+														  change);
 					}
 
 			change_done:
@@ -2463,7 +2518,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2512,7 +2567,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2535,6 +2590,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2549,8 +2605,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2558,12 +2619,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2624,7 +2693,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2811,15 +2880,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and doesn't have incomplete
+		 * data, remember it.
+		 */
+		if (((!largest) || (txn->total_size > largest->total_size)) &&
+			((txn->total_size > 0) && !(rbtxn_has_toast_insert(txn)) &&
+			!(rbtxn_has_spec_insert(txn))))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2837,66 +2907,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/* Loop until we are under the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
-		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
-		 */
-		txn = ReorderBufferLargestTXN(rb);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferSerializeTXN(rb, txn);
+		/*
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
+		 */
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 603f325663..ba2ab7185c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* Does this transaction have a toast insert without the main table insert? */
+#define rbtxn_has_toast_insert(txn) \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+
+/*
+ * Does this transaction have a speculative insert without the corresponding
+ * spec confirm record?
+ */
+#define rbtxn_has_spec_insert(txn) \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -355,6 +364,9 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -545,7 +557,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool incomplete_data);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0

Attachment: v20-0012-Add-streaming-option-in-pg_dump.patch (application/octet-stream)
From 2f7116125980321bc802283e64c6b405bd908e92 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v20 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 5db4f5761d..11db7b79d7 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4210,6 +4210,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4244,8 +4245,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4258,6 +4259,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4274,6 +4276,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4351,6 +4354,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBufferStr(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 61c909e06d..5c5b072a99 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0
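With this change, a subscription created with streaming enabled would be
dumped with the new option included, for example (illustrative output
only; the subscription, connection, and publication names are made up):

CREATE SUBSCRIPTION sub1 CONNECTION 'host=primary dbname=postgres'
    PUBLICATION pub1 WITH (connect = false, slot_name = 'sub1',
    streaming = on);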

#296Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#295)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have fixed one more issue in the 0010 patch. The issue was that once
the transaction was serialized due to incomplete toast after
streaming, the serialized store was not cleaned up, so it was
streaming the same tuple multiple times.

I have reviewed a few patches (003, 004, and 005) and below are my comments.

v20-0003-Extend-the-output-plugin-API-with-stream-methods
----------------------------------------------------------------------------------------
1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+   int nrelations, Relation relations[],
+   ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

In the above and similar APIs, there are parameters like relation
which are not used. I think you should add some comments atop these
APIs to explain why that is so. I guess it is because we want to keep
them similar to the non-streaming versions of the APIs, and we can't
display the relation or other information as the transaction is still
in progress.

2.
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting.
+    At that point the largest toplevel transaction (measured by the amount
+    of memory currently used for decoded changes) is selected and streamed.
+   </para>

I think we need to explain here the cases/exceptions where we need to
spill even when streaming is enabled, and check whether this matches
the latest implementation; otherwise, update it.

3.
+ * To support streaming, we require change/commit/abort callbacks. The
+ * message callback is optional, similarly to regular output plugins.

/similarly/similar

4.
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_start";
+ /* state.report_location = apply_lsn; */

Why can't we supply the report_location here? I think here we need to
report txn->first_lsn if this is the very first stream and
txn->final_lsn if it is any consecutive one.
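In code form, the suggestion would amount to something like this in the
wrapper (a sketch only; how "is this the very first stream" gets
detected is left open here):

	state.report_location = first_stream ? txn->first_lsn : txn->final_lsn;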

5.
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_stop";
+ /* state.report_location = apply_lsn; */

Can't we report txn->final_lsn here?

6. I think it will be good if we can provide an example of streaming
changes via test_decoding at
https://www.postgresql.org/docs/devel/test-decoding.html. I think we
can also explain there why the user is not expected to see the actual
data in the stream.

v20-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
----------------------------------------------------------------------------------------
7.
+ /*
+ * We don't expect direct calls to table_tuple_get_latest_tid with valid
+ * CheckXidAlive  for catalog or regular tables.

There is an extra space between 'CheckXidAlive' and 'for'. I can see
similar problems in other places as well where this comment is used;
please fix those as well.

8.
+/*
+ * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing in
+ * which case we skip decoding that particular transaction. To ensure that we
+ * check whether the CheckXidAlive is aborted after fetching the tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs because we are checking for the
+ * concurrent aborts only in systable_* APIs.
+ */

In this comment, there is an inconsistency in the space used after
completing the sentence. In the part "transaction. To", single space
is used whereas at other places two spaces are used after a full stop.

v20-0005-Implement-streaming-mode-in-ReorderBuffer
-----------------------------------------------------------------------------
9.
Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke new stream API methods. This
happens in ReorderBufferStreamTXN() using about the same logic as
in ReorderBufferCommit() logic.

I think the above part of the commit message needs to be updated.

10.
Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

I don't think this part of the commit message is correct as we
sometimes need to spill even during streaming. Please check the
entire commit message and update according to the latest
implementation.

11.
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
*rb, ReorderBufferTXN *txn)
  dlist_iter iter;
  HASHCTL hash_ctl;

- if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
- return;
-

I don't understand this change. Why could "INSERT followed by
TRUNCATE" lead to a tuple that comes up for decoding before its
CID? The patch has made changes based on this assumption in
HeapTupleSatisfiesHistoricMVCC, which appears very risky, as the
behavior could depend on whether we are streaming the changes for an
in-progress xact or decoding at the commit of a transaction. We might
want to generate a test to validate this behavior.

Also, the comment refers to tqual.c which is wrong as this API is now
in heapam_visibility.c.

12.
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.
  */
- if (txn->base_snapshot == NULL)
+ if (!TransactionIdDidCommit(xid))
  {
- Assert(txn->ninvalidations == 0);
- ReorderBufferCleanupTXN(rb, txn);
- return;
+ CheckXidAlive = xid;
+ bsysscan = false;
  }

In the comment, the flag name 'sysbegin_called' should be bsysscan.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#297Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#296)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have fixed one more issue in the 0010 patch. The issue was that once
the transaction was serialized due to incomplete toast after
streaming, the serialized store was not cleaned up, so it was
streaming the same tuple multiple times.

I have reviewed a few patches (003, 004, and 005) and below are my comments.

Thanks for the review. I am replying to some of the comments where I
have confusion; the others are fine.

v20-0003-Extend-the-output-plugin-API-with-stream-methods
----------------------------------------------------------------------------------------
1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+   int nrelations, Relation relations[],
+   ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

In the above and similar APIs, there are parameters like relation
which are not used. I think you should add some comments atop these
APIs to explain why that is so. I guess it is because we want to keep
them similar to the non-streaming versions of the APIs, and we can't
display the relation or other information as the transaction is still
in progress.

I think the interfaces are designed that way because other decoding
plugins might need those parameters, e.g. in pgoutput we need the
change and relation, but not here. We have other similar examples too,
e.g. pg_decode_message has the txn parameter but does not use it. Do
you think we still need to add comments?

4.
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_start";
+ /* state.report_location = apply_lsn; */

Why can't we supply the report_location here? I think here we need to
report txn->first_lsn if this is the very first stream and
txn->final_lsn if it is any consecutive one.

I am not sure about this, because for the very first stream we will
report the location of the first lsn of the stream, while for a
consecutive stream we will report the last lsn in the stream.

11.
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
*/
static void
ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
*rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;

- if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
- return;
-

I don't understand this change. Why could "INSERT followed by
TRUNCATE" lead to a tuple that comes up for decoding before its
CID?

Actually, even if we haven't decoded the DDL operation yet, the tuple
might already have been deleted from the actual system table by the
next operation. E.g. while we are streaming the INSERT, it is possible
that the TRUNCATE has already deleted that tuple and set the cmax for
the tuple. Before the streaming patch, we were streaming the INSERT
only on commit, so by that time we had seen all the operations that
did DDL and we would already have prepared the tuple CID hash.

The patch has made changes based on this assumption in
HeapTupleSatisfiesHistoricMVCC, which appears very risky, as the
behavior could depend on whether we are streaming the changes for an
in-progress xact or decoding at the commit of a transaction. We might
want to generate a test to validate this behavior.

We have already added a test case for this: 011_stream_ddl.pl in
test/subscription.

Also, the comment refers to tqual.c which is wrong as this API is now
in heapam_visibility.c.

Ok, will fix.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#298Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#297)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

v20-0003-Extend-the-output-plugin-API-with-stream-methods
----------------------------------------------------------------------------------------
1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+   int nrelations, Relation relations[],
+   ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

In the above and similar APIs, there are parameters like relation
which are not used. I think you should add some comments atop these
APIs to explain why that is so. I guess it is because we want to keep
them similar to the non-streaming versions of the APIs, and we can't
display the relation or other information as the transaction is still
in progress.

I think the interfaces are designed that way because other decoding
plugins might need those parameters, e.g. in pgoutput we need the
change and relation, but not here. We have other similar examples too,
e.g. pg_decode_message has the txn parameter but does not use it. Do
you think we still need to add comments?

In that case, we can leave it, but let's ensure that we are not
exposing any parameter which is not used, and if there is one for some
reason, we should document it. I will also look into this.

4.
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_start";
+ /* state.report_location = apply_lsn; */

Why can't we supply the report_location here? I think here we need to
report txn->first_lsn if this is the very first stream and
txn->final_lsn if it is any consecutive one.

I am not sure about this, because for the very first stream we will
report the location of the first lsn of the stream, while for a
consecutive stream we will report the last lsn in the stream.

Yeah, that doesn't seem to be consistent. How about if we get it as an
additional parameter? The caller can pass the lsn of the very first
change it is trying to decode in this stream.

11.
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
*/
static void
ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
*rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;

- if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
- return;
-

I don't understand this change. Why could "INSERT followed by
TRUNCATE" lead to a tuple that comes up for decoding before its
CID?

Actually, even if we haven't decoded the DDL operation yet, the tuple
might already have been deleted from the actual system table by the
next operation. E.g. while we are streaming the INSERT, it is possible
that the TRUNCATE has already deleted that tuple and set the cmax for
the tuple. Before the streaming patch, we were streaming the INSERT
only on commit, so by that time we had seen all the operations that
did DDL and we would already have prepared the tuple CID hash.

Okay, but for that case, how good is it that we always allow the CID
hash table to be built even if there are no catalog changes in the TXN
(see changes in ReorderBufferBuildTupleCidHash)? Can't we detect that
while resolving the cmin/cmax?

Few more comments for v20-0005-Implement-streaming-mode-in-ReorderBuffer:
----------------------------------------------------------------------------------------------------------------
1.
/*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)

It seems to me the above comment change is not required as per the latest patch.

2.
 * For subtransactions, we only mark them as streamed when there are
+ * any changes in them.
+ *
+ * We do it this way because of aborts - we don't want to send aborts
+ * for XIDs the downstream is not aware of. And of course, it always
+ * knows about the toplevel xact (we send the XID in all messages),
+ * but we never stream XIDs of empty subxacts.
+ */
+ if ((!txn->toptxn) || (txn->nentries_mem != 0))
+ txn->txn_flags |= RBTXN_IS_STREAMED;

/when there are any changes in them/when there are changes in them. I
think we don't need 'any' in the above sentence.

3.
And, during catalog scan we can check the status of the xid and
+ * if it is aborted we will report a specific error that we can ignore.  We
+ * might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the abort we will
+ * stream abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)

In the above comment, I don't think it is right to say that we ignore
the error raised due to the aborted transaction. We need to say that
we discard the already streamed changes on such an error.

4.
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
  /*
- * If this transaction has no snapshot, it didn't make any changes to the
- * database, so there's nothing to decode.  Note that
- * ReorderBufferCommitChild will have transferred any snapshots from
- * subtransactions if there were any.
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.
  */
- if (txn->base_snapshot == NULL)
+ if (!TransactionIdDidCommit(xid))
  {
- Assert(txn->ninvalidations == 0);
- ReorderBufferCleanupTXN(rb, txn);
- return;
+ CheckXidAlive = xid;
+ bsysscan = false;
  }

I think this function is inline because it needs to be called for each
change. If that is the case (and even otherwise), isn't it better to
check whether the passed xid is the same as CheckXidAlive before
calling TransactionIdDidCommit? TransactionIdDidCommit can be costly,
and calling it for each change might not be a good idea.
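A sketch of the suggested fast path (assumed shape, not the patch's
actual code):

static inline void
SetupCheckXidLive(TransactionId xid)
{
	/*
	 * If CheckXidAlive is already set to this xid, we know it has not
	 * committed, so we can skip the potentially costly
	 * TransactionIdDidCommit() call.
	 */
	if (TransactionIdEquals(CheckXidAlive, xid))
		return;

	if (!TransactionIdDidCommit(xid))
	{
		CheckXidAlive = xid;
		bsysscan = false;
	}
}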

5.
setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.

/if the xid aborted/if the xid is aborted. missing comma after Also.

6.
ReorderBufferProcessTXN()
{
..
- /* build data to be able to lookup the CommandIds of catalog tuples */
+ /*
+ * build data to be able to lookup the CommandIds of catalog tuples
+ */
  ReorderBufferBuildTupleCidHash(rb, txn);
..
}

Is there a need to change the formatting of the comment?

7.
ReorderBufferProcessTXN()
{
..
if (using_subtxn)
- BeginInternalSubTransaction("replay");
+ BeginInternalSubTransaction("stream");
else
StartTransactionCommand();
..
}

I am not sure unconditionally changing "replay" to "stream" is a good
idea. How about something like BeginInternalSubTransaction(streaming
? "stream" : "replay");?

8.
@@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
  * use as a normal record. It'll be cleaned up at the end
  * of INSERT processing.
  */
- if (specinsert == NULL)
- elog(ERROR, "invalid ordering of speculative insertion changes");

You have removed this check, but all other handling of specinsert is
the same as far as this patch is concerned. Why so?

9.
@@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
  * freed/reused while restoring spooled data from
  * disk.
  */
- Assert(change->data.tp.newtuple != NULL);
-
  dlist_delete(&change->node);

Why is this Assert removed?

10.
@@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relations[nrelations++] = relation;
}

- rb->apply_truncate(rb, txn, nrelations, relations, change);
+ if (streaming)
+ {
+ rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+ /* Remember that we have sent some data. */
+ change->txn->any_data_sent = true;
+ }
+ else
+ rb->apply_truncate(rb, txn, nrelations, relations, change);

Can we encapsulate this in a separate function like
ReorderBufferApplyTruncate or something like that? Basically, rather
than having the streaming check in this function, let's do it in some
other internal function. And we can likewise do it for all the
streaming checks in this function, or at least wherever it is feasible.
That will make this function look clean.
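For illustration, the suggested encapsulation could look roughly like
this (a sketch mirroring the quoted code; the name and signature are
only suggestions):

static void
ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
						   int nrelations, Relation *relations,
						   ReorderBufferChange *change, bool streaming)
{
	if (streaming)
	{
		rb->stream_truncate(rb, txn, nrelations, relations, change);

		/* Remember that we have sent some data. */
		change->txn->any_data_sent = true;
	}
	else
		rb->apply_truncate(rb, txn, nrelations, relations, change);
}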

11.
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
{
..

I think the above comment needs to be updated after this patch. This
API can now be used during the decode of both an in-progress and a
committed transaction.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#299Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#298)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

v20-0003-Extend-the-output-plugin-API-with-stream-methods
----------------------------------------------------------------------------------------
1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+   int nrelations, Relation relations[],
+   ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

In the above and similar APIs, there are parameters like relation
which are not used. I think you should add some comments atop these
APIs to explain why that is so. I guess it is because we want to keep
them similar to the non-streaming versions of the APIs, and we can't
display the relation or other information as the transaction is still
in progress.

I think the interfaces are designed that way because other decoding
plugins might need those parameters, e.g. in pgoutput we need the
change and relation, but not here. We have other similar examples too,
e.g. pg_decode_message has the txn parameter but does not use it. Do
you think we still need to add comments?

In that case, we can leave it, but let's ensure that we are not
exposing any parameter which is not used, and if there is one for some
reason, we should document it. I will also look into this.

4.
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_start";
+ /* state.report_location = apply_lsn; */

Why can't we supply the report_location here? I think here we need to
report txn->first_lsn if this is the very first stream and
txn->final_lsn if it is any consecutive one.

I am not sure about this, because for the very first stream we will
report the location of the first lsn of the stream, while for a
consecutive stream we will report the last lsn in the stream.

Yeah, that doesn't seem to be consistent. How about if we get it as an
additional parameter? The caller can pass the lsn of the very first
change it is trying to decode in this stream.

Hmm, I think we need to call ReorderBufferIterTXNInit and
ReorderBufferIterTXNNext to get the first change of the stream; after
that we shall call stream start, and then we can find out the first
LSN of the stream. I will see how to do it so that it doesn't look
awkward. Basically, as of now, our code has this layout:

1. stream_start;
2. ReorderBufferIterTXNInit(rb, txn, &iterstate);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
stream changes
}
3. stream stop

So if we want to know the first lsn of this stream, then we shall do
something like this:

1. ReorderBufferIterTXNInit(rb, txn, &iterstate);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
2. if first_change
stream_start;

stream changes
}
3. stream stop
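A slightly more concrete sketch of that second layout (the
stream_started variable is assumed; this is not the actual patch code):

	bool		stream_started = false;

	ReorderBufferIterTXNInit(rb, txn, &iterstate);
	while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
	{
		/*
		 * Emit stream_start lazily, on the first change, so that the
		 * LSN of that change is known and can be used as the report
		 * location for this stream.
		 */
		if (!stream_started)
		{
			rb->stream_start(rb, txn);
			stream_started = true;
		}

		/* ... stream the change itself here ... */
	}

	if (stream_started)
		rb->stream_stop(rb, txn);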

11.
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
*/
static void
ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
*rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;

- if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
- return;
-

I don't understand this change. Why could "INSERT followed by
TRUNCATE" lead to a tuple that comes up for decoding before its
CID?

Actually, even if we haven't decoded the DDL operation yet, the tuple
might already have been deleted from the actual system table by the
next operation. E.g. while we are streaming the INSERT, it is possible
that the TRUNCATE has already deleted that tuple and set the cmax for
the tuple. Before the streaming patch, we were streaming the INSERT
only on commit, so by that time we had seen all the operations that
did DDL and we would already have prepared the tuple CID hash.

Okay, but for that case, how good is it that we always allow the CID
hash table to be built even if there are no catalog changes in the TXN
(see changes in ReorderBufferBuildTupleCidHash)? Can't we detect that
while resolving the cmin/cmax?

Maybe in ResolveCminCmaxDuringDecoding we can check whether
tuplecid_data is NULL; if so, we can return it as unresolved, and then
the caller can take a call based on that.
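In code, that might be as simple as an early exit at the top of
ResolveCminCmaxDuringDecoding (a sketch only):

	/*
	 * With streamed in-progress transactions the CID hash may not have
	 * been built yet; report the cmin/cmax as unresolved and let the
	 * caller decide what to do.
	 */
	if (tuplecid_data == NULL)
		return false;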

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#300Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#299)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, May 13, 2020 at 9:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

4.
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_start";
+ /* state.report_location = apply_lsn; */

Why can't we supply the report_location here? I think here we need to
report txn->first_lsn if this is the very first stream and
txn->final_lsn if it is any consecutive one.

I am not sure about this, because for the very first stream we will
report the location of the first lsn of the stream, while for a
consecutive stream we will report the last lsn in the stream.

Yeah, that doesn't seem to be consistent. How about if we get it as an
additional parameter? The caller can pass the lsn of the very first
change it is trying to decode in this stream.

Hmm, I think we need to call ReorderBufferIterTXNInit and
ReorderBufferIterTXNNext to get the first change of the stream; after
that we shall call stream start, and then we can find out the first
LSN of the stream. I will see how to do it so that it doesn't look
awkward. Basically, as of now, our code has this layout:

1. stream_start;
2. ReorderBufferIterTXNInit(rb, txn, &iterstate);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
stream changes
}
3. stream stop

So if we want to know the first lsn of this stream, then we shall do
something like this:

1. ReorderBufferIterTXNInit(rb, txn, &iterstate);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
2. if first_change
stream_start;

stream changes
}
3. stream stop

Yeah, something like that would work. I think you need to check that
it is the first change for the 'streaming' mode.

11.
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
*/
static void
ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
*rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;

- if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
- return;
-

I don't understand this change. Why could "INSERT followed by
TRUNCATE" lead to a tuple that comes up for decoding before its
CID?

Actually, even if we haven't decoded the DDL operation yet, the tuple
might already have been deleted from the actual system table by the
next operation. E.g. while we are streaming the INSERT, it is possible
that the TRUNCATE has already deleted that tuple and set the cmax for
the tuple. Before the streaming patch, we were streaming the INSERT
only on commit, so by that time we had seen all the operations that
did DDL and we would already have prepared the tuple CID hash.

Okay, but for that case, how good is it that we always allow the CID
hash table to be built even if there are no catalog changes in the TXN
(see changes in ReorderBufferBuildTupleCidHash)? Can't we detect that
while resolving the cmin/cmax?

Maybe in ResolveCminCmaxDuringDecoding we can check whether
tuplecid_data is NULL; if so, we can return it as unresolved, and then
the caller can take a call based on that.

Yeah, and add appropriate comments about why we are doing so and in
what kind of scenario that can happen.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#301Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#296)
12 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have fixed one more issue in the 0010 patch. The issue was that once
the transaction was serialized due to incomplete toast after
streaming, the serialized store was not cleaned up, so it was
streaming the same tuple multiple times.

I have reviewed a few patches (003, 004, and 005) and below are my comments.

v20-0003-Extend-the-output-plugin-API-with-stream-methods
----------------------------------------------------------------------------------------
2.
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting.
+    At that point the largest toplevel transaction (measured by the amount
+    of memory currently used for decoded changes) is selected and streamed.
+   </para>

I think we need to explain here the cases/exceptions where we need to
spill even when streaming is enabled, and check whether this matches
the latest implementation; otherwise, update it.

Done

3.
+ * To support streaming, we require change/commit/abort callbacks. The
+ * message callback is optional, similarly to regular output plugins.

/similarly/similar

Done

4.
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_start";
+ /* state.report_location = apply_lsn; */

Why can't we supply the report_location here? I think here we need to
report txn->first_lsn if this is the very first stream and
txn->final_lsn if it is any consecutive one.

Done

5.
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_stop";
+ /* state.report_location = apply_lsn; */

Can't we report txn->final_lsn here?

We are already setting this to txn->final_lsn in the 0006 patch, but I
have moved it into this patch now.

6. I think it will be good if we can provide an example of streaming
changes via test_decoding at
https://www.postgresql.org/docs/devel/test-decoding.html. I think we
can also explain there why the user is not expected to see the actual
data in the stream.

I have a few problems to solve here.
- For streaming transactions too, shall we show the actual values, or
shall we keep what is currently in the patch
(appendStringInfo(ctx->out, "streaming change for TXN %u",
txn->xid);)? I think we should show the actual values instead of what
we are doing now.
- In the documentation we cannot show a real example, because to show
the changes of an in-progress transaction we might have to insert a
lot of tuples. I think we can show the partial output?
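For instance, such partial output could look something like this
(illustrative only, based on the strings the patch currently emits; the
test_decoding option name for enabling streaming is hypothetical here):

postgres=# SELECT data FROM pg_logical_slot_get_changes('slot', NULL,
           NULL, 'stream-changes', '1');
                       data
--------------------------------------------------
 opening a streamed block for transaction TXN 508
 streaming change for TXN 508
 streaming change for TXN 508
 streaming change for TXN 508
 closing a streamed block for transaction TXN 508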

v20-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
----------------------------------------------------------------------------------------
7.
+ /*
+ * We don't expect direct calls to table_tuple_get_latest_tid with valid
+ * CheckXidAlive  for catalog or regular tables.

There is an extra space between 'CheckXidAlive' and 'for'. I can see
similar problems in other places as well where this comment is used;
please fix those as well.

Done

8.
+/*
+ * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing in
+ * which case we skip decoding that particular transaction. To ensure that we
+ * check whether the CheckXidAlive is aborted after fetching the tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs because we are checking for the
+ * concurrent aborts only in systable_* APIs.
+ */

In this comment, there is an inconsistency in the space used after
completing the sentence. In the part "transaction. To", single space
is used whereas at other places two spaces are used after a full stop.

Done

v20-0005-Implement-streaming-mode-in-ReorderBuffer
-----------------------------------------------------------------------------
9.
Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke new stream API methods. This
happens in ReorderBufferStreamTXN() using about the same logic as
in ReorderBufferCommit() logic.

I think the above part of the commit message needs to be updated.

Done

10.
Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

I don't think this part of the commit message is correct as we
sometimes need to spill even during streaming. Please check the
entire commit message and update according to the latest
implementation.

Done

11.
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
*/
static void
ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
*rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;

- if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
- return;
-

I don't understand this change. Why could "INSERT followed by
TRUNCATE" lead to a tuple that comes up for decoding before its
CID? The patch has made changes based on this assumption in
HeapTupleSatisfiesHistoricMVCC, which appears very risky, as the
behavior could depend on whether we are streaming the changes for an
in-progress xact or decoding at the commit of a transaction. We might
want to generate a test to validate this behavior.

Also, the comment refers to tqual.c which is wrong as this API is now
in heapam_visibility.c.

Done.

12.
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.
*/
- if (txn->base_snapshot == NULL)
+ if (!TransactionIdDidCommit(xid))
{
- Assert(txn->ninvalidations == 0);
- ReorderBufferCleanupTXN(rb, txn);
- return;
+ CheckXidAlive = xid;
+ bsysscan = false;
}

In the comment, the flag name 'sysbegin_called' should be bsysscan.

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v21-0003-Extend-the-output-plugin-API-with-stream-methods.patch (application/octet-stream)
From 4bcd163c32ec89d133442f57214922612d5812fc Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v21 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes for large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 213 +++++++++++++
 src/backend/replication/logical/logical.c | 363 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++
 src/include/replication/reorderbuffer.h   |  58 ++++
 6 files changed, 808 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe620..1b56daa4bb 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting.  At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may exceed the memory limit before having
+    decoded a complete tuple (e.g. when only the insert into the TOAST table
+    has been decoded, but not yet the insert into the main table).
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
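
A minimal sketch of how an output plugin would wire up the new streaming
callbacks, to complement the documentation above (the my_stream_* functions
are hypothetical and would be implemented by the plugin; only the streaming
part of _PG_output_plugin_init is shown):

#include "postgres.h"
#include "fmgr.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"
#include "replication/reorderbuffer.h"

PG_MODULE_MAGIC;

/* hypothetical plugin callbacks, implemented elsewhere */
static void my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn);
static void my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn);
static void my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							 Relation relation, ReorderBufferChange *change);
static void my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							XLogRecPtr abort_lsn);
static void my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							 XLogRecPtr commit_lsn);

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* regular (commit-time) callbacks omitted for brevity */

	/* the five required streaming callbacks must be set together */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_change_cb = my_stream_change;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;

	/* stream_message_cb and stream_truncate_cb may be left NULL */
}
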
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index dc69e5ce5f..96f3859485 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,22 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +205,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the start/stop/change/commit/abort
+	 * callbacks. The message and truncate callbacks are optional, similar
+	 * to regular output plugins. We nevertheless enable streaming when at
+	 * least one of the callbacks is defined, to easily catch missing ones.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * Streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional, so
+	 * we do not fail with ERROR when they are missing; their wrappers
+	 * simply do nothing. We must still set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there would crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +915,320 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287896..0851e1ad78 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,53 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn);
+
 struct ReorderBuffer
 {
 	/*
@@ -392,6 +439,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0

v21-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patchapplication/octet-stream; name=v21-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patchDownload
From d45c7c4f13d1ababfedd5cdc5417335eac8cc5b9 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v21 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by having the system table scan APIs raise an
error with sqlerrcode ERRCODE_TRANSACTION_ROLLBACK in the backend that
is decoding the uncommitted transaction. On receipt of that error code,
the decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)
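
To illustrate the receiving side (a sketch only; stream_changes() stands
in for the actual streaming code in reorderbuffer.c, and rb/txn are assumed
to be in scope), a concurrent abort can be caught like this:

	MemoryContext oldcontext = CurrentMemoryContext;

	PG_TRY();
	{
		/* may scan catalogs through the systable_* APIs */
		stream_changes(rb, txn);
	}
	PG_CATCH();
	{
		ErrorData  *errdata;

		/* CopyErrorData() must not be called while in ErrorContext */
		MemoryContextSwitchTo(oldcontext);
		errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/* concurrent abort: discard the error and stop decoding */
			FlushErrorState();
			FreeErrorData(errdata);
		}
		else
		{
			FreeErrorData(errdata);
			PG_RE_THROW();
		}
	}
	PG_END_TRY();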

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 1b56daa4bb..5f7394f3c1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,9 +432,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94eb37d48d..2d77107c4f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * the tableam API level, but this function is called from many places,
+	 * so we need to ensure it here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out, if CheckXidAlive is aborted. We can't directly use
+ * TransactionIdDidAbort as after crash such transaction might not have been
+ * marked as aborted.  See detailed comments at snapmgr.c where the variable
+ * is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..2f52b407c6 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..9f1ecd123f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions get aborted while the decoding is still ongoing,
+ * in which case we skip decoding that particular transaction.  To ensure
+ * that, we check whether CheckXidAlive has aborted after fetching each
+ * tuple from system tables.  We also ensure that during logical decoding
+ * we never directly access the tableam or heap APIs, because we check for
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8c34935c34..9d890d3c4b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0

v21-0001-Immediately-WAL-log-assignments.patchapplication/octet-stream; name=v21-0001-Immediately-WAL-log-assignments.patchDownload
From 63bed6b2ed7844dd78eb2934b572466cfc671284 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v21 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as it is still
required to avoid overflowing the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 39 ++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)
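
Condensed from the hunks below, for quick reference: the assignment is
piggy-backed on the subtransaction's next WAL record as a one-byte block
ID followed by the raw toplevel xid, and the decoding side picks it up
before dispatching the record:

	/* writer side, in XLogRecordAssemble() */
	*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
	memcpy(scratch, &xid, sizeof(TransactionId));
	scratch += sizeof(TransactionId);

	/* consumer side, in LogicalDecodingProcessRecord() */
	txid = XLogRecGetTopXid(record);
	if (TransactionIdIsValid(txid))
		ReorderBufferAssignChild(ctx->reorder, txid,
								 record->decoded_record->xl_xid,
								 buf.origptr);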

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd30b62d36..3af8e81af1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and the assignment must not have been written to WAL yet */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09e..53be2b3059 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5995798b58..560ec27fa0 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1195,6 +1195,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1233,6 +1234,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..122c581d0f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel xid is valid, we need to assign the subxact to the
+	 * toplevel xact. We need to do this for all records, hence we do it before
+	 * the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04babc2..8645b3816c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d..26426cc779 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index c21b0ba972..83170a663c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -308,6 +310,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0

v21-0002-Issue-individual-invalidations-with-wal_level-lo.patchapplication/octet-stream; name=v21-0002-Issue-individual-invalidations-with-wal_level-lo.patchDownload
From 44afef284200ccb48460ff7d819930858658a466 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v21 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record type
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in memory
and writes them out only once, at commit time, which may reduce the
performance impact by amortizing the overhead and deduplicating the
invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c        |  40 +++++++
 src/backend/access/transam/xact.c             |   7 ++
 src/backend/replication/logical/decode.c      |  16 +++
 .../replication/logical/reorderbuffer.c       | 104 +++++++++++++++---
 src/backend/utils/cache/inval.c               |  49 +++++++++
 src/include/access/xact.h                     |  13 ++-
 src/include/replication/reorderbuffer.h       |  11 ++
 7 files changed, 226 insertions(+), 14 deletions(-)
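
For quick reference, the payload of the new record type (declared in
xact.h by this patch; reconstructed here from its usage in decode.c and
xactdesc.c):

	typedef struct xl_xact_invalidations
	{
		int			nmsgs;	/* number of shared inval messages */
		SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
	} xl_xact_invalidations;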

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce75565f..7ab0d11ea9 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3af8e81af1..e576b10055 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581d0f..69c1f45ef6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9509..b889edf461 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2202,6 +2216,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Queue the invalidation messages as a change in the specified transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2589,6 +2630,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3002,6 +3066,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..cba5b6c64b 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of in-progress transactions.  Until now it was enough
+ *	to log invalidations only at commit, because we only decoded the
+ *	transaction at commit time.  We only need to log catalog cache and
+ *	relcache invalidations; there cannot be any active MVCC scan in logical
+ *	decoding, so we need not log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages = NULL;
+	int			nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b3816c..b822c5e4b2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..af35287896 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * messages */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void		ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+										 int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.23.0
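
As a side note for reviewers: the XLOG_XACT_INVALIDATIONS record added above
gets consumed on the decoding side roughly as sketched below. This is only an
illustrative sketch; the function name and the XLogRecordBuffer plumbing are
assumptions (matching the style of the other DecodeXact* helpers in decode.c),
not a verbatim part of the patch:

    /*
     * Sketch: handle an XLOG_XACT_INVALIDATIONS record by queueing the
     * messages into the reorder buffer via the new API above.
     */
    static void
    DecodeXactInvalidations(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
    {
        XLogReaderState *r = buf->record;
        xl_xact_invalidations *xlrec;

        xlrec = (xl_xact_invalidations *) XLogRecGetData(r);

        /* queue the messages as a change of the (sub)transaction */
        ReorderBufferAddInvalidation(ctx->reorder, XLogRecGetXid(r),
                                     buf->origptr, xlrec->nmsgs, xlrec->msgs);
    }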

Attachment: v21-0005-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From 6bdc0252313498e07c8f93f62d7b5fa0de465ee4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 May 2020 19:56:35 +0530
Subject: [PATCH v21 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes
we have in memory and invoke the new stream API methods. This happens
in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().  However, if we have an incomplete TOAST tuple
or a speculative insert, we spill to disk because we cannot generate
the complete tuple and stream it.  As soon as we get the complete
tuple, we stream the transaction including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 739 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  36 +
 3 files changed, 735 insertions(+), 78 deletions(-)
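
For readers following along, the net effect on an output plugin is a new
callback sequence. A rough sketch of what a streamed transaction looks like
from the plugin's perspective (callback names per the patch, argument lists
abbreviated; the exact number of start/stop runs depends on when the memory
limit is hit):

    stream_start(txn);             /* first change of each in-memory run */
        stream_change(txn, relation, change);   /* zero or more changes */
        stream_message(txn, lsn, ...);          /* and/or logical messages */
    stream_stop(txn);              /* end of this in-memory run */
    /* ... possibly more start/stop runs as the limit is hit again ... */
    stream_commit(txn, commit_lsn);             /* at toplevel commit */
    /* or stream_abort(txn, abort_lsn) after an abort, but only if any
     * data was actually sent for the transaction */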

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b889edf461..e7249f874d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +382,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -772,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -865,6 +911,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -1024,6 +1073,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1090,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1315,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1340,6 +1404,80 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1491,57 +1629,118 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such case if the
+ * (sub)transaction has catalog update then we might decode the tuple using
+ * wrong catalog version.  So for detecting the concurrent abort we set
+ * CheckXidAlive to the current (sub)transaction's xid for which this change
+ * belongs to.  And, during catalog scan we can check the status of the xid and
+ * if it is aborted we will report a specific error so that we can stop
+ * streaming current transaction and discard the already streamed changes on
+ * such an error.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine because when we decode the abort
+ * we will stream abort messageto truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
+	/*
+	 * If the input transaction ID is already set as CheckXidAlive, there is
+	 * nothing to do.
+	 */
+	if (TransactionIdEquals(CheckXidAlive, xid))
+		return;
 
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Setup CheckXidAlive only if it's in progress. We don't check if the xid
+	 * is aborted. That will happen during catalog access.  Also reset the
+	 * bsysscan flag.
 	 */
-	if (txn->base_snapshot == NULL)
+	if (TransactionIdIsInProgress(xid))
 	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferProcessTXN to apply a change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+	{
+		rb->stream_change(rb, txn, relation, change);
+
+		/* Remember that we have sent some data for this xid. */
+		change->txn->any_data_sent = true;
+	}
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to apply a truncate.
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+	{
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+		/* Remember that we have sent some data. */
+		change->txn->any_data_sent = true;
+	}
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true, the data is sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1564,21 +1763,43 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
-
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * If this is the first change in the current stream, start the
+			 * stream or begin the transaction, as appropriate.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+					rb->stream_start(rb, txn, change->lsn);
+				else
+					rb->begin(rb, txn);
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1655,7 +1876,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1695,7 +1917,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +1975,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +1987,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					if (streaming)
+						rb->stream_message(rb, txn, change->lsn, true,
+										   change->data.msg.prefix,
+										   change->data.msg.message_size,
+										   change->data.msg.message);
+					else
+						rb->message(rb, txn, change->lsn, true,
+									change->data.msg.prefix,
+									change->data.msg.message_size,
+									change->data.msg.message);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,7 +2027,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1858,14 +2088,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			/*
+			 * Set the last LSN of the stream as the final_lsn before
+			 * calling stream_stop.
+			 */
+			if (!XLogRecPtrIsInvalid(prev_lsn))
+				txn->final_lsn = prev_lsn;
+			rb->stream_stop(rb, txn);
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+		{
+			txn->command_id = command_id;
+
+			/* Avoid copying if it's already copied. */
+			if (snapshot_now->copied)
+				txn->snapshot_now = snapshot_now;
+			else
+				txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+														  txn, command_id);
+		}
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2146,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,17 +2181,129 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/* Re-throw only if it's not an abort. */
+			if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+			else
+			{
+				FlushErrorState();
+				FreeErrorData(errdata);
+				errdata = NULL;
+
+				/* remember the command ID and snapshot for the streaming run */
+				txn->command_id = command_id;
+
+				/* Avoid copying if it's already copied. */
+				if (snapshot_now->copied)
+					txn->snapshot_now = snapshot_now;
+				else
+					txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+															  txn, command_id);
+				/*
+				 * Set the last LSN of the stream as the final_lsn before
+				 * calling stream_stop.
+				 */
+				txn->final_lsn = prev_lsn;
+				rb->stream_stop(rb, txn);
+				ReorderBufferToastReset(rb, txn);
+				if (specinsert != NULL)
+				{
+					ReorderBufferReturnChange(rb, specinsert);
+					specinsert = NULL;
+				}
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read, for both streamed
+ * and non-streamed transactions.  We iterate over the top and
+ * subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 *
+	 * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1946,6 +2328,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2404,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node
+	 * about the abort only if we have sent any data for this transaction.
+	 */
+	if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2546,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we update the toplevel transaction counters
+ * instead - we can't stream subtransactions individually anyway, and we
+ * only pick toplevel transactions for eviction, so only the toplevel
+ * counters matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2564,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2576,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2626,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2711,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2398,6 +2821,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming, so their size is always 0).
+ * But here we can simply
+ * iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
@@ -2418,15 +2873,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3232,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes to stream
+ * (it might have been streamed right before the commit, which would
+ * attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is being streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have snapshot from the previous streaming run. We
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because we might have
+		 * gotten new sub-transactions since the last streaming run, and we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3864,6 +4446,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  In such cases we assume the CID is from a future command
+	 * and return it as unresolved.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0851e1ad78..da32fbfd1c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -224,6 +243,16 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Have we sent any changes for this transaction to the output plugin?
+	 */
+	bool		any_data_sent;
+
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -254,6 +283,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0
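
One piece worth spelling out, since the patch above only describes it in
comments: concurrent-abort detection relies on catalog-access code noticing
that CheckXidAlive has aborted and raising ERRCODE_TRANSACTION_ROLLBACK,
which the PG_CATCH block in ReorderBufferProcessTXN() then swallows instead
of re-throwing. A sketch of what such a check could look like (an assumption
about code outside the hunks shown here, not a quote from the patch):

    /*
     * Sketch (assumed): run during system catalog scans while decoding.
     * If the xact being streamed has aborted concurrently, raise the
     * specific error that ReorderBufferProcessTXN() treats as "stop
     * streaming and truncate the changes".
     */
    if (TransactionIdIsValid(CheckXidAlive) &&
        !TransactionIdIsInProgress(CheckXidAlive) &&
        !TransactionIdDidCommit(CheckXidAlive))
        ereport(ERROR,
                (errcode(ERRCODE_TRANSACTION_ROLLBACK),
                 errmsg("transaction aborted during system catalog scan")));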

Attachment: v21-0006-Add-support-for-streaming-to-built-in-replicatio.patch (application/octet-stream)
From 9fb82bc9caadac4a9553f18c47c34ff0f30714a3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 14 May 2020 21:27:46 +0530
Subject: [PATCH v21 06/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information
(e.g. the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions, by spilling the data to disk and then
replaying it on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open yet, so there is
nowhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   12 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/launcher.c    |    1 -
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1033 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 ++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 21 files changed, 2041 insertions(+), 41 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
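
Before diving into the hunks: the protocol extension largely boils down to
prefixing streamed data messages with the (sub)transaction's XID, right
after the action byte. A hypothetical reader for that prefix on the apply
side could look like the sketch below (the helper name is made up for
illustration; pq_sendint32()/pq_getmsgint() match the calls used in the
proto.c hunks that follow):

    /*
     * Sketch: streamed messages carry the (sub)transaction XID right
     * after the action byte, written with pq_sendint32().  A reader
     * on the subscriber side would simply do:
     */
    static TransactionId
    logicalrep_read_stream_xid(StringInfo in)
    {
        return (TransactionId) pq_getmsgint(in, 4);
    }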

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8beadbaa..95b7c24ef9 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..3349cc4bfc 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026187..9065a1be1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e246be388b..90182a0181 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4139,6 +4139,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e987..8156a42ace 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..5242ac0efe 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (the transaction must have been streamed, so valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* toplevel and subxact XIDs (both must be valid for an abort) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
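
To make the new wire format concrete, here is a minimal round-trip sketch
(not part of the patch) that serializes a STREAM START message and reads it
back. It assumes a backend context where the logicalrep_* helpers above are
available; on the real apply path the leading action byte ('S') is consumed
by apply_dispatch before the reader runs, which the sketch mimics by
advancing the cursor:

#include "postgres.h"

#include "lib/stringinfo.h"
#include "replication/logicalproto.h"

static void
stream_start_roundtrip(TransactionId xid)
{
	StringInfoData buf;
	bool		first_segment;
	TransactionId read_xid;

	initStringInfo(&buf);

	/* writes 'S', then the 4-byte XID, then the first-segment flag byte */
	logicalrep_write_stream_start(&buf, xid, true);

	/* skip the action byte, as apply_dispatch would */
	buf.cursor = 1;
	read_xid = logicalrep_read_stream_start(&buf, &first_segment);

	Assert(read_xid == xid);
	Assert(first_segment);

	pfree(buf.data);
}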
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..026cd48bd0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, the apply of streamed transactions
+ * has to handle aborts of both the toplevel transaction and of individual
+ * subtransactions. This is achieved by tracking offsets of subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;						/* XID of the subxact */
+	off_t		offset;					/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of XIDs of toplevel transactions spilled to files, so that the
+ * files can be cleaned up at worker exit.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of apply_handle_stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If we're in streaming mode (receiving a block of a streamed transaction),
+ * we simply redirect the message to the spool file for the proper toplevel
+ * transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +657,319 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify the apply handlers that we're processing a streamed transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, restore the information about
+	 * the subxacts of this transaction.
+	 *
+	 * XXX Note that the cleanup of stale files after a crash is performed
+	 * by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+			return;
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
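+
+/*
+ * Worked example (hypothetical XIDs and offsets): suppose the subxact file
+ * for the toplevel transaction records
+ *
+ *    subxacts[0] = { .xid = 501, .offset = 100 }
+ *    subxacts[1] = { .xid = 502, .offset = 250 }
+ *
+ * and a STREAM ABORT arrives for subxid 502. The loop above finds
+ * subidx = 1, the changes file is truncated back to 250 bytes (discarding
+ * everything written by subxact 502 and anything after it), nsubxacts
+ * drops to 1, and the shortened subxact list is written back out.
+ */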
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure the change is applied in the per-message memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +983,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1001,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1040,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1158,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1303,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1676,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1817,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1929,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleaning up files for %d streamed transactions", nxids);
+
+	for (i = nxids - 1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1961,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2412,567 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole, and we also include a CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we do free the memory allocated for the subxact info. There might
+	 * be one exceptional transaction with many subxacts, and we don't want
+	 * to keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
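+
+/*
+ * For reference, the on-disk layout of the subxact file written by
+ * subxact_info_write() and consumed above is simply (fields stored back
+ * to back, in this order):
+ *
+ *    uint32       checksum;      CRC32C over nsubxacts and the array
+ *    uint32       nsubxacts;     number of entries that follow
+ *    SubXactInfo  subxacts[nsubxacts];
+ */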
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, in which case we can simply ignore it (its offset has
+	 * already been recorded).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
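+
+/*
+ * Example (hypothetical OIDs and XIDs): for subscription OID 16394 and
+ * toplevel XID 123456 in the default tablespace, the two functions above
+ * produce paths like
+ *
+ *    base/pgsql_tmp/logical-16394-123456.subxacts
+ *    base/pgsql_tmp/logical-16394-123456.changes
+ */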
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by moving the last element into its place. The array is
+	 * bound to be fairly small (limited by the maximum number of
+	 * in-progress xacts, so max_connections + max_prepared_transactions),
+	 * so simply loop through the array to find the index of the XID.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with a length (not counting
+ * the length field itself), an action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so would not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
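+
+/*
+ * For reference, each record written above thus has the on-disk layout
+ *
+ *    int    len;        size of the action byte plus the payload
+ *    char   action;     original message type ('I', 'U', 'D', ...)
+ *    char   data[];     message payload, with the subxact XID stripped
+ *
+ * which is exactly what the read loop in apply_handle_stream_commit()
+ * expects to find.
+ */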
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3138,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3118..a94b4a0136 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,54 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record for the
+ * relation was already sent to the subscriber (in which case we don't
+ * need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may be different
+ * from the order in which the transactions are sent. So streamed
+ * transactions are tracked separately, using the streamed_txns list of
+ * toplevel XIDs below.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +119,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +145,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +208,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +237,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +260,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +281,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version,
+		 * and only when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +370,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * if it's a top-level transaction or not (we have already sent that XID
+	 * at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied only later (and regular
+	 * transactions won't see their effects until then), and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change, and
+		 * such a change may occur when streaming has already started, so we
+		 * have to track new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +427,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +469,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +490,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +522,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +542,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +566,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +586,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +611,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +643,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +723,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
+ * Start a block of streamed changes for the given toplevel transaction.
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+
+/*
+ * Stop the current block of streamed changes.
+ */
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
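+
+/*
+ * Putting the callbacks together, a single large transaction is sent
+ * downstream as a sequence of chunks along these lines (sketch):
+ *
+ *    S(xid, first=1)  <changes>  E      -- first streamed chunk
+ *    S(xid, first=0)  <changes>  E      -- any number of later chunks
+ *    c(xid, lsns, commit_time)          -- stream commit
+ *
+ * Aborts of subtransactions (or of the whole transaction, in which case
+ * xid == subxid) arrive between chunks as A(xid, subxid) messages.
+ */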
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +844,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * Check if the schema was already sent in the given streamed transaction.
+ *
+ * We expect a relatively small number of streamed transactions, so a
+ * simple linear search of the list is fine.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -753,11 +1002,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
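
Since the streaming callbacks are optional, a third-party output plugin opts
in simply by filling them in at init time, mirroring what pgoutput does
above. A minimal sketch - the my_* handlers are hypothetical placeholders
assumed to be defined elsewhere in the plugin, with the signatures from
output_plugin.h:

#include "postgres.h"

#include "replication/output_plugin.h"

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* regular callbacks, as before */
	cb->startup_cb = my_startup;
	cb->begin_cb = my_begin;
	cb->change_cb = my_change;
	cb->commit_cb = my_commit;
	cb->shutdown_cb = my_shutdown;

	/* opt into streaming of large in-progress transactions */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_change;	/* may reuse the regular handler */
}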
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6fed3cfd23..e1344ab4cc 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -157,6 +157,12 @@ create_logical_replication_slot(char *name, char *plugin,
 											   .segment_close = wal_segment_close),
 									NULL, NULL, NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a4ca8daea7..6def1b96c9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1020,6 +1020,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ae9a39573c..70826c1cef 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -980,7 +980,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
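+/*
+ * The messages below stream changes of large in-progress transactions, and
+ * commit or abort a previously streamed transaction.
+ */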
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ac1acbb27e..95132062c6 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction containing DDL, subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0
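
For reference, once the whole series is applied, streaming is opt-in per
subscription. A minimal usage sketch (sub1/pub1 and the connection string
are placeholders, not part of the patches):

  -- on the subscriber: request streaming of large in-progress transactions
  CREATE SUBSCRIPTION sub1
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub1
    WITH (streaming = on);

  -- on the publisher: lower the decoding memory limit, so that streaming
  -- kicks in sooner (the TAP tests use the same 64kB value)
  ALTER SYSTEM SET logical_decoding_work_mem = '64kB';
  SELECT pg_reload_conf();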

Attachment: v21-0008-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From ec24e7547b0d46fefadf56c92c0237742e77a9b6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v21 08/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71ca3..086d0c7f02 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f871a..4ba80869b9 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

Attachment: v21-0009-Add-TAP-test-for-streaming-vs.-DDL.patch (application/octet-stream)
From a10a2298862062a7bc33740e8701239139f40c40 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v21 09/12] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0
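
The next patch addresses transactions that must not be streamed mid-tuple.
A single INSERT of a toasted value is decoded as a sequence of toast-chunk
inserts followed by the main-table insert, so streaming must not kick in
between those changes. A sketch of the scenario (table and value sizes are
illustrative only):

  CREATE TABLE t (a int PRIMARY KEY, b text);
  -- b is wide enough to be stored out-of-line, so the WAL contains several
  -- toast-relation inserts before the matching main-table insert
  INSERT INTO t
    SELECT 1, string_agg(md5(g::text), '') FROM generate_series(1, 100000) g;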

Attachment: v21-0010-Bugfix-handling-of-incomplete-toast-tuple.patch (application/octet-stream)
From dfe82b185d3564efe3c2da405ae9f3fe69a66507 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 14 May 2020 21:28:56 +0530
Subject: [PATCH v21 10/12] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 193 +++++++++++-------
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  24 ++-
 5 files changed, 158 insertions(+), 80 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2d77107c4f..3927448f46 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
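+			/*
+			 * Mark inserts into toast relations, so that logical decoding
+			 * can tell the tuple is not yet complete.
+			 */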
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45ef6..c841687c66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c2a012dd3c..14366aa265 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,11 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define ChangeIsInsertOrUpdate(action) \
+			(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+			((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+			((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -655,11 +660,14 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
-	ReorderBufferTXN *txn;
+	ReorderBufferTXN *txn, *toptxn;
+	bool	can_stream = false;
 
-	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	toptxn = txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
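+	/* The incomplete-data flags (toast/spec inserts) below are tracked on toptxn. */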
 
 	change->lsn = lsn;
 	change->txn = txn;
@@ -669,9 +677,49 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert, set the corresponding bit. Otherwise, if
+	 * the toast-insert bit is already set and this is an insert/update,
+	 * clear the bit - the tuple is now complete.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 ChangeIsInsertOrUpdate(change->action))
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+		can_stream = true;
+	}
+
+	/*
+	 * If this is a speculative insert then set the corresponding bit.
+	 * Otherwise, if we have speculative insert bit set and this is spec
+	 * confirm record then clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+		can_stream = true;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/*
+	 * If streaming is enabled, and this transaction was serialized earlier
+	 * because it had an incomplete tuple, and the tuple is now complete,
+	 * we can stream it right away.
+	 */
+	if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn) &&
+		!rbtxn_has_toast_insert(toptxn) && !rbtxn_has_spec_insert(toptxn))
+	{
+		ReorderBufferStreamTXN(rb, toptxn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
+	}
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -701,7 +749,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1477,6 +1525,13 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
+	/* remove entries spilled to disk */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
 	/* also reset the number of entries in the transaction */
 	txn->nentries_mem = 0;
 	txn->nentries = 0;
@@ -1905,8 +1960,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						Assert(change->data.tp.newtuple != NULL);
 
 						dlist_delete(&change->node);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
 					}
 
 			change_done:
@@ -2497,7 +2552,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2546,7 +2601,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2569,6 +2624,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2583,8 +2639,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
-	/* if subxact, and streaming supported, use the toplevel instead */
+	/*
+	 * If streaming is supported, also keep track of the toplevel
+	 * transaction (for a subxact, its parent), so that we can update its
+	 * total size.
+	 */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2592,12 +2653,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2658,7 +2727,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2845,15 +2914,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and doesn't have incomplete
+		 * data, remember it.
+		 */
+		if (((!largest) || (txn->total_size > largest->total_size)) &&
+			((txn->total_size > 0) && !rbtxn_has_toast_insert(txn) &&
+			 !rbtxn_has_spec_insert(txn)))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2871,66 +2941,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/*
+	 * Loop until we are under the memory limit. A single eviction may not
+	 * be enough, because transactions with incomplete tuples are skipped
+	 * when picking a candidate for streaming.
+	 */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
-		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
-		 */
-		txn = ReorderBufferLargestTXN(rb);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferSerializeTXN(rb, txn);
+		/*
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
+		 */
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 34f93d600b..8ebd6c6755 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes include a toast insert, without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)
+
+/*
+ * This transaction's changes include a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -355,6 +364,9 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -546,7 +558,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool incomplete_data);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0
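
As a side note on the new RBTXN_HAS_TOAST_INSERT / RBTXN_HAS_SPEC_INSERT
flags above: they mark a transaction as temporarily unstreamable because
its latest decoded change is a toast (or speculative) insert whose
main-table change hasn't been decoded yet, which is why
ReorderBufferLargestTopTXN now skips such transactions. A minimal sketch
of a statement that opens the toast window, with a hypothetical table
that has a text column large enough to be toasted:

BEGIN;
-- the toast chunk inserts for this row are decoded before the
-- main-table insert, so the transaction briefly holds toast inserts
-- without the corresponding main-table change
INSERT INTO toasted_tab VALUES (1, repeat('x', 1000000));
COMMIT;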

v21-0007-Track-statistics-for-streaming.patch (application/octet-stream)
From 94be15a2f7893f7decafaa4a20eff194778a7194 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Fri, 15 May 2020 14:07:04 +0530
Subject: [PATCH v21 07/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 33 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 13 ++++++++
 src/backend/replication/walsender.c           | 32 +++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 99 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 87502a49b6..5b64410e7d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2404,6 +2404,39 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Amount of decoded transaction data spilled to disk.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_txns</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of in-progress transactions streamed to the subscriber after
+       memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+       Streaming only works with toplevel transactions (subtransactions can't
+       be streamed independently), so the counter does not get incremented for
+       subtransactions.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times in-progress transactions were streamed to the subscriber.
+       Transactions may get streamed repeatedly, and this counter gets incremented
+       on every such invocation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_bytes</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Amount of decoded in-progress transaction data streamed to the subscriber.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2bd5f5ea14..8f34ce8deb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e7249f874d..c2a012dd3c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -332,6 +332,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3323,6 +3327,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count an already-streamed transaction again. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 6def1b96c9..5d23691930 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1359,7 +1359,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1380,7 +1380,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or streamed to
+	 * subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2429,6 +2430,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3264,7 +3268,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3322,6 +3326,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3347,6 +3354,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3449,6 +3459,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3697,11 +3712,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9edae40ed8..5a8826cc67 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index da32fbfd1c..34f93d600b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -518,15 +518,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 8876025aaa..0c4952a1fa 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0
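
Once this patch is applied, the new counters appear next to the existing
spill statistics; a quick way to observe them might be (a sketch):

SELECT application_name,
       spill_txns, spill_count, spill_bytes,
       stream_txns, stream_count, stream_bytes
FROM pg_stat_replication;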

v21-0012-Add-streaming-option-in-pg_dump.patch (application/octet-stream)
From 1131fbc71b84045553bd90333dd70f9d3b9ae1dd Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v21 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index f33c2463a7..b6ae988b02 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4209,6 +4209,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4243,8 +4244,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4257,6 +4258,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4273,6 +4275,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4350,6 +4353,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBufferStr(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5f70400b25..3ccb6be953 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char	   *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0
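
With this patch, a subscription created with the streaming option keeps
it across dump/restore; the emitted command would look roughly like this
(a sketch; subscription, connection, and publication names are
hypothetical):

CREATE SUBSCRIPTION sub1 CONNECTION 'host=... dbname=...' PUBLICATION pub1
    WITH (connect = false, slot_name = 'sub1', streaming = on);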

v21-0011-Provide-new-api-to-get-the-streaming-changes.patch (application/octet-stream)
From 15d57fc08ff343394b7f0c2a1251d5ec12a518ed Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v21 11/12] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 4 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8f34ce8deb..dd488cb2f8 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1242,6 +1242,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e848..70c28ffa91 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes then disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5a8826cc67..586c9621e2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10115,6 +10115,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0
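
A minimal invocation of the function added above might look like this
(slot name hypothetical; unlike pg_logical_slot_get_changes, streaming
of in-progress transactions stays enabled here):

SELECT lsn, xid, data
FROM pg_logical_slot_get_streaming_changes('regression_slot', NULL, NULL);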

#302Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#298)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

v20-0003-Extend-the-output-plugin-API-with-stream-methods
----------------------------------------------------------------------------------------
1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+   int nrelations, Relation relations[],
+   ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

In the above and similar APIs, there are parameters like relation
which are not used. I think you should add some comments atop these
APIs to explain why it is so? I guess it is because we want to keep
them similar to non-stream version of APIs and we can't display
relation or other information as the transaction is still in-progress.

I think the interfaces are designed that way because other decoding
plugins might need those parameters, e.g. in pgoutput we need change
and relation, but not here. We have other similar examples too, e.g.
pg_decode_message has the parameter txn but doesn't use it. Do you
think we still need to add comments?

In that case, we can leave it, but let's ensure that we are not
exposing any parameter which is not used, and if there is any for some
reason, we should document it. I will also look into this.

Ok

4.
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_start";
+ /* state.report_location = apply_lsn; */

Why can't we supply the report_location here? I think here we need to
report txn->first_lsn if this is the very first stream and
txn->final_lsn if it is any consecutive one.

I am not sure about this, because for the very first stream we will
report the location of the first lsn of the stream, and for a
consecutive stream we will report the last lsn in the stream.

Yeah, that doesn't seem to be consistent. How about if get it as an
additional parameter? The caller can pass the lsn of the very first
change it is trying to decode in this stream.

Done

11.
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
*/
static void
ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
*rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;

- if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
- return;
-

I don't understand this change. Why would "INSERT followed by
TRUNCATE" could lead to a tuple which can come for decode before its
CID?

Actually, even if we haven't decoded the DDL operation yet, the tuple
might already have been deleted from the system table by the next
operation. E.g., while we are streaming the INSERT it is possible that
the TRUNCATE has already deleted that tuple and set its cmax. Before
the streaming patch, we were streaming the INSERT only on commit, so
by that time we had seen all the operations that made DDL changes and
we would have already prepared the tuple CID hash.

Okay, but for that case, how good is it that we always allow the CID
hash table to be built even if there are no catalog changes in the TXN
(see changes in ReorderBufferBuildTupleCidHash)? Can't we detect that
while resolving the cmin/cmax?

Done
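
For context, the scenario discussed here is an in-progress transaction
along these lines (a sketch; table name hypothetical):

BEGIN;
-- large enough to exceed logical_decoding_work_mem and get streamed
INSERT INTO t SELECT generate_series(1, 100000);
-- not decoded yet while the INSERT is still being streamed, but it has
-- already updated the catalog tuples (setting their cmax)
TRUNCATE t;
COMMIT;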

Few more comments for v20-0005-Implement-streaming-mode-in-ReorderBuffer:
----------------------------------------------------------------------------------------------------------------
1.
/*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
*/
static int
ReorderBufferIterCompare(Datum a, Datum b, void *arg)

It seems to me the above comment change is not required as per the latest patch.

Done

2.
* For subtransactions, we only mark them as streamed when there are
+ * any changes in them.
+ *
+ * We do it this way because of aborts - we don't want to send aborts
+ * for XIDs the downstream is not aware of. And of course, it always
+ * knows about the toplevel xact (we send the XID in all messages),
+ * but we never stream XIDs of empty subxacts.
+ */
+ if ((!txn->toptxn) || (txn->nentries_mem != 0))
+ txn->txn_flags |= RBTXN_IS_STREAMED;

/when there are any changes in them/when there are changes in them. I
think we don't need 'any' in the above sentence.

Done

3.
And, during catalog scan we can check the status of the xid and
+ * if it is aborted we will report a specific error that we can ignore.  We
+ * might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the abort we will
+ * stream abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)

In the above comment, I don't think it is right to say that we ignore
the error raised due to the aborted transaction. We need to say that
we discard the already streamed changes on such an error.

Done.

4.
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
/*
- * If this transaction has no snapshot, it didn't make any changes to the
- * database, so there's nothing to decode.  Note that
- * ReorderBufferCommitChild will have transferred any snapshots from
- * subtransactions if there were any.
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.
*/
- if (txn->base_snapshot == NULL)
+ if (!TransactionIdDidCommit(xid))
{
- Assert(txn->ninvalidations == 0);
- ReorderBufferCleanupTXN(rb, txn);
- return;
+ CheckXidAlive = xid;
+ bsysscan = false;
}

I think this function is inline as it needs to be called for each
change. If that is the case and otherwise also, isn't it better that
we check if passed xid is the same as CheckXidAlive before checking
TransactionIdDidCommit as TransactionIdDidCommit can be costly and
calling it for each change might not be a good idea?

Done. Also, I think it is better to check TransactionIdIsInProgress
instead of !TransactionIdDidCommit. I have changed that as well.

5.
setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.

/if the xid aborted/if the xid is aborted. missing comma after Also.

Done

6.
ReorderBufferProcessTXN()
{
..
- /* build data to be able to lookup the CommandIds of catalog tuples */
+ /*
+ * build data to be able to lookup the CommandIds of catalog tuples
+ */
ReorderBufferBuildTupleCidHash(rb, txn);
..
}

Is there a need to change the formatting of the comment?

No need; changed it back.

7.
ReorderBufferProcessTXN()
{
..
if (using_subtxn)
- BeginInternalSubTransaction("replay");
+ BeginInternalSubTransaction("stream");
else
StartTransactionCommand();
..
}

I am not sure changing unconditionally "replay" to "stream" is a good
idea. How about something like BeginInternalSubTransaction(streaming
? "stream" : "replay");?

Done

8.
@@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
* use as a normal record. It'll be cleaned up at the end
* of INSERT processing.
*/
- if (specinsert == NULL)
- elog(ERROR, "invalid ordering of speculative insertion changes");

You have removed this check but all other handling of specinsert is
same as far as this patch is concerned. Why so?

Seems like a merge issue, or a leftover from the old design of the
toast handling where we were streaming with the partial tuple. Fixed
now.
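
For context, the speculative-insertion changes in question are the ones
produced by INSERT ... ON CONFLICT, e.g. this sketch with a
hypothetical table having a unique index on id:

INSERT INTO t VALUES (1, 'x') ON CONFLICT (id) DO NOTHING;

Streaming must not kick in between the speculative insert and its
confirm record, which is what the RBTXN_HAS_SPEC_INSERT flag guards
against.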

9.
@@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
* freed/reused while restoring spooled data from
* disk.
*/
- Assert(change->data.tp.newtuple != NULL);
-
dlist_delete(&change->node);

Why is this Assert removed?

Same cause as above, so fixed.

10.
@@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relations[nrelations++] = relation;
}

- rb->apply_truncate(rb, txn, nrelations, relations, change);
+ if (streaming)
+ {
+ rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+ /* Remember that we have sent some data. */
+ change->txn->any_data_sent = true;
+ }
+ else
+ rb->apply_truncate(rb, txn, nrelations, relations, change);

Can we encapsulate this in a separate function like
ReorderBufferApplyTruncate or something like that? Basically, rather
than having streaming check in this function, lets do it in some other
internal function. And we can likewise do it for all the streaming
checks in this function or at least whereever it is feasible. That
will make this function look clean.

Done for truncate and change. I think we can create a few more such
functions for start/stop and for cleanup handling on error. I will
work on that.

11.
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
{
..

I think the above comment needs to be updated after this patch. This
API can now be used during the decode of both a in-progress and a
committed transaction.

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#303Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#301)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

6. I think it will be good if we can provide an example of streaming
changes via test_decoding at
https://www.postgresql.org/docs/devel/test-decoding.html. I think we
can also explain there why the user is not expected to see the actual
data in the stream.

I have a few problems to solve here.
- With a streaming transaction, shall we also show the actual values,
or shall we do what is currently done in the patch
(appendStringInfo(ctx->out, "streaming change for TXN %u",
txn->xid);)? I think we should show the actual values instead of what
we are doing now.

I think why we don't want to display the tuple at this stage is
because it is not clear by this time if the transaction will commit or
abort. I am not sure if displaying the contents of aborted
transactions is a good idea but if there is a reason for doing so, we
can do it later as well.

- In the example we cannot show a real example: to show the changes
of an in-progress transaction, we might have to insert a lot of
tuples. I think we can show partial output?

I think we can display what the API will actually display; what is
the confusion here?

I have a few more comments on the previous version of patch
v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed
any, then leave those and fix others.

Review comments:
------------------------------
1.
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
TransactionId xid,
}

  case REORDER_BUFFER_CHANGE_MESSAGE:
- rb->message(rb, txn, change->lsn, true,
- change->data.msg.prefix,
- change->data.msg.message_size,
- change->data.msg.message);
+ if (streaming)
+ rb->stream_message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);
+ else
+ rb->message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);

Don't we need to set any_data_sent flag while streaming messages as we
do for other types of changes?

2.
+ if (streaming)
+ {
+ /*
+ * Set the last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ if (!XLogRecPtrIsInvalid(prev_lsn))
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

I am not sure if it is good to use final_lsn for this purpose. See
comments for this variable in reorderbuffer.h. Basically, it is used
for a specific purpose on different occasions. Now, if we want to
start using it for a new purpose, we need to study its interaction
with all other places and update the comments as well. Can we pass an
additional parameter to stream_stop() instead?

3.
+ /* remember the command ID and snapshot for the streaming run */
+ txn->command_id = command_id;
+
+ /* Avoid copying if it's already copied. */
+ if (snapshot_now->copied)
+ txn->snapshot_now = snapshot_now;
+ else
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+   txn, command_id);

This code is used at two different places, can we try to keep this in
a single function.

4.
In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
the try and catch block. If there is an error after calling it in a
try block, we might call it again via catch. I think that will lead
to sending a stop message twice. Won't that be a problem? See the
usage of iterstate in the catch block, we have made it safe from a
similar problem.

5.
+ if (streaming)
+ {
+ /* Discard the changes that we just streamed. */
+ ReorderBufferTruncateTXN(rb, txn);
- PG_RE_THROW();
+ /* Re-throw only if it's not an abort. */
+ if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
+ else
+ {
+ FlushErrorState();
+ FreeErrorData(errdata);
+ errdata = NULL;
+

I think here we can write few comments on why we are doing error-code
specific handling, basically, explain a bit about concurrent abort
handling and or refer to the part of comments where it is explained.

6.
PG_CATCH();
  {
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData  *errdata = CopyErrorData();

I don't understand the usage of memory context in this part of the
code. Basically, you are switching to CurrentMemoryContext here, do
some error handling and then again reset back to some random context
before rethrowing the error. If there is some purpose for it, then it
might be better if you can write a few comments to explain the same.

7.
+ReorderBufferCommit()
{
..
+ /*
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
+ *
+ * XXX Called after everything (origin ID and LSN, ...) is stored in the
+ * transaction, so we don't pass that directly.
+ *
+ * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+ */
+ if (rbtxn_is_streamed(txn))
+ {
+ ReorderBufferStreamCommit(rb, txn);
+ return;
+ }
+
..
}

"XXX Somewhat hackish redirection, perhaps needs to be refactored?"
What kind of refactoring we can do here? To me, it looks okay.

8.
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

  txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * TOCHECK: Mark toplevel transaction as having catalog changes too
+ * if one of its children has.
+ */
+ if (txn->toptxn != NULL)
+ txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }

Why are we marking top transaction here?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#304Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#303)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

6. I think it will be good if we can provide an example of streaming
changes via test_decoding at
https://www.postgresql.org/docs/devel/test-decoding.html. I think we
can also explain there why the user is not expected to see the actual
data in the stream.

I have a few problems to solve here.
- With a streaming transaction, shall we also show the actual values,
or shall we do what is currently done in the patch
(appendStringInfo(ctx->out, "streaming change for TXN %u",
txn->xid);)? I think we should show the actual values instead of what
we are doing now.

I think why we don't want to display the tuple at this stage is
because it is not clear by this time if the transaction will commit or
abort. I am not sure if displaying the contents of aborted
transactions is a good idea but if there is a reason for doing so, we
can do it later as well.

Ok.

- In the example we cannot show a real example: to show the changes
of an in-progress transaction, we might have to insert a lot of
tuples. I think we can show partial output?

I think we can display what the API will actually display; what is
the confusion here?

What I meant is that even with logical_decoding_work_mem=64kB, we
need quite a few changes in a transaction to stream it, so the example
output will be quite big. So I said we might not show a real example;
instead we would just show a few lines and cut the rest. But I got
your point: we can just show how it will look.
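
For instance, something along these lines is enough to trigger
streaming, which is why the full output is too long to show (a sketch;
table name hypothetical):

SET logical_decoding_work_mem = '64kB';
BEGIN;
INSERT INTO stream_test SELECT repeat('a', 2000) || g.i
  FROM generate_series(1, 500) g(i);
-- consuming the slot now yields hundreds of output rows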

I have a few more comments on the previous version of patch
v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed
any, then leave those and fix others.

Review comments:
------------------------------
1.
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
TransactionId xid,
}

case REORDER_BUFFER_CHANGE_MESSAGE:
- rb->message(rb, txn, change->lsn, true,
- change->data.msg.prefix,
- change->data.msg.message_size,
- change->data.msg.message);
+ if (streaming)
+ rb->stream_message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);
+ else
+ rb->message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);

Don't we need to set any_data_sent flag while streaming messages as we
do for other types of changes?

Actually, the pgoutput plugin doesn't send any data on
stream_message. But I agree we need to consider how other plugins will
handle it. I will analyze this part again; maybe we need such a flag
at the plugin level, and whether a stop is sent or not can also be
handled at the plugin level.

2.
+ if (streaming)
+ {
+ /*
+ * Set the last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ if (!XLogRecPtrIsInvalid(prev_lsn))
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

I am not sure if it is good to use final_lsn for this purpose. See
comments for this variable in reorderbuffer.h. Basically, it is used
for a specific purpose on different occasions. Now, if we want to
start using it for a new purpose, we need to study its interaction
with all other places and update the comments as well. Can we pass an
additional parameter to stream_stop() instead?

I think it was in sync with the spill code, right? I mean, the last
change we spill is set as the final_lsn, and the same is done here.

Other comments look fine, so I will work on them and reply separately.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#305Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#304)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 15, 2020 at 4:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

- In the example we cannot show a real example: to show the changes
of an in-progress transaction, we might have to insert a lot of
tuples. I think we can show partial output?

I think we can display what the API will actually display; what is
the confusion here?

What I meant is that even with logical_decoding_work_mem=64kB, we
need quite a few changes in a transaction to stream it, so the example
output will be quite big. So I said we might not show a real example;
instead we would just show a few lines and cut the rest. But I got
your point: we can just show how it will look.

Right.

I have a few more comments on the previous version of patch
v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed
any, then leave those and fix others.

Review comments:
------------------------------
1.
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
TransactionId xid,
}

case REORDER_BUFFER_CHANGE_MESSAGE:
- rb->message(rb, txn, change->lsn, true,
- change->data.msg.prefix,
- change->data.msg.message_size,
- change->data.msg.message);
+ if (streaming)
+ rb->stream_message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);
+ else
+ rb->message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);

Don't we need to set any_data_sent flag while streaming messages as we
do for other types of changes?

Actually, the pgoutput plugin doesn't send any data on
stream_message. But I agree we need to consider how other plugins will
handle it. I will analyze this part again; maybe we need such a flag
at the plugin level, and whether a stop is sent or not can also be
handled at the plugin level.

Okay, let's discuss this after your analysis.

2.
+ if (streaming)
+ {
+ /*
+ * Set the last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ if (!XLogRecPtrIsInvalid(prev_lsn))
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

I am not sure if it is good to use final_lsn for this purpose. See
comments for this variable in reorderbuffer.h. Basically, it is used
for a specific purpose on different occasions. Now, if we want to
start using it for a new purpose, we need to study its interaction
with all other places and update the comments as well. Can we pass an
additional parameter to stream_stop() instead?

I think it was in sync with the spill code, right? I mean, the last
change we spill is set as the final_lsn, and the same is done here.

But we use final_lsn in ReorderBufferRestoreCleanup() for serialized
changes. Now, in some cases, if we first do serialization, then
perform streaming, and then try to call ReorderBufferRestoreCleanup(),
it might not work as intended. This might not happen today, but I
don't think we have any protection to avoid it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#306Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#305)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 15, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 15, 2020 at 4:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

- In the example we cannot show a real example: to show the changes
of an in-progress transaction, we might have to insert a lot of
tuples. I think we can show partial output?

I think we can display what the API will actually display; what is
the confusion here?

What I meant is that even with logical_decoding_work_mem=64kB, we
need quite a few changes in a transaction to stream it, so the example
output will be quite big. So I said we might not show a real example;
instead we would just show a few lines and cut the rest. But I got
your point: we can just show how it will look.

Right.

I have a few more comments on the previous version of patch
v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed
any, then leave those and fix others.

Review comments:
------------------------------
1.
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
TransactionId xid,
}

case REORDER_BUFFER_CHANGE_MESSAGE:
- rb->message(rb, txn, change->lsn, true,
- change->data.msg.prefix,
- change->data.msg.message_size,
- change->data.msg.message);
+ if (streaming)
+ rb->stream_message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);
+ else
+ rb->message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);

Don't we need to set any_data_sent flag while streaming messages as we
do for other types of changes?

Actually, the pgoutput plugin doesn't send any data on
stream_message. But I agree we need to consider how other plugins will
handle it. I will analyze this part again; maybe we need such a flag
at the plugin level, and whether a stop is sent or not can also be
handled at the plugin level.

Okay, let's discuss this after your analysis.

2.
+ if (streaming)
+ {
+ /*
+ * Set the last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ if (!XLogRecPtrIsInvalid(prev_lsn))
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

I am not sure if it is good to use final_lsn for this purpose. See
comments for this variable in reorderbuffer.h. Basically, it is used
for a specific purpose on different occasions. Now, if we want to
start using it for a new purpose, we need to study its interaction
with all other places and update the comments as well. Can we pass an
additional parameter to stream_stop() instead?

I think it was in sync with the spill code, right? I mean, the last
change we spill is set as the final_lsn, and the same is done here.

But we use final_lsn in ReorderBufferRestoreCleanup() for serialized
changes. Now, in some cases, if we first do serialization, then
perform streaming, and then try to call ReorderBufferRestoreCleanup(),
it might not work as intended. This might not happen today, but I
don't think we have any protection to avoid it.

If streaming is complete then we will remove the serialize flag, so it
will not cause any issue. However, we can avoid setting final_lsn here
and instead pass a parameter to stream_stop with the last lsn of the
stream.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#307Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#303)
12 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

6. I think it will be good if we can provide an example of streaming
changes via test_decoding at
https://www.postgresql.org/docs/devel/test-decoding.html. I think we
can also explain there why the user is not expected to see the actual
data in the stream.

I have a few problems to solve here.
- With a streaming transaction, shall we also show the actual values,
or shall we do what is currently done in the patch
(appendStringInfo(ctx->out, "streaming change for TXN %u",
txn->xid);)? I think we should show the actual values instead of what
we are doing now.

I think why we don't want to display the tuple at this stage is
because it is not clear by this time if the transaction will commit or
abort. I am not sure if displaying the contents of aborted
transactions is a good idea but if there is a reason for doing so, we
can do it later as well.

- In the example we cannot show a real example: to show the changes
of an in-progress transaction, we might have to insert a lot of
tuples. I think we can show partial output?

I think we can display what the API will actually display; what is
the confusion here?

I have added an example in the v22-0011 patch, where I have added the
API to get the streaming changes; an abridged sketch of it is below.
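
It boils down to something like this (slot name and xid are
hypothetical; the per-row text comes from test_decoding's stream
callbacks quoted earlier in the thread):

SELECT data FROM pg_logical_slot_get_streaming_changes('regression_slot', NULL, NULL);
             data
------------------------------
 ...
 streaming change for TXN 508
 streaming change for TXN 508
 ...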

I have a few more comments on the previous version of patch
v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed
any, then leave those and fix others.

Review comments:
------------------------------
1.
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
TransactionId xid,
}

case REORDER_BUFFER_CHANGE_MESSAGE:
- rb->message(rb, txn, change->lsn, true,
- change->data.msg.prefix,
- change->data.msg.message_size,
- change->data.msg.message);
+ if (streaming)
+ rb->stream_message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);
+ else
+ rb->message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);

Don't we need to set any_data_sent flag while streaming messages as we
do for other types of changes?

I think any_data_sent was added to avoid sending an abort to the
subscriber if we haven't sent any data, but this is not complete, as
the output plugin can also decide not to send anything. So I think
this should not be done as part of this patch and can be done
separately. I think there is already a thread for handling the
same [1].

2.
+ if (streaming)
+ {
+ /*
+ * Set the last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ if (!XLogRecPtrIsInvalid(prev_lsn))
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

I am not sure if it is good to use final_lsn for this purpose. See
comments for this variable in reorderbuffer.h. Basically, it is used
for a specific purpose on different occasions. Now, if we want to
start using it for a new purpose, we need to study its interaction
with all other places and update the comments as well. Can we pass an
additional parameter to stream_stop() instead?

Done

3.
+ /* remember the command ID and snapshot for the streaming run */
+ txn->command_id = command_id;
+
+ /* Avoid copying if it's already copied. */
+ if (snapshot_now->copied)
+ txn->snapshot_now = snapshot_now;
+ else
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+   txn, command_id);

This code is used at two different places, can we try to keep this in
a single function.

Done

4.
In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
the try and catch block. If there is an error after calling it in a
try block, we might call it again via catch. I think that will lead
to sending a stop message twice. Won't that be a problem? See the
usage of iterstate in the catch block, we have made it safe from a
similar problem.

IMHO we don't need that, because we only call stream_stop in the
catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK. So if
we have already stopped the stream in the TRY block, then we should
not get that error. I have added comments for the same.

5.
+ if (streaming)
+ {
+ /* Discard the changes that we just streamed. */
+ ReorderBufferTruncateTXN(rb, txn);
- PG_RE_THROW();
+ /* Re-throw only if it's not an abort. */
+ if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
+ else
+ {
+ FlushErrorState();
+ FreeErrorData(errdata);
+ errdata = NULL;
+

I think we can write a few comments here on why we are doing
error-code-specific handling: basically, explain a bit about the
concurrent abort handling, and/or refer to the part of the comments
where it is explained.

Done

6.
PG_CATCH();
{
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData  *errdata = CopyErrorData();

I don't understand the usage of memory contexts in this part of the
code. Basically, you are switching to CurrentMemoryContext here, doing
some error handling, and then resetting back to some seemingly random
context before rethrowing the error. If there is some purpose to it,
then it might be better if you can write a few comments to explain it.

Basically, ccxt is the CurrentMemoryContext when we started the
streaming, and ecxt is the context when we caught the error. So,
before this change, it would rethrow in the context in which we caught
the error, i.e. ecxt. What we are trying to do is switch back to the
normal context (ccxt) and copy the error data in the normal context.
And, if we are not handling it gracefully, we put it back in the
context it was in, and rethrow.
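
Put differently, the pattern is roughly this (a simplified sketch of
the control flow):

PG_CATCH();
{
	/* switch back to the context we were in when streaming started */
	MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
	ErrorData  *errdata = CopyErrorData();	/* copied into ccxt */

	if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
	{
		/* concurrent abort: clean up and continue gracefully */
		FlushErrorState();
		FreeErrorData(errdata);
	}
	else
	{
		/* not ours to handle: restore the error context and rethrow */
		MemoryContextSwitchTo(ecxt);
		PG_RE_THROW();
	}
}
PG_END_TRY();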

7.
+ReorderBufferCommit()
{
..
+ /*
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
+ *
+ * XXX Called after everything (origin ID and LSN, ...) is stored in the
+ * transaction, so we don't pass that directly.
+ *
+ * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+ */
+ if (rbtxn_is_streamed(txn))
+ {
+ ReorderBufferStreamCommit(rb, txn);
+ return;
+ }
+
..
}

"XXX Somewhat hackish redirection, perhaps needs to be refactored?"
What kind of refactoring can we do here? To me, it looks okay.

It looks fine to me too, so I have removed this comment.

8.
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * TOCHECK: Mark toplevel transaction as having catalog changes too
+ * if one of its children has.
+ */
+ if (txn->toptxn != NULL)
+ txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}

Why are we marking top transaction here?

We need to mark the top transaction to decide whether to build the
tuplecid hash or not. In non-streaming mode, we only send changes at
commit time, and at commit time we know whether the top transaction
has any catalog changes based on the invalidation messages, so we mark
the top transaction there in DecodeCommit. Since here we are not
waiting until commit, we need to mark the top transaction as soon as
we mark any of its child transactions.
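
With the flag propagated to the toplevel, the decision at streaming
time can be as simple as this sketch (using the existing
rbtxn_has_catalog_changes() test; the exact call site may differ):

/* build the tuplecid hash only when catalog lookups will need it */
if (rbtxn_has_catalog_changes(txn))
	ReorderBufferBuildTupleCidHash(rb, txn);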

[1]: /messages/by-id/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v22-0001-Immediately-WAL-log-assignments.patch
From 63bed6b2ed7844dd78eb2934b572466cfc671284 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v22 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is
required to avoid overflow of the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 39 ++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd30b62d36..3af8e81af1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09e..53be2b3059 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5995798b58..560ec27fa0 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1195,6 +1195,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1233,6 +1234,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..122c581d0f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel xid is valid, we need to assign the subxact to the
+	 * toplevel xact. We need to do this for all records, hence we do it before
+	 * the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04babc2..8645b3816c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d..26426cc779 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index c21b0ba972..83170a663c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -308,6 +310,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0
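
To summarize the effect of this patch, the assembled record layout now
looks roughly like this (a simplified view derived from the
XLogRecordAssemble() changes above, not an exact byte map):

XLogRecord header
  per-block headers (block_id 0 .. XLR_MAX_BLOCK_ID)
  XLR_BLOCK_ID_ORIGIN (253)       + RepOriginId    (if an origin is set)
  XLR_BLOCK_ID_TOPLEVEL_XID (252) + TransactionId  (first record after a
                                                    subxact gets its XID)
  XLR_BLOCK_ID_DATA_SHORT/LONG    + main data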

v22-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
From 7ab0c5ad39273966eb6d5d71807d49af03eca057 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v22 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. The decoding logic on the receipt
of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 1b56daa4bb..5f7394f3c1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,9 +432,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94eb37d48d..2d77107c4f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * tableam level API but this is called from many places so we need to
+	 * ensure it here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out, if CheckXidAlive is aborted. We can't directly use
+ * TransactionIdDidAbort as after crash such transaction might not have been
+ * marked as aborted.  See detailed comments at snapmgr.c where the variable
+ * is declared.
+ */
+static inline void
+HandleConcurrentAbort()
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the sysbegin_called flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..2f52b407c6 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..9f1ecd123f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing in
+ * which case we skip decoding that particular transaction.  To ensure that we
+ * check whether the CheckXidAlive is aborted after fetching the tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs because we are checking for the
+ * concurrent aborts only in systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8c34935c34..9d890d3c4b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0
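
To see how these pieces fit together, here is a rough sketch of the
decoding-side control flow this enables (simplified; the function
names marked hypothetical are placeholders, not the patch's API):

/* before replaying changes of a still in-progress transaction */
CheckXidAlive = txn->xid;	/* arm the concurrent-abort checks */

PG_TRY();
{
	/*
	 * Output plugin callbacks run here.  Any systable_* scan they do
	 * calls HandleConcurrentAbort() after fetching a tuple and raises
	 * ERRCODE_TRANSACTION_ROLLBACK if the xid aborted concurrently.
	 */
	ReorderBufferProcessChanges(rb, txn);	/* hypothetical name */
}
PG_CATCH();
{
	/* treat ERRCODE_TRANSACTION_ROLLBACK as "stop decoding this xact" */
	HandleRollbackGracefully();		/* hypothetical name */
}
PG_END_TRY();

CheckXidAlive = InvalidTransactionId;	/* disarm */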

v22-0002-Issue-individual-invalidations-with-wal_level-lo.patch
From 44afef284200ccb48460ff7d819930858658a466 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v22 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations was accumulating all the invalidations in
memory and then wrote them only once, at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c        |  40 +++++++
 src/backend/access/transam/xact.c             |   7 ++
 src/backend/replication/logical/decode.c      |  16 +++
 .../replication/logical/reorderbuffer.c       | 104 +++++++++++++++---
 src/backend/utils/cache/inval.c               |  49 +++++++++
 src/include/access/xact.h                     |  13 ++-
 src/include/replication/reorderbuffer.h       |  11 ++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce75565f..7ab0d11ea9 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3af8e81af1..e576b10055 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581d0f..69c1f45ef6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9509..b889edf461 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2202,6 +2216,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2589,6 +2630,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3002,6 +3066,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..cba5b6c64b 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transaction.  Until now it was
+ *	enough to log invalidations only at commit, because we only decoded the
+ *	transaction at commit time.  We only need to log the catalog cache and
+ *	relcache invalidations.  There cannot be any active MVCC scan in logical
+ *	decoding, so we don't need to log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs  = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b3816c..b822c5e4b2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..af35287896 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.23.0
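
For intuition, with wal_level=logical a multi-command transaction now
produces roughly this WAL sequence (a sketch, not exact record names):

...WAL for command 1...
XLOG_XACT_INVALIDATIONS   (command 1's invalidations, logged at command end)
...WAL for command 2...
XLOG_XACT_INVALIDATIONS   (command 2's invalidations)
...
commit record             (still carries the accumulated invalidations,
                           so redo and non-streaming decoding are unchanged)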

v22-0003-Extend-the-output-plugin-API-with-stream-methods.patch
From 3ea54d1caab9237a87629cb58a1f5be0e1bc9fb0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v22 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 213 +++++++++++++
 src/backend/replication/logical/logical.c | 365 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 811 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe620..1b56daa4bb 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and
+    <function>stream_truncate_cb</function>).
+   </para>
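+
+   <para>
+    As a minimal sketch (not part of this patch), an output plugin could
+    register the streaming callbacks in its initialization function roughly
+    like this, where the <function>my_*</function> functions are hypothetical
+    implementations provided by the plugin:
+<programlisting>
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+	/* regular (commit-time) callbacks */
+	cb->begin_cb = my_begin;
+	cb->change_cb = my_change;
+	cb->commit_cb = my_commit;
+
+	/* required streaming callbacks */
+	cb->stream_start_cb = my_stream_start;
+	cb->stream_stop_cb = my_stream_stop;
+	cb->stream_change_cb = my_stream_change;
+	cb->stream_commit_cb = my_stream_commit;
+	cb->stream_abort_cb = my_stream_abort;
+
+	/* optional streaming callbacks */
+	cb->stream_message_cb = my_stream_message;
+	cb->stream_truncate_cb = my_stream_truncate;
+}
+</programlisting>
+   </para>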
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
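+
+   <para>
+    For illustration (a hypothetical sequence, not produced by any particular
+    workload), blocks of two streamed transactions may be interleaved, with
+    one of them eventually aborting:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of a block for transaction #1
+  stream_change_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of the block for transaction #1
+
+stream_start_cb(...);   &lt;-- start of a block for transaction #2
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of the block for transaction #2
+
+stream_start_cb(...);   &lt;-- another block for transaction #1
+  stream_change_cb(...);
+stream_stop_cb(...);
+
+stream_commit_cb(...);  &lt;-- commit of transaction #1
+stream_abort_cb(...);   &lt;-- abort of transaction #2
+</programlisting>
+   </para>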
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting. At that point the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.  However, in some
+    cases we still have to spill to disk even if streaming is enabled,
+    because we may exceed the memory limit before having decoded a complete
+    tuple (e.g. when only the TOAST table insert has been decoded, but not
+    yet the corresponding main table insert).
+   </para>
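+
+   <para>
+    For example (an arbitrary, illustrative value), the limit may be raised
+    in <filename>postgresql.conf</filename> to allow larger transactions to
+    be kept in memory before streaming kicks in:
+<programlisting>
+logical_decoding_work_mem = 256MB
+</programlisting>
+   </para>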
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index dc69e5ce5f..f49c48a34a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the start/stop/change/commit/abort
+	 * callbacks. The message and truncate callbacks are optional, similar
+	 * to regular output plugins. We however consider streaming enabled as
+	 * soon as at least one of the callbacks is defined, so that missing
+	 * (required) callbacks can be easily detected.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional, so
+	 * we do not fail with ERROR when they are missing; the wrappers then
+	 * simply do nothing. We must still set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there would crash (we don't
+	 * want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,321 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = txn->final_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287896..65814af9f5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
 struct ReorderBuffer
 {
 	/*
@@ -392,6 +440,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0

Attachment: v22-0005-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From ed1366dcc22fc18c48041d99be4d69ddbda6a1ce Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 May 2020 19:56:35 +0530
Subject: [PATCH v22 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes
we have in memory and invoke the new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic
as ReorderBufferCommit().  However, when we have an incomplete
toast or speculative insert we still spill to disk, because we
cannot generate (and stream) the complete tuple yet.  As soon as
we get the complete tuple, we stream the transaction including
the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with their toplevel xacts) in WAL right away,
and thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 741 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  31 +
 3 files changed, 732 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b889edf461..b41451662a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +382,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -772,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -865,6 +911,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -1024,6 +1073,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1090,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1315,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1340,6 +1404,80 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they originally happened inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1491,57 +1629,145 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
+ * Set xid for the concurrent abort check.
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has catalog updates, we might decode the tuple using the
+ * wrong catalog version.  So to detect a concurrent abort we set
+ * CheckXidAlive to the xid of the (sub)transaction to which the current
+ * change belongs.  During a catalog scan we can then check the status of
+ * that xid, and if it is aborted we report a specific error so that we can
+ * stop streaming the current transaction and discard the already streamed
+ * changes.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine, because when we decode the
+ * abort we will stream an abort message to truncate the changes on the
+ * subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
+	/*
+	 * If the input transaction id is already set as CheckXidAlive, there
+	 * is nothing to do.
+	 */
+	if (TransactionIdEquals(CheckXidAlive, xid))
+		return;
 
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * Setup CheckXidAlive only if it's in progress. We don't check if the xid
+	 * is aborted. That will happen during catalog access.  Also reset the
+	 * bsysscan flag.
 	 */
-	if (txn->base_snapshot == NULL)
+	if (TransactionIdIsInProgress(xid))
 	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
-		return;
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferProcessTXN to apply a change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to apply a truncate.
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to apply a message.
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Store the command id and snapshot at the end of the current stream, so
+ * that we can reuse them when sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send the data of a transaction (and its subtransactions) to the output
+ * plugin. If streaming is true, the data is sent using the streaming API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1564,21 +1790,43 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
-
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * If this is the first change in the current stream, start the
+			 * stream or begin the transaction.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+					rb->stream_start(rb, txn, change->lsn);
+				else
+					rb->begin(rb, txn);
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1655,7 +1903,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1695,7 +1944,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +2002,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +2014,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,7 +2045,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1858,14 +2106,29 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes; call the stream_stop callback for
+		 * a streaming transaction, and the commit callback otherwise.
+		 */
+		if (streaming)
+			rb->stream_stop(rb, txn, prev_lsn);
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * If the transaction is being streamed, remember the command ID and
+		 * snapshot; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2147,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,17 +2182,130 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		if (streaming)
+		{
+			/* Discard the changes that we just streamed. */
+			ReorderBufferTruncateTXN(rb, txn);
 
-		PG_RE_THROW();
+			/*
+			 * If the error code is ERRCODE_TRANSACTION_ROLLBACK, that means
+			 * we have detected a concurrent abort of the (sub)transaction we
+			 * are streaming.  So just do the cleanup and return gracefully.
+			 * Otherwise, re-throw the error.
+			 */
+			if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+			{
+				FlushErrorState();
+				FreeErrorData(errdata);
+				errdata = NULL;
+
+				/*
+				 * We can safely call stream_stop here without worrying about
+				 * whether the stream was already stopped in the TRY() block,
+				 * because once the stream has been stopped we cannot get the
+				 * ERRCODE_TRANSACTION_ROLLBACK error.
+				 */
+				rb->stream_stop(rb, txn, prev_lsn);
+
+				/*
+				 * Remember the command ID and snapshot for the streaming run.
+				 */
+				ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+				ReorderBufferToastReset(rb, txn);
+				if (specinsert != NULL)
+				{
+					ReorderBufferReturnChange(rb, specinsert);
+					specinsert = NULL;
+				}
+			}
+			else
+			{
+				MemoryContextSwitchTo(ecxt);
+				PG_RE_THROW();
+			}
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read, for both streamed
+ * and non-streamed transactions.  We iterate over the top and
+ * subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to process the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1946,6 +2330,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2406,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2548,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit; the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2566,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2578,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
 }
 
 /*
@@ -2211,6 +2628,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2713,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark the toplevel transaction as having catalog changes
+	 * too, if one of its children has them.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2398,6 +2823,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update memory account
+ * for subtransaction with streaming, so it's always 0). But we can simply
+ * iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
@@ -2418,15 +2875,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3234,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes to stream
+ * (e.g. it might have been streamed just before the commit, which then
+ * attempts to stream it again)? See the guard sketched after this function.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there's no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because new
+		 * sub-transactions may have started since the last streaming run,
+		 * and we need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Invoke the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
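
A minimal sketch of the guard hinted at by the XXX above, assuming an
already-streamed transaction with no new changes is detectable via its empty
change list and zero in-memory entry count (not part of the patch):

	/* nothing new to stream since the last run, so bail out (sketch) */
	if (txn->nentries_mem == 0 && dlist_is_empty(&txn->changes))
		return;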
+
 /*
  * Size of a change in memory.
  */
@@ -3864,6 +4448,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases we assume the CID is from a future command
+	 * and return it as unresolved.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 65814af9f5..b3e2b3f64b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions, in
+ * which case we'd have nentries==0 for the toplevel one, which says
+ * nothing about whether it was streamed. So we maintain this flag, but
+ * only for the toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so at most one
+ * of those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -224,6 +243,11 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -254,6 +278,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0

v22-0008-Enable-streaming-for-all-subscription-TAP-tests.patchapplication/octet-stream; name=v22-0008-Enable-streaming-for-all-subscription-TAP-tests.patchDownload
From 2dede993be1685bccb48995691a6c3361c02acf2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v22 08/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71ca3..086d0c7f02 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f871a..4ba80869b9 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

v22-0006-Add-support-for-streaming-to-built-in-replicatio.patchapplication/octet-stream; name=v22-0006-Add-support-for-streaming-to-built-in-replicatio.patchDownload
From 8fec42572d769275cc2501ee72a4f2a14b6ab45f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 14 May 2020 21:27:46 +0530
Subject: [PATCH v22 06/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, so it can identify in-progress
transactions, and allow adding additional bits of information (e.g. the
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so there is nowhere
to send the data anyway.
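
For illustration, the messages for one large transaction then flow roughly
like this on the wire (action bytes as defined in proto.c below; the exact
interleaving depends on when the memory limit is hit):

  'S'  STREAM START  (toplevel XID, first_segment flag)
  'R' / 'Y' / 'I' / 'U' / 'D' / 'T'  changes, each prefixed with its XID
  'E'  STREAM STOP
  ...  possibly more 'S' ... 'E' segments ...
  'c'  STREAM COMMIT  (or 'A' STREAM ABORT)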
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   12 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/launcher.c    |    1 -
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1033 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 ++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 21 files changed, 2041 insertions(+), 41 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8beadbaa..95b7c24ef9 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..3349cc4bfc 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026187..9065a1be1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e246be388b..90182a0181 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4139,6 +4139,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
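
With the streaming option enabled, the command sent to the walsender then
looks roughly like this (illustrative only; the slot name, proto_version
value and publication names are made up):

  START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
      (proto_version '1', streaming 'on', publication_names '"tap_pub"')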
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e987..8156a42ace 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..5242ac0efe 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID (for a streamed commit this must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction IDs (for a streamed abort these must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
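
A sketch of how the sending side (pgoutput, modified in this patch) is
expected to demarcate one streamed block using these functions; ctx->out,
change_xid and the relation/tuple variables are assumptions for the sake of
illustration, not code from the patch:

	/* open one block of streamed changes for this toplevel xact */
	logicalrep_write_stream_start(ctx->out, txn->xid, first_segment);

	/* each change carries the XID of its (sub)transaction */
	logicalrep_write_insert(ctx->out, change_xid, relation, newtuple);

	/* close the block; the apply worker flushes its spool file here */
	logicalrep_write_stream_stop(ctx->out);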
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..026cd48bd0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, we have to handle aborts of both
+ * the toplevel transaction and its subtransactions. This is achieved by
+ * tracking per-subtransaction offsets, which are then used to truncate the
+ * file with the serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing
+ * remote transactions with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId	xid;					/* XID of the subxact */
+	off_t			offset;					/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of XIDs of streamed transactions with spool files, so that we can
+ * clean them up on worker exit.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because apply_handle_stream_commit() calls apply_dispatch() */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the message to the spool file of the proper toplevel
+ * transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +657,319 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * if this is not the first segment, open existing file
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive an abort
+		 * for a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're most
+		 * likely aborting changes for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+			return;
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
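
A worked example of the truncation above (values made up): suppose the
subxact file for toplevel XID 500 records

	subxacts[] = { {xid = 501, offset = 0},
	               {xid = 502, offset = 4096},
	               {xid = 503, offset = 9216} }

An abort of subxact 502 finds subidx = 1, truncates the changes file to
4096 bytes (discarding the spooled changes of 502 and everything after it,
including 503), and rewrites the subxact file with nsubxacts = 1.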
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply_handle_* methods invoked from apply_dispatch are
+	 * aware that we're in a remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
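
The replay loop above implies the following on-disk record layout in the
changes file (reconstructed from the reads; stream_write_change is the
writing counterpart):

	/*
	 * One spooled change, as consumed by apply_handle_stream_commit:
	 *
	 *    int32  len        length of the message that follows
	 *    char   data[len]  action byte + message body, fed to apply_dispatch
	 */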
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +983,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1001,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1040,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1158,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1303,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1676,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1817,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1929,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleaning up spool files for %d streamed transactions", nxids);
+
+	for (i = nxids - 1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1961,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2412,567 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole, and we also include a CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we do free the memory allocated for the subxact info. There might
+	 * be one exceptional transaction with many subxacts, and we don't want
+	 * to keep the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as in the previous call,
+	 * so we can simply ignore it (its first change was already recorded).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as non-error.
+ *
+ * missing_ok - if true, don't report an error when a file is missing.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Remove the XID from the array - find its index and replace the entry
+	 * with the last element of the array. The array is bound to be fairly
+	 * small (the number of in-progress xacts is limited by max_connections
+	 * + max_prepared_transactions), so simply loop through the array to
+	 * find the index of the XID.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with a length (not including
+ * the length field itself), an action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
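+/*
+ * Illustrative layout of one serialized change, matching the writes below
+ * (len counts the action byte plus the body, but not itself):
+ *
+ *   +-----------+---------------+-----------------------------+
+ *   | len (int) | action (char) | message body (len - 1 bytes) |
+ *   +-----------+---------------+-----------------------------+
+ */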
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3138,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3118..a94b4a0136 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,54 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream side is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent. So streamed transactions are
+ * tracked separately, using the streamed_txns list below.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +119,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +145,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +208,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +237,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +260,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +281,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +370,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those will only be applied later (and the regular
+	 * transactions won't see their effects until then), in a commit order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change, and
+		 * such a change may occur after streaming has already started, so we
+		 * have to track new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +427,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +469,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called in both streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +490,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +522,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +542,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +566,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +586,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +611,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +643,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +723,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
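+
+/*
+ * For illustration, a large transaction streamed in two chunks and then
+ * committed results in this callback sequence (a sketch, not exhaustive):
+ *
+ *   stream_start_cb, stream_change_cb ..., stream_stop_cb,
+ *   stream_start_cb, stream_change_cb ..., stream_stop_cb,
+ *   stream_commit_cb
+ *
+ * An aborted transaction ends with stream_abort_cb instead.
+ */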
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +844,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * Check if the schema was already sent for this relation in the given
+ * streamed transaction. We expect a relatively small number of streamed
+ * transactions, so a linear search of the list is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -753,11 +1002,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		/* Skip entries whose schema was not sent in this transaction. */
+		if (!list_member_int(entry->streamed_txns, xid))
+			continue;
+
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6fed3cfd23..e1344ab4cc 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -157,6 +157,12 @@ create_logical_replication_slot(char *name, char *plugin,
 											   .segment_close = wal_segment_close),
 									NULL, NULL, NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a4ca8daea7..6def1b96c9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1020,6 +1020,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ae9a39573c..70826c1cef 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -980,7 +980,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
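+ *
+ * For illustration, a client requesting streaming would start replication
+ * with options like the following (the slot and publication names are
+ * hypothetical):
+ *
+ *   START_REPLICATION SLOT "sub1" LOGICAL 0/0
+ *     (proto_version '2', publication_names '"mypub"', streaming 'on')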
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ac1acbb27e..95132062c6 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
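+
+# After the transaction the publisher has 2 + 4998 = 5000 rows, and the
+# DELETE removes the 1666 rows with a divisible by 3, leaving 3334.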
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
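+
+# The final DELETE runs after the last INSERT, so only rows with a not
+# divisible by 3 survive: 2500 - 833 = 1667 of the values 1..2500.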
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of a large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
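+
+# Expected counts: 2002 rows in total, c set from row 4 onwards (1999),
+# d from row 1001 onwards (1002), and e only on row 2002 (1).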
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of a large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
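+
+# Only changes outside the rolled-back subtransactions survive: rows
+# 1..500 plus 2501..3000, i.e. 1000 rows in total.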
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
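+
+# ROLLBACK TO s1 discards everything after the first savepoint, so only
+# rows 1..1000 remain and just the 500 re-inserted rows have c set.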
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v22-0009-Add-TAP-test-for-streaming-vs.-DDL.patchapplication/octet-stream; name=v22-0009-Add-TAP-test-for-streaming-vs.-DDL.patchDownload
From 8e911cfb9c8dd0cce8986bbfe21cab9355101d1f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v22 09/12] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v22-0007-Track-statistics-for-streaming.patch
From c9c7d5c6756e3c623bacbdd8a42f90e66d345a26 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Fri, 15 May 2020 14:07:04 +0530
Subject: [PATCH v22 07/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 33 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 13 ++++++++
 src/backend/replication/walsender.c           | 32 +++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 99 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 87502a49b6..5b64410e7d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2404,6 +2404,39 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Amount of decoded transaction data spilled to disk.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_txns</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of in-progress transactions streamed to subscriber after
+       memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+       Streaming only works with toplevel transactions (subtransactions can't
+       be streamed independently), so the counter does not get incremented for
+       subtransactions.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times in-progress transactions were streamed to subscriber.
+       Transactions may get streamed repeatedly, and this counter gets incremented
+       on every such invocation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_bytes</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Amount of decoded in-progress transaction data streamed to subscriber.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2bd5f5ea14..8f34ce8deb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b41451662a..25bb2fe766 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -332,6 +332,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3325,6 +3329,15 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+	/*
+	 * Update the stream statistics.
+	 */
+	rb->streamCount += 1;
+	rb->streamBytes += txn->size;
+
+	/* Don't count a transaction that has already been streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 6def1b96c9..5d23691930 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1359,7 +1359,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1380,7 +1380,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that were spilled to disk or
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2429,6 +2430,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3264,7 +3268,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3322,6 +3326,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3347,6 +3354,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3449,6 +3459,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3697,11 +3712,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9edae40ed8..5a8826cc67 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b3e2b3f64b..3d3d6609a3 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -514,15 +514,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64           streamTxns;
+	int64           streamCount;
+	int64           streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 8876025aaa..0c4952a1fa 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0

v22-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
From 2123d309c21ddb8b450855909e5b1c0b61131a06 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 14 May 2020 21:28:56 +0530
Subject: [PATCH v22 10/12] Bugfix handling of incomplete toast tuple

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 193 +++++++++++-------
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  24 ++-
 5 files changed, 158 insertions(+), 80 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2d77107c4f..3927448f46 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45ef6..c841687c66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 25bb2fe766..dc02522aaa 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,11 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define ChangeIsInsertOrUpdate(action) \
+			(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+			((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+			((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -655,11 +660,14 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
-	ReorderBufferTXN *txn;
+	ReorderBufferTXN *txn, *toptxn;
+	bool	can_stream = false;
 
-	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	toptxn = txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
 
 	change->lsn = lsn;
 	change->txn = txn;
@@ -669,9 +677,49 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * If this is a toast insert then set the corresponding bit.  Otherwise, if
+	 * we have toast insert bit set and this is insert/update then clear the
+	 * bit.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) &&
+			ChangeIsInsertOrUpdate(change->action))
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+		can_stream = true;
+	}
+
+	/*
+	 * If this is a speculative insert then set the corresponding bit.
+	 * Otherwise, if we have speculative insert bit set and this is spec
+	 * confirm record then clear the bit.
+	 */
+	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
+	{
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+		can_stream = true;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/*
+	 * If streaming is enable and we have serialized this transaction because
+	 * it had incomplete tuple.  So if now we have got the complete tuple we
+	 * can stream it.
+	 */
+	if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+		&& !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
+	{
+		ReorderBufferStreamTXN(rb, toptxn);
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
+	}
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -701,7 +749,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1477,6 +1525,13 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
+	/* remove entries spilled to disk */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
 	/* also reset the number of entries in the transaction */
 	txn->nentries_mem = 0;
 	txn->nentries = 0;
@@ -1932,8 +1987,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						Assert(change->data.tp.newtuple != NULL);
 
 						dlist_delete(&change->node);
-						ReorderBufferToastAppendChunk(rb, txn, relation,
-													  change);
+							ReorderBufferToastAppendChunk(rb, txn, relation,
+														  change);
 					}
 
 			change_done:
@@ -2499,7 +2554,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2548,7 +2603,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2571,6 +2626,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2585,8 +2641,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2594,12 +2655,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2660,7 +2729,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2847,15 +2916,16 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the current transaction is larger and doesn't have incomplete
+		 * data, remember it.
+		 */
+		if (((!largest) || (txn->total_size > largest->total_size)) &&
+			((txn->total_size > 0) && !(rbtxn_has_toast_insert(txn)) &&
+			!(rbtxn_has_spec_insert(txn))))
+			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2873,66 +2943,51 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
+	/* Loop until we reach under the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTopTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
-	{
-		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
-		 */
-		txn = ReorderBufferLargestTXN(rb);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
-		ReorderBufferSerializeTXN(rb, txn);
+		/*
+		 * After eviction, the transaction should have no entries in memory, and
+		 * should use 0 bytes for changes.
+		 *
+		 * XXX Checking the size is fine for both cases - spill to disk and
+		 * streaming. But for streaming we should really check nentries_mem for
+		 * all subtransactions too.
+		 */
+		Assert(txn->size == 0);
+		Assert(txn->nentries_mem == 0);
 	}
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
-
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3d3d6609a3..95abdb184a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+
+/*
+ * This transaction's changes have a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -350,6 +359,9 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -542,7 +554,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0

v22-0011-Provide-new-api-to-get-the-streaming-changes.patch
From 84229c761fadd6a45e2bc5c39da9551cdd161e78 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v22 11/12] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 doc/src/sgml/test-decoding.sgml               | 22 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..eed6e9d134 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the
+  typical output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8f34ce8deb..dd488cb2f8 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1242,6 +1242,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e848..70c28ffa91 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes then disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5a8826cc67..586c9621e2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10115,6 +10115,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0

v22-0012-Add-streaming-option-in-pg_dump.patch
From a8d03c999bef4bcc032c4a0ef4b793d5ff61f730 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v22 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index f33c2463a7..b6ae988b02 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4209,6 +4209,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4243,8 +4244,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4257,6 +4258,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4273,6 +4275,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4350,6 +4353,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5f70400b25..3ccb6be953 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0

#308Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#307)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Review comments:
------------------------------
1.
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
TransactionId xid,
}

case REORDER_BUFFER_CHANGE_MESSAGE:
- rb->message(rb, txn, change->lsn, true,
- change->data.msg.prefix,
- change->data.msg.message_size,
- change->data.msg.message);
+ if (streaming)
+ rb->stream_message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);
+ else
+ rb->message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);

Don't we need to set any_data_sent flag while streaming messages as we
do for other types of changes?

I think any_data_sent was added to avoid sending an abort to the
subscriber if we haven't sent any data, but this is not complete, as
the output plugin can also take the decision not to send. So I think
this should not be done as part of this patch and can be done
separately. I think there is already a thread for handling the
same [1]

Hmm, but prior to this patch, we never used to send (empty) aborts, but
now that will be possible. It is probably okay to deal with that in
the other patch you mentioned, but I felt at least any_data_sent would
work for some cases. OTOH, it appears to be a half-baked solution, so
we should probably refrain from adding it. BTW, how does the pgoutput
plugin deal with it? I see that apply_handle_stream_abort will
unconditionally try to unlink the file and it will probably fail.
Have you tested this scenario after your latest changes?

4.
In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
the try and catch block. If there is an error after calling it in a
try block, we might call it again via catch. I think that will lead
to sending a stop message twice. Won't that be a problem? See the
usage of iterstate in the catch block, we have made it safe from a
similar problem.

IMHO, we don't need that, because we only call stream_stop in the
catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK. So if
we have already stopped the stream in the TRY block then we should not
get that error. I have added comments for the same.

I am still slightly nervous about it as I don't see any solid
guarantee for the same. You are right as the code stands today, but
due to code that gets added in the future, it might not remain true.
I feel it is better to have an Assert here to ensure that stream_stop
won't be called a second time. I don't see any good way of doing that
other than by maintaining a flag or some state, but I think it will be
good to ensure this.
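
Something along these lines is what I have in mind (a rough sketch
only; the stream_stopped flag and the wrapper are hypothetical, not
part of the current patch):

/*
 * Hypothetical guard: assert that the stream_stop callback is invoked
 * at most once per open stream.  The flag would be reset by the
 * matching stream_start invocation.
 */
static bool stream_stopped = false;

static void
stream_stop_once(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	Assert(!stream_stopped);

	rb->stream_stop(rb, txn);
	stream_stopped = true;
}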

6.
PG_CATCH();
{
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData  *errdata = CopyErrorData();

I don't understand the usage of memory context in this part of the
code. Basically, you are switching to CurrentMemoryContext here, do
some error handling and then again reset back to some random context
before rethrowing the error. If there is some purpose for it, then it
might be better if you can write a few comments to explain the same.

Basically, ccxt is the CurrentMemoryContext when we started the
streaming and ecxt is the context when we catch the error. So
ideally, before this change, it would rethrow in the context in which
we caught the error, i.e., ecxt. So what we are trying to do is put it
back to the normal context (ccxt) and copy the error data in the
normal context. And, if we are not handling it gracefully, then put it
back to the context it was in, and rethrow.

Okay, but when errorcode is *not* ERRCODE_TRANSACTION_ROLLBACK, don't
we need to clean up the reorderbuffer by calling
ReorderBufferCleanupTXN? If so, then you can try to combine it with
the not-streaming else loop.
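
To make the shape concrete, the catch block I have in mind would look
roughly like this (only a sketch; the surrounding variables are as in
the patch):

PG_CATCH();
{
	MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
	ErrorData  *errdata = CopyErrorData();

	if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
	{
		/* concurrent abort: absorb the error and end this stream */
		FlushErrorState();
		FreeErrorData(errdata);
		rb->stream_stop(rb, txn);
	}
	else
	{
		/*
		 * Any other error: clean up the transaction the same way the
		 * non-streaming path does, then rethrow in the context the
		 * error was caught in.
		 */
		ReorderBufferCleanupTXN(rb, txn);
		MemoryContextSwitchTo(ecxt);
		PG_RE_THROW();
	}
}
PG_END_TRY();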

8.
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * TOCHECK: Mark toplevel transaction as having catalog changes too
+ * if one of its children has.
+ */
+ if (txn->toptxn != NULL)
+ txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}

Why are we marking top transaction here?

We need to mark the top transaction to decide whether to build the
tuplecid hash or not. In non-streaming mode, we only send at commit
time, and at commit time we know whether the top transaction has any
catalog changes based on the invalidation messages, so we mark the top
transaction there in DecodeCommit. Since here we are not waiting till
commit, we need to mark the top transaction as soon as we mark any of
its child transactions.

But how does it help? We use this flag (via
ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn, which is
anyway done in DecodeCommit, and that too after setting this flag for
the top transaction if required. So, how will it help to set it while
processing a subxid? Also, even if we have to do it, won't it
needlessly add the xid to the builder->committed.xip array?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#309Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#308)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
1.
+ /*
+ * If this is a toast insert then set the corresponding bit.  Otherwise, if
+ * we have toast insert bit set and this is insert/update then clear the
+ * bit.
+ */
+ if (toast_insert)
+ toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {

Here, it might be better to add a comment on why we expect only
Insert/Update. Also, it might be better to add an assert for
other operations.
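
For instance, the clearing branch could carry a comment like this
(the wording is mine, just to illustrate):

	else if (rbtxn_has_toast_insert(txn) &&
			 ChangeIsInsertOrUpdate(change->action))
	{
		/*
		 * The toast chunks of a tuple are immediately followed by the
		 * insert/update of the owning tuple itself, so a complete
		 * insert/update on the main table is what closes the pending
		 * toast chain.
		 */
		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
		can_stream = true;
	}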

2.
@@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
  * disk.
  */
  dlist_delete(&change->node);
- ReorderBufferToastAppendChunk(rb, txn, relation,
-   change);
+ ReorderBufferToastAppendChunk(rb, txn, relation,
+   change);
  }

This seems to be a spurious change.

3.
+ /*
+ * If streaming is enable and we have serialized this transaction because
+ * it had incomplete tuple.  So if now we have got the complete tuple we
+ * can stream it.
+ */
+ if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+ && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
+ {

This comment is just saying what you are doing in the if-check. I
think you need to explain the rationale behind it. I don't like the
variable name 'can_stream' because it matches ReorderBufferCanStream
whereas it is for a different purpose; how about naming it
'change_complete' or something like that? The check has many
conditions; can we move it to a separate function to make the code
here look clean?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#310Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#309)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3.
+ /*
+ * If streaming is enable and we have serialized this transaction because
+ * it had incomplete tuple.  So if now we have got the complete tuple we
+ * can stream it.
+ */
+ if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+ && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
+ {

This comment is just saying what you are doing in the if-check. I
think you need to explain the rationale behind it. I don't like the
variable name 'can_stream' because it matches ReorderBufferCanStream
whereas it is for a different purpose; how about naming it
'change_complete' or something like that? The check has many
conditions; can we move it to a separate function to make the code
here look clean?

Do we really need this? Immediately after this check, we are calling
ReorderBufferCheckMemoryLimit which will anyway stream the changes if
required. Can we move the changes related to the detection of
incomplete data to a separate function?

Some more comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:

+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {
+ toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+ can_stream = true;
+ }
..
+#define ChangeIsInsertOrUpdate(action) \
+ (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+ ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+ ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))

How can we clear the RBTXN_HAS_TOAST_INSERT flag on
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?

IIUC, the basic idea used to handle incomplete changes (which is
possible in case of toast tuples and speculative inserts) is to mark
such TXNs as containing incomplete changes, and then while finding the
largest top-level TXN for streaming, we ignore such TXNs and move to
the next largest TXN. If none of the TXNs have complete changes then
we choose the largest (sub)transaction and spill it to bring the
in-memory changes below the logical_decoding_work_mem threshold. This
idea can work, but the strategy to choose the transaction is
suboptimal for cases where TXNs have some changes which are complete
followed by an incomplete toast or speculative tuple. I was having an
offlist discussion with Robert on this problem and he suggested that
it would be better if we track the complete part of the changes
separately, so we can avoid the drawback mentioned above. I have
thought about this and I think it can work if we track the size and
LSN of the completed changes. I think we need to ensure that if there
is a concurrent abort then we discard all changes for the current
(sub)transaction, not only up to the completed-changes LSN, whereas if
the streaming is successful then we can truncate the changes only up
to the completed-changes LSN. What do you think?
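
For example (a sketch only; the field names are made up):

typedef struct ReorderBufferTXN
{
	/* ... existing fields ... */

	/*
	 * Size and end LSN of the transaction's changes up to the last
	 * point at which it had no incomplete toast chain or unconfirmed
	 * speculative insert.  On a successful stream we would truncate
	 * changes only up to complete_lsn; on a concurrent abort we would
	 * discard everything, as before.
	 */
	Size		complete_size;
	XLogRecPtr	complete_lsn;
} ReorderBufferTXN;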

I wonder why you have done this as 0010 in the patch series; it
should be 0006, after
0005-Implement-streaming-mode-in-ReorderBuffer.patch. If we can do it
that way then it would be easier for me to review. Is there a reason
for not doing so?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#311Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#310)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3.
+ /*
+ * If streaming is enable and we have serialized this transaction because
+ * it had incomplete tuple.  So if now we have got the complete tuple we
+ * can stream it.
+ */
+ if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+ && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
+ {

This comment is just saying what you are doing in the if-check. I
think you need to explain the rationale behind it. I don't like the
variable name 'can_stream' because it matches ReorderBufferCanStream
whereas it is for a different purpose; how about naming it
'change_complete' or something like that? The check has many
conditions; can we move it to a separate function to make the code
here look clean?

Do we really need this? Immediately after this check, we are calling
ReorderBufferCheckMemoryLimit which will anyway stream the changes if
required.

Actually, ReorderBufferCheckMemoryLimit is only meant for checking
whether we need to stream the changes due to the memory limit. But
suppose that when the memory limit was exceeded we could not stream the
transaction because it had only an incomplete toast insert, so we
serialized it. Now we get the tuple which makes the changes
complete, but we are no longer crossing the memory limit, as the
changes were already serialized. So I am not sure whether it is a good
idea to stream the transaction as soon as we get the complete changes,
or whether we should wait till the next time the memory limit is
exceeded and select a suitable candidate then. Ideally, if we are in
streaming mode and the transaction is serialized, it was already a
candidate for streaming but could not be streamed due to the
incomplete changes, so shouldn't we stream it immediately as soon as
its changes are complete, even though we are now under the memory
limit? Because our target is to stream, not spill, we should try to
stream the spilled changes at the first opportunity.

Can we move the changes related to the detection of

incomplete data to a separate function?

Ok.
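
Something like this, for example (reusing the flags from the patch;
the function name is provisional):

/*
 * Does the transaction have any incomplete change (an unfinished
 * toast chain or an unconfirmed speculative insert) that prevents
 * streaming it?
 */
static bool
ReorderBufferTxnIsIncomplete(ReorderBufferTXN *txn)
{
	return rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn);
}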

Some more comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:

+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {
+ toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+ can_stream = true;
+ }
..
+#define ChangeIsInsertOrUpdate(action) \
+ (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+ ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+ ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))

How can we clear the RBTXN_HAS_TOAST_INSERT flag on
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?

A partial toast insert means we have inserted into the toast table but
not into the main table. So even if it is a spec insert we can form
the complete tuple; however, we still cannot stream it because we
haven't got the spec_confirm, but for that we are marking another
flag. So if the insert is a spec insert, the toast insert will also be
a spec insert, and as part of those toast spec inserts we mark the
tuple as partial, so clearing that flag should happen when the spec
insert is done for the main table, right?

IIUC, the basic idea used to handle incomplete changes (which is
possible in case of toast tuples and speculative inserts) is to mark
such TXNs as containing incomplete changes and then, while finding the
largest top-level TXN for streaming, we ignore such TXNs and move to
the next largest TXN. If none of the TXNs have complete changes then
we choose the largest (sub)transaction and spill it to bring the
in-memory changes below the logical_decoding_work_mem threshold. This
idea can work, but the strategy to choose the transaction is
suboptimal for cases where TXNs have some complete changes followed
by an incomplete toast or speculative tuple. I was having an offlist
discussion with Robert on this problem and he suggested that it would
be better if we track the complete part of the changes separately, so
that we can avoid the drawback mentioned above. I have thought about
this and I think it can work if we track the size and LSN of the
completed changes. We need to ensure that on a concurrent abort we
discard all changes for the current (sub)transaction, not only up to
the completed-changes LSN, whereas if the streaming is successful
then we can truncate the changes only up to the completed-changes
LSN. What do you think?
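
To make that concrete, here is a minimal sketch (the complete_size /
complete_lsn fields and the helper name are hypothetical, not from
the posted patches) of advancing a "completed" watermark whenever the
top-level transaction has no pending toast or spec insert:

/*
 * Hypothetical additions to ReorderBufferTXN (sketch only):
 *   Size        complete_size;  -- accounted size up to the watermark
 *   XLogRecPtr  complete_lsn;   -- LSN of the last complete change
 */
static void
ReorderBufferUpdateCompleteWatermark(ReorderBufferTXN *toptxn,
									 ReorderBufferChange *change)
{
	/* the tuple is complete if no toast/spec insert is pending */
	if (!rbtxn_has_toast_insert(toptxn) && !rbtxn_has_spec_insert(toptxn))
	{
		toptxn->complete_size = toptxn->size;	/* hypothetical field */
		toptxn->complete_lsn = change->lsn;		/* hypothetical field */
	}
}

Streaming would then send changes only up to complete_lsn and truncate
the streamed prefix on success, while a concurrent abort would discard
the whole (sub)transaction.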

I wonder why you have done this as 0010 in the patch series; it
should be 0006, after
0005-Implement-streaming-mode-in-ReorderBuffer.patch. If we can do it
that way then it would be easier for me to review. Is there a reason
for not doing so?

No reason, I can do that. Actually, later we can merge the changes
into 0005 itself; I kept them separate for review. Anyway, in the
next version, I will make it 0006.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#312Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#311)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 19, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3.
+ /*
+ * If streaming is enable and we have serialized this transaction because
+ * it had incomplete tuple.  So if now we have got the complete tuple we
+ * can stream it.
+ */
+ if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+ && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
+ {

This comment is just saying what you are doing in the if-check. I
think you need to explain the rationale behind it. I don't like the
variable name 'can_stream' because it matches ReorderBufferCanStream
whereas it is for a different purpose; how about naming it
'change_complete' or something like that? The check has many
conditions; can we move it to a separate function to make the code
here look clean?

Do we really need this? Immediately after this check, we are calling
ReorderBufferCheckMemoryLimit which will anyway stream the changes if
required.

Actually, ReorderBufferCheckMemoryLimit is only meant for checking
whether we need to stream the changes due to the memory limit. But
suppose that when the memory limit was exceeded we could not stream
the transaction because it had only an incomplete toast insert, so we
serialized it. Now we get the tuple that makes the changes complete,
but we are no longer crossing the memory limit because the changes
were already serialized. So I am not sure whether it is a good idea
to stream the transaction as soon as we get the complete changes, or
whether we should wait until the next time the memory limit is
exceeded and select a suitable candidate at that point.

I think it is better to wait till the next time we exceed the memory threshold.

Ideally, if we are in streaming mode and the transaction got
serialized, it was already a candidate for streaming but could not be
streamed due to the incomplete changes, so shouldn't we stream it
immediately as soon as its changes are complete, even though we are
now within the memory limit?

The only time we need to stream or spill is when we exceed the memory
threshold. In the above case, it is possible that next time there is
some other candidate transaction that we can stream.

Another comment on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:

+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {
+ toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+ can_stream = true;
+ }
..
+#define ChangeIsInsertOrUpdate(action) \
+ (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+ ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+ ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))

How can we clear the RBTXN_HAS_TOAST_INSERT flag on
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?

A partial toast insert means we have inserted into the toast table
but not yet into the main table. So even if it is a spec insert we
can form the complete tuple; however, we still cannot stream it
because we haven't got the spec_confirm, and for that we are marking
another flag. If the insert is a spec insert then the toast insert
will also be a spec insert, and as part of those toast spec inserts
we mark the tuple as partial, so clearing that flag should happen
when the spec insert is done for the main table, right?

Sounds reasonable.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#313Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#301)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

4.
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_start";
+ /* state.report_location = apply_lsn; */

Why can't we supply the report_location here? I think here we need to
report txn->first_lsn if this is the very first stream and
txn->final_lsn if it is any consecutive one.

Done

Now, after your change in stream_start_cb_wrapper, we assign
report_location from the first_lsn passed as input to the function,
but write_location is still txn->first_lsn. Shouldn't we assign the
passed-in first_lsn to write_location as well? It seems assigning
txn->first_lsn won't be correct for streams other than the first one.
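
For reference, a sketch of the wrapper with both locations taken from
the passed-in first_lsn, following the pattern of the existing
*_cb_wrapper functions in logical.c (a sketch under that assumption,
not the exact patch code):

static void
stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
						XLogRecPtr first_lsn)
{
	LogicalDecodingContext *ctx = cache->private_data;
	LogicalErrorCallbackState state;
	ErrorContextCallback errcallback;

	Assert(!ctx->fast_forward);

	/* We're only supposed to call this when streaming is supported. */
	Assert(ctx->streaming);

	/* Push callback + info on the error context stack */
	state.ctx = ctx;
	state.callback_name = "stream_start";
	state.report_location = first_lsn;	/* this stream's start, not txn->first_lsn */
	errcallback.callback = output_plugin_error_callback;
	errcallback.arg = (void *) &state;
	errcallback.previous = error_context_stack;
	error_context_stack = &errcallback;

	/* set output state */
	ctx->accept_writes = true;
	ctx->write_xid = txn->xid;
	ctx->write_location = first_lsn;	/* likewise the passed-in LSN */

	/* do the actual work: call the callback */
	ctx->callbacks.stream_start_cb(ctx, txn);

	/* Pop the error context stack */
	error_context_stack = errcallback.previous;
}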

5.
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_stop";
+ /* state.report_location = apply_lsn; */

Can't we report txn->final_lsn here?

We are already setting this to txn->final_lsn in the 0006 patch, but
I have moved it into this patch now.

Similar to the previous point, here also I think we need to assign
the report and write locations to the last_lsn passed to this API.

v20-0005-Implement-streaming-mode-in-ReorderBuffer
-----------------------------------------------------------------------------
10.
Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

I don't think this part of the commit message is correct, as we
sometimes need to spill even during streaming. Please check the
entire commit message and update it according to the latest
implementation.

Done

You seem to have forgotten to remove the other part of the message
("This adds a second iterator for the streaming case ....") which is
not relevant now.

11.
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
*/
static void
ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
*rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;

- if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
- return;
-

I don't understand this change. Why would "INSERT followed by
TRUNCATE" lead to a tuple that comes up for decoding before its CID?
The patch has made changes based on this assumption in
HeapTupleSatisfiesHistoricMVCC, which appears to be very risky, as
the behavior could depend on whether we are streaming the changes for
an in-progress xact or decoding at the commit of a transaction. We
might want to write a test to validate this behavior once.

Also, the comment refers to tqual.c which is wrong as this API is now
in heapam_visibility.c.

Done.

+ * INSERT.  So in such cases we assume the CIDs is from the future command
+ * and return as unresolve.
+ */
+ if (tuplecid_data == NULL)
+ return false;
+

Here let's reword the last line of the comment as ". So in such cases
we assume the CID is from the future command."

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#314Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#302)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3.
And, during catalog scan we can check the status of the xid and
+ * if it is aborted we will report a specific error that we can ignore.  We
+ * might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the abort we will
+ * stream abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)

In the above comment, I don't think it is right to say that we ignore
the error raised due to the aborted transaction. We need to say that
we discard the already streamed changes on such an error.

Done.

In the same comment, there is a typo (/messageto/message to).

4.
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
/*
- * If this transaction has no snapshot, it didn't make any changes to the
- * database, so there's nothing to decode.  Note that
- * ReorderBufferCommitChild will have transferred any snapshots from
- * subtransactions if there were any.
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.
*/
- if (txn->base_snapshot == NULL)
+ if (!TransactionIdDidCommit(xid))
{
- Assert(txn->ninvalidations == 0);
- ReorderBufferCleanupTXN(rb, txn);
- return;
+ CheckXidAlive = xid;
+ bsysscan = false;
}

I think this function is inline as it needs to be called for each
change. If that is the case, and even otherwise, isn't it better to
first check whether the passed xid is the same as CheckXidAlive
before checking TransactionIdDidCommit, as TransactionIdDidCommit can
be costly and calling it for each change might not be a good idea?

Done. Also, I think it is better to check TransactionIdIsInProgress
instead of !TransactionIdDidCommit. I have changed that as well.

What if it is aborted just before this check? I think the decode API
won't be able to detect that, and the sys* APIs won't care to check
because CheckXidAlive won't be set for that case.
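
For illustration, the variant that closes this race (and what the
attached v23 ends up doing in SetupCheckXidLive, see the patch below)
keeps the !TransactionIdDidCommit check, so even a just-aborted xid
still gets CheckXidAlive set and is re-verified during catalog access:

	/*
	 * Sketch: set CheckXidAlive for any xid not known committed. If the
	 * xid aborted (even just before this check), the catalog access
	 * re-checks it via CheckXidAlive and reports the abort.
	 */
	if (!TransactionIdDidCommit(xid))
	{
		CheckXidAlive = xid;
		bsysscan = false;
	}
	else
		CheckXidAlive = InvalidTransactionId;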

5.
setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.

/if the xid aborted/if the xid is aborted. missing comma after Also.

Done

You forgot to make the change for the second part of the comment (the
missing comma after Also).

8.
@@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
* use as a normal record. It'll be cleaned up at the end
* of INSERT processing.
*/
- if (specinsert == NULL)
- elog(ERROR, "invalid ordering of speculative insertion changes");

You have removed this check, but all other handling of specinsert is
the same as far as this patch is concerned. Why so?

Seems like a merge issue, or a leftover from the old design of the
toast handling where we were streaming with the partial tuple. Fixed
now.

9.
@@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
* freed/reused while restoring spooled data from
* disk.
*/
- Assert(change->data.tp.newtuple != NULL);
-
dlist_delete(&change->node);

Why is this Assert removed?

Same cause as above so fixed.

10.
@@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relations[nrelations++] = relation;
}

- rb->apply_truncate(rb, txn, nrelations, relations, change);
+ if (streaming)
+ {
+ rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+ /* Remember that we have sent some data. */
+ change->txn->any_data_sent = true;
+ }
+ else
+ rb->apply_truncate(rb, txn, nrelations, relations, change);

Can we encapsulate this in a separate function like
ReorderBufferApplyTruncate or something like that? Basically, rather
than having the streaming check in this function, let's do it in some
other internal function. And we can likewise do it for all the
streaming checks in this function, or at least wherever it is
feasible. That will make this function look clean.

Done for truncate and change. I think we can create a few more such
functions for start/stop and for cleanup handling on error. I will
work on that.

Yeah, I think that would be better.

One minor comment change suggestion:
/*
+ * start stream or begin the transaction.  If this is the first
+ * change in the current stream.
+ */

We can write the above comment as "Start the stream or begin the
transaction for the first change in the current stream."

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#315Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#314)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have further reviewed v22 and below are my comments:

v22-0005-Implement-streaming-mode-in-ReorderBuffer
--------------------------------------------------------------------------
1.
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)

The above 'Note' is not correct as per the latest implementation.

v22-0006-Add-support-for-streaming-to-built-in-replicatio
----------------------------------------------------------------------------
2.
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"

Spurious line removal.

3.
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+    XLogRecPtr commit_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'c'); /* action STREAM COMMIT */
+
+ Assert(TransactionIdIsValid(txn->xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, txn->xid);

The part of the comment "we're starting to stream, so must be valid"
is not correct, as we are not at the start of the stream here. The
patch has used the same incorrect sentence in a few places; kindly
fix those as well.

4.
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
{
..

For this and other places in the patch, like in function
stream_open_file(), instead of using TopMemoryContext, can we
consider using a new memory context, LogicalStreamingContext or
something like that? We can create LogicalStreamingContext under
TopMemoryContext. I don't see any need to use TopMemoryContext here.
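
As a sketch of that suggestion (the context name is just the proposal
above, nothing existing):

#include "utils/memutils.h"

static MemoryContext LogicalStreamingContext = NULL;

/*
 * Sketch: keep streaming bookkeeping (subxact info, stream file state)
 * in a dedicated child of TopMemoryContext, so it can be reset or
 * inspected as a unit instead of leaking into TopMemoryContext.
 */
static MemoryContext
get_logical_streaming_context(void)
{
	if (LogicalStreamingContext == NULL)
		LogicalStreamingContext =
			AllocSetContextCreate(TopMemoryContext,
								  "LogicalStreamingContext",
								  ALLOCSET_DEFAULT_SIZES);
	return LogicalStreamingContext;
}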

5.
+static void
+subxact_info_add(TransactionId xid)

This function assumes valid values for global variables like
stream_fd and stream_xid. I think it is better to have Asserts for
those in this function before using them. The Asserts for those are
present in handle_streamed_transaction, but I feel they should be in
subxact_info_add.

6.
+subxact_info_add(TransactionId xid)
/*
+ * In most cases we're checking the same subxact as we've already seen in
+ * the last call, so make ure just ignore it (this change comes later).
+ */
+ if (subxact_last == xid)
+ return;

Typo and minor correction, /ure just/sure to

7.
+subxact_info_write(Oid subid, TransactionId xid)
{
..
+ /*
+ * But we free the memory allocated for subxact info. There might be one
+ * exceptional transaction with many subxacts, and we don't want to keep
+ * the memory allocated forewer.
+ *
+ */

a. Typo, /forewer/forever
b. The extra line at the end of the comment is not required.

8.
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)

Do we really need to have the checksum for temporary files? I have
checked a few other similar cases, like the SharedFileSet stuff for
parallel hash join, but didn't find them using checksums. Can you
also take a look at other usages of temporary files, and then let us
decide whether we see any reason to have checksums for this?

Another point is that we don't seem to be doing this for the
'changes' file, see stream_write_change. So I am not sure there is
any sense in writing a checksum for the subxact file.

Tomas, do you see any reason for the same?

9.
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+ char tempdirpath[MAXPGPATH];
+
+ TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+ /*
+ * We might need to create the tablespace's tempfile directory, if no
+ * one has yet done so.
+ */
+ if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ tempdirpath)));
+
+ snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+ tempdirpath, subid, xid);
+}

Temporary files created in PGDATA/base/pgsql_tmp follow a certain
naming convention (see docs [1]) which is not followed here. You can
also refer to SharedFileSetPath and OpenTemporaryFile. I think we can
just try to follow that convention and then additionally append
subid, xid and .subxacts. Also, a similar change is required for
changes_filename. I would like to know if there is a reason why we
want to use a different naming convention here?
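
As a sketch of following that convention (PG_TEMP_FILE_PREFIX is
"pgsql_tmp" from storage/fd.h; whether to embed MyProcPid depends on
how long the file must outlive the worker, so treat this as one
possible shape rather than the final naming):

static void
subxact_filename(char *path, Oid subid, TransactionId xid)
{
	char		tempdirpath[MAXPGPATH];

	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);

	/* standard temp-file prefix, then subid, xid and the suffix */
	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
}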

10.
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_close_file(void)

The comment seems to be wrong. I think this can only be called at the
stream end, so it should be: "This can only be called at the end of a
"streaming" block, i.e. at the stream_stop message from the upstream."

11.
+ * the order the transactions are sent in. So streamed trasactions are
+ * handled separately by using schema_sent flag in ReorderBufferTXN.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
  Oid relid; /* relation oid */
-
+ TransactionId xid; /* transaction that created the record */
  /*
  * Did we send the schema?  If ancestor relid is set, its schema must also
  * have been sent for this to be true.
  */
  bool schema_sent;
+ List    *streamed_txns; /* streamed toplevel transactions with this
+ * schema */

The part of the comment "So streamed trasactions are handled
separately by using schema_sent flag in ReorderBufferTXN." doesn't
seem to match what we are doing in the latest version of the patch.

12.
maybe_send_schema()
{
..
+ if (in_streaming)
+ {
+ /*
+ * TOCHECK: We have to send schema after each catalog change and it may
+ * occur when streaming already started, so we have to track new catalog
+ * changes somehow.
+ */
+ schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
..
..
}

I think it is good to verify/test what this comment says once, but as
per the code we should be sending the schema after each catalog
change, as we invalidate the streamed_txns list in
rel_sync_cache_relation_cb, which must be called during relcache
invalidation. Do we see any problem with that mechanism?

13.
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * it's subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+    ReorderBufferTXN *txn,
+    XLogRecPtr commit_lsn)

This comment is copied from pgoutput_stream_abort, so doesn't match
what this function is doing.

[1]: https://www.postgresql.org/docs/devel/storage-file-layout.html

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#316Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#315)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

v22-0006-Add-support-for-streaming-to-built-in-replicatio
----------------------------------------------------------------------------

Few more comments on v22-0006 patch:

1.
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+ int i;
+ char path[MAXPGPATH];
+ bool found = false;
+
+ subxact_filename(path, subid, xid);
+
+ if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));

Here, we have unlinked the files containing information about the
subxacts, but don't we need to free the corresponding memory (the
memory for subxacts) as well?

2.
apply_handle_stream_abort()
{
..
+ subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+
+ return;
..
}

As with the previous comment, it seems here also we need to free the
subxacts memory; additionally, we forgot to adjust the xids array as
well.

3.
apply_handle_stream_abort()
{
..
+ /* XXX optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ return;
..
}

Is it possible that we don't find the xid in the subxacts array? If
so, I think we should mention that in the comments; otherwise, we
should have an assert on found.
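
A sketch of what that XXX could look like, assuming the subxacts array
is kept sorted by xid and that SubXactInfo matches the element type
implied by the quoted code:

#include <stdlib.h>

typedef struct SubXactInfo
{
	TransactionId xid;			/* XID of the subxact */
	off_t		offset;			/* offset in the changes file */
} SubXactInfo;

static int
subxact_info_cmp(const void *a, const void *b)
{
	TransactionId xa = ((const SubXactInfo *) a)->xid;
	TransactionId xb = ((const SubXactInfo *) b)->xid;

	return (xa < xb) ? -1 : ((xa > xb) ? 1 : 0);
}

	...
	SubXactInfo	key = {.xid = subxid};
	SubXactInfo *ent;

	ent = bsearch(&key, subxacts, nsubxacts, sizeof(SubXactInfo),
				  subxact_info_cmp);
	if (ent == NULL)
		return;					/* subxact may have no changes of its own */
	subidx = ent - subxacts;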

4.
apply_handle_stream_abort()
{
..
+ changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+ if (truncate(path, subxacts[subidx].offset))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not truncate file \"%s\": %m", path)));
..
}

Will truncate work on Windows? I see in the code we use ftruncate,
which is defined as chsize in win32.h and win32_port.h. I have not
tested this, so I am not very sure about it. I got the below warning
when I tried to compile this code on Windows. I think it is better to
use ftruncate, as it is used in other places in the code as well.

worker.c(798): warning C4013: 'truncate' undefined; assuming extern
returning int
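
A sketch of the ftruncate-based variant, using the usual fd.c
transient-file helpers (treat the error strings as placeholders):

	int			fd;

	fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
	if (fd < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", path)));

	/* ftruncate is portable; win32_port.h maps it to chsize */
	if (ftruncate(fd, subxacts[subidx].offset) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not truncate file \"%s\": %m", path)));

	CloseTransientFile(fd);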

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#317Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#309)
12 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
1.
+ /*
+ * If this is a toast insert then set the corresponding bit.  Otherwise, if
+ * we have toast insert bit set and this is insert/update then clear the
+ * bit.
+ */
+ if (toast_insert)
+ toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {

Here, it might be better to add a comment on why we expect only
insert/update. Also, it might be better to add an assert for other
operations.

I have added comments explaining why we clear the flag on
insert/update. But I don't think we only expect insert/update; we
might get a toast delete, right? Because for a toast update we will
do a toast delete + toast insert. So when we get the toast delete we
just don't want to do anything.

2.
@@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
* disk.
*/
dlist_delete(&change->node);
- ReorderBufferToastAppendChunk(rb, txn, relation,
-   change);
+ ReorderBufferToastAppendChunk(rb, txn, relation,
+   change);
}

This seems to be a spurious change.

Done

3.
+ /*
+ * If streaming is enable and we have serialized this transaction because
+ * it had incomplete tuple.  So if now we have got the complete tuple we
+ * can stream it.
+ */
+ if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+ && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
+ {

This comment is just saying what you are doing in the if-check. I
think you need to explain the rationale behind it. I don't like the
variable name 'can_stream' because it matches ReorderBufferCanStream
whereas it is for a different purpose; how about naming it
'change_complete' or something like that? The check has many
conditions; can we move it to a separate function to make the code
here look clean?

As per the other comments we have removed this part in the latest patch set.

Apart from these comment fixes, there are 2 more changes:
1. The handling of the toast tuple is changed as per the offlist
discussion with you.
Basically, now, instead of not streaming the txn with the incomplete
tuple, we stream it up to the last complete lsn. So if the txn has
incomplete changes but its complete size is the largest, then we will
stream it. And, after streaming, we will truncate the transaction up
to the last complete lsn (see the sketch after point 2).

2. There is a bug fix in handling the stream abort in 0008 (earlier it
was 0006).
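
To illustrate change 1, the memory-limit handling would look roughly
like this (a sketch only; ReorderBufferLargestTopTXN and the
truncate-after-stream behaviour follow the patch series, but don't
take the exact shape as authoritative):

	if (ReorderBufferCanStream(rb))
	{
		/* pick the largest toplevel transaction and stream it */
		ReorderBufferTXN *txn = ReorderBufferLargestTopTXN(rb);

		/* stream the changes decoded so far, up to the last complete lsn */
		ReorderBufferStreamTXN(rb, txn);
		/* ...which then truncates (frees) the streamed prefix */
	}
	else
	{
		/* no streaming support, so spill the largest (sub)transaction */
		ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);
		ReorderBufferSerializeTXN(rb, txn);
	}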

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v23-0001-Immediately-WAL-log-assignments.patch (application/octet-stream)
From 63bed6b2ed7844dd78eb2934b572466cfc671284 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v23 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is
required for avoiding overflow in the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 39 ++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd30b62d36..3af8e81af1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09e..53be2b3059 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5995798b58..560ec27fa0 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1195,6 +1195,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1233,6 +1234,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..122c581d0f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel xid is valid, we need to assign the subxact to the
+	 * toplevel xact. We need to do this for all records, hence we do it before
+	 * the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04babc2..8645b3816c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d..26426cc779 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index c21b0ba972..83170a663c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -308,6 +310,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0

v23-0005-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From 05e7b82634ff93eb35a63144b4b7d51611dd83b3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 May 2020 19:56:35 +0530
Subject: [PATCH v23 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we
have in memory and invoke the new stream API methods. This happens in
ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().  However, sometimes if we have an incomplete
toast or speculative insert we spill to disk, because we cannot
generate the complete tuple to stream.  And, as soon as we get the
complete tuple, we stream the transaction including the serialized
changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 758 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  31 +
 3 files changed, 750 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b889edf461..2cdfb348af 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +382,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -772,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -865,6 +911,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -1024,6 +1073,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1090,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1315,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1340,6 +1404,80 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from it's containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1491,57 +1629,171 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such case if the
+ * (sub)transaction has catalog update then we might decode the tuple using
+ * wrong catalog version.  So for detecting the concurrent abort we set
+ * CheckXidAlive to the current (sub)transaction's xid for which this change
+ * belongs to.  And, during catalog scan we can check the status of the xid and
+ * if it is aborted we will report a specific error so that we can stop
+ * streaming current transaction and discard the already streamed changes on
+ * such an error.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine because when we decode the abort
+ * we will stream abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as a CheckXidAlive then
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Setup CheckXidAlive if it's not committed yet. We don't check if the
+	 * xid is aborted. That will happen during catalog access.  Also, reset the
+	 * bsysscan flag.
+	 */
+	if (!TransactionIdDidCommit(xid))
+	{
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse the same while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.
+ */
+static void
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   Snapshot snapshot_now,
+								   CommandId command_id,
+								   XLogRecPtr last_lsn,
+								   ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+	ReorderBufferToastReset(rb, txn);
+	if (specinsert != NULL)
+		ReorderBufferReturnChange(rb, specinsert);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool	stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1564,21 +1816,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
-
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Start stream or begin transaction for the first change in the
+			 * current stream.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+				else
+					rb->begin(rb, txn);
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1655,7 +1932,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1695,7 +1973,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +2031,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +2043,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,7 +2074,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1858,14 +2135,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			rb->stream_stop(rb, txn, prev_lsn);
+			stream_started = false;
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2179,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,17 +2214,118 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If the error code is ERRCODE_TRANSACTION_ROLLBACK, that means we
+		 * have detected a concurrent abort of the (sub)transaction we are
+		 * streaming.  So just do the cleanup and return gracefully.
+		 * Otherwise, re-throw the error.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * We can get this error only in streaming mode, because only in
+			 * streaming mode do we send in-progress transactions.
+			 */
+			Assert(streaming);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/*
+			 * In the TRY block we only stop the stream after we have sent
+			 * all the changes.  So if we have detected a concurrent abort,
+			 * the stream must not have been stopped yet.
+			 */
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
 
-		PG_RE_THROW();
+			/* Handle the concurrent abort. */
+			ReorderBufferHandleConcurrentAbort(rb, txn, snapshot_now,
+											   command_id, prev_lsn,
+											   specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.  We iterate over the top and
+ * subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1946,6 +2350,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2426,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2568,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2586,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2598,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2648,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2733,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2398,6 +2843,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming, so their size is always 0).
+ * But we can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
@@ -2418,15 +2895,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3254,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (it might have been streamed right before the commit, in which case the
+ * commit would attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3864,6 +4468,16 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples whose CID
+	 * was set by a command not decoded yet.  Think e.g. about INSERT followed
+	 * by TRUNCATE, where the TRUNCATE may not be decoded yet when applying
+	 * the INSERT.  So in such cases we assume the CID is from the future command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 65814af9f5..b3e2b3f64b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -224,6 +243,11 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -254,6 +278,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0
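
To make the above easier to follow, here is a condensed sketch (not part of
the patch) of the decision made once the memory limit is exceeded, with the
asserts and error handling stripped; the "* 1024L" scaling assumes the GUC is
measured in kilobytes:

static void
ReorderBufferCheckMemoryLimitSketch(ReorderBuffer *rb)
{
	ReorderBufferTXN *txn;

	/* below the limit, nothing to do (GUC assumed to be in kB) */
	if (rb->size < logical_decoding_work_mem * 1024L)
		return;

	if (ReorderBufferCanStream(rb))
	{
		/*
		 * Stream the largest toplevel transaction: one stream_start, then
		 * stream_change/stream_message/stream_truncate per decoded change,
		 * then stream_stop.
		 */
		txn = ReorderBufferLargestTopTXN(rb);
		ReorderBufferStreamTXN(rb, txn);
	}
	else
	{
		/* plugin has no streaming support, spill the largest txn to disk */
		txn = ReorderBufferLargestTXN(rb);
		ReorderBufferSerializeTXN(rb, txn);
	}
}

Each ReorderBufferStreamTXN call therefore produces one stream_start ..
stream_stop block downstream. When the commit record eventually arrives,
ReorderBufferCommit notices rbtxn_is_streamed and finishes the transaction
via ReorderBufferStreamCommit; a concurrent abort ends it via stream_abort
instead.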

v23-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch (application/octet-stream)
From 351db3d1ffdaff49157f2c4897c86c1183abd483 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v23 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. The decoding logic on the receipt
of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 1b56daa4bb..5f7394f3c1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,9 +432,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94eb37d48d..2d77107c4f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * tableam level API but this is called from many places so we need to
+	 * ensure it here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle a concurrent abort of CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted. We can't use TransactionIdDidAbort
+ * directly, because after a crash such a transaction might not have been
+ * marked as aborted.  See detailed comments at snapmgr.c where the variable
+ * is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..2f52b407c6 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..9f1ecd123f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for concurrent
+ * aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8c34935c34..9d890d3c4b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0
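
To illustrate the rule spelled out in the documentation change above (catalog
access from output plugins must go through the systable_* APIs), here is a
sketch of how such a lookup might look in a plugin callback. The helper
scan_user_catalog and its argument are hypothetical; only the systable_*
calls matter:

#include "postgres.h"

#include "access/genam.h"
#include "access/table.h"
#include "storage/lockdefs.h"
#include "utils/rel.h"

/*
 * Hypothetical helper: scan a user catalog table from an output plugin
 * callback.  If the streamed transaction aborts concurrently,
 * systable_getnext() raises ERRCODE_TRANSACTION_ROLLBACK (through
 * HandleConcurrentAbort), which the reorderbuffer catches in order to
 * stop the stream gracefully.
 */
static void
scan_user_catalog(Oid catalog_relid)
{
	Relation	rel = table_open(catalog_relid, AccessShareLock);
	SysScanDesc scan = systable_beginscan(rel, InvalidOid, false,
										  NULL, 0, NULL);
	HeapTuple	tup;

	while (HeapTupleIsValid(tup = systable_getnext(scan)))
	{
		/* inspect the tuple; a direct heap_getnext() here would elog */
	}

	systable_endscan(scan);
	table_close(rel, AccessShareLock);
}

The new elog guards in the tableam wrappers enforce exactly this at runtime:
any table_* or heap_* access with CheckXidAlive set and bsysscan unset is
reported as an unexpected call during logical decoding.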

v23-0002-Issue-individual-invalidations-with-wal_level-lo.patch (application/octet-stream)
From 44afef284200ccb48460ff7d819930858658a466 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v23 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in
memory and writes them only once, at commit time, which reduces the
performance impact by amortizing the overhead and deduplicating the
invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c        |  40 +++++++
 src/backend/access/transam/xact.c             |   7 ++
 src/backend/replication/logical/decode.c      |  16 +++
 .../replication/logical/reorderbuffer.c       | 104 +++++++++++++++---
 src/backend/utils/cache/inval.c               |  49 +++++++++
 src/include/access/xact.h                     |  13 ++-
 src/include/replication/reorderbuffer.h       |  11 ++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce75565f..7ab0d11ea9 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3af8e81af1..e576b10055 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581d0f..69c1f45ef6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9509..b889edf461 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2202,6 +2216,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Queue the invalidation messages as a change in the specified transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2589,6 +2630,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3002,6 +3066,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..cba5b6c64b 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of in-progress transactions.  Until now it was
+ *	enough to log invalidations only at commit, because we only decoded the
+ *	transaction at commit time.  We only need to log the catalog cache and
+ *	relcache invalidations; there cannot be any active MVCC scan in logical
+ *	decoding, so we don't need to log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages = NULL;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs  = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b3816c..b822c5e4b2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..af35287896 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.23.0
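
For reference, a decoder-side consumer can walk the new record the same way
the descriptor routine above does. The following hypothetical helper is only
an illustration of the record layout (the field accesses mirror
xact_desc_invalidations):

#include "postgres.h"

#include "access/xact.h"
#include "access/xlogreader.h"
#include "storage/sinval.h"

/* Hypothetical: dump the messages in one XLOG_XACT_INVALIDATIONS record. */
static void
walk_invalidations(XLogReaderState *r)
{
	xl_xact_invalidations *invals = (xl_xact_invalidations *) XLogRecGetData(r);
	int			i;

	for (i = 0; i < invals->nmsgs; i++)
	{
		SharedInvalidationMessage *msg = &invals->msgs[i];

		if (msg->id >= 0)
			elog(DEBUG1, "catcache inval, cache id %d", msg->id);
		else if (msg->id == SHAREDINVALCATALOG_ID)
			elog(DEBUG1, "whole-catalog inval, catalog %u", msg->cat.catId);
		else if (msg->id == SHAREDINVALRELCACHE_ID)
			elog(DEBUG1, "relcache inval, relation %u", msg->rc.relId);
	}
}

This is also the shape in which the decoded invalidations travel through the
reorderbuffer: ReorderBufferAddInvalidation copies
nmsgs * sizeof(SharedInvalidationMessage) bytes into a
REORDER_BUFFER_CHANGE_INVALIDATION change, and
ReorderBufferExecuteInvalidations later replays them one by one via
LocalExecuteInvalidationMessage.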

v23-0003-Extend-the-output-plugin-API-with-stream-methods.patch (application/octet-stream)
From f026ef1bafb9aa91ebfac980e37c43b83d26c562 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v23 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 213 +++++++++++++
 src/backend/replication/logical/logical.c | 365 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 811 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe620..1b56daa4bb 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     callbacks are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
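+
+    <para>
+     For example, an output plugin might register the streaming callbacks
+     from its <function>_PG_output_plugin_init</function> function like this
+     (a minimal sketch in the style of the <filename>test_decoding</filename>
+     changes above; the <function>my_stream_*</function> functions are
+     hypothetical):
+<programlisting>
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+    /* ... regular callbacks (begin_cb, change_cb, commit_cb, ...) ... */
+
+    /* required for streaming of in-progress transactions */
+    cb->stream_start_cb = my_stream_start;
+    cb->stream_stop_cb = my_stream_stop;
+    cb->stream_change_cb = my_stream_change;
+    cb->stream_commit_cb = my_stream_commit;
+    cb->stream_abort_cb = my_stream_abort;
+
+    /* optional */
+    cb->stream_message_cb = my_stream_message;
+    cb->stream_truncate_cb = my_stream_truncate;
+}
+</programlisting>
+    </para>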
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and
+    <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
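+
+   <para>
+    For instance (a purely hypothetical illustration), two interleaved
+    streamed transactions, one of which eventually aborts, might produce:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of a block of changes for transaction #1
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of the block for transaction #1
+
+stream_start_cb(...);   &lt;-- start of a block of changes for transaction #2
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of the block for transaction #2
+
+stream_abort_cb(...);   &lt;-- abort of streamed transaction #2
+
+stream_start_cb(...);   &lt;-- another block of changes for transaction #1
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of that block
+
+stream_commit_cb(...);  &lt;-- commit of streamed transaction #1
+</programlisting>
+   </para>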
+
+   <para>
+    Similar to the spill-to-disk behavior, streaming is triggered when the
+    total amount of changes decoded from the WAL (for all in-progress
+    transactions) exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting. At that point, the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed. However, in some
+    cases we still have to spill to disk even if streaming is enabled,
+    because we may exceed the memory limit before a complete tuple has been
+    decoded, e.g. when the TOAST table insert has been decoded but the main
+    table insert has not.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index dc69e5ce5f..0cff1ac393 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require change/commit/abort/start/stop
+	 * callbacks. The message and truncate callbacks are optional, similar
+	 * to regular output plugins. We however consider streaming enabled
+	 * when at least one of the callbacks is defined, so that missing (but
+	 * required) callbacks can be detected easily.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * Streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional, so we
+	 * do not fail with an ERROR when they are missing; the wrappers simply
+	 * do nothing in that case. We must still set all the ReorderBuffer
+	 * callbacks, otherwise the calls from the reorder buffer would crash
+	 * (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,321 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if the transaction is streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from an in-progress
+ * transaction to a remote node (may be called repeatedly, if the
+ * transaction is streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287896..65814af9f5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
 struct ReorderBuffer
 {
 	/*
@@ -392,6 +440,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0

v23-0009-Enable-streaming-for-all-subscription-TAP-tests.patch
From e162f78c1c2b678a0de9684aedcd5262aa6123b1 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v23 09/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71ca3..086d0c7f02 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -70,7 +70,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f871a..4ba80869b9 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

v23-0008-Add-support-for-streaming-to-built-in-replicatio.patch
From 2278bf2ff100d75f7853a740d068cf71ec9c93c3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 14 May 2020 21:27:46 +0530
Subject: [PATCH v23 08/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information
(e.g. the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions, by spilling the data to disk and then
replaying them on commit.

We must however explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere
to send the data anyway.
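
As a sketch of the protocol extension (this mirrors the pattern used
in the proto.c changes below), every message written while streaming
is prefixed with the transaction ID, so the apply worker can tell
which in-progress transaction each change belongs to:

    /* transaction ID (if not valid, we're not streaming) */
    if (TransactionIdIsValid(xid))
        pq_sendint32(out, xid);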
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   12 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/launcher.c    |    1 -
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1043 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 ++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 21 files changed, 2051 insertions(+), 41 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8beadbaa..95b7c24ef9 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..3349cc4bfc 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026187..9065a1be1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+	{
+		*streaming = false;
+		*streaming_given = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e246be388b..90182a0181 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4139,6 +4139,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e987..8156a42ace 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #include "access/heapam.h"
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..5242ac0efe 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID of the streamed transaction (must be valid) */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* XIDs of the toplevel and the aborted (sub)transaction (must be valid) */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
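
To summarize the protocol additions above, the new messages are framed as
follows (after the initial action byte):

    'S' (stream start):  int32 xid, int8 first_segment
    'E' (stream stop):   no payload
    'c' (stream commit): int32 xid, int8 flags, int64 commit_lsn,
                         int64 end_lsn, int64 commit_time
    'A' (stream abort):  int32 xid, int32 subxid

and the existing messages ('I', 'U', 'D', 'T', 'R', 'Y') gain a leading
int32 xid of the (sub)transaction, included only while streaming.
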
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..68d08631be 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, the apply code has to handle
+ * aborts of both the toplevel transaction and of subtransactions. This
+ * is achieved by tracking the offset of each subtransaction's first
+ * change in the file, which is then used to truncate the file with
+ * serialized changes on abort.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId	xid;			/* XID of the subxact */
+	off_t			offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because apply_handle_stream_commit calls apply_dispatch */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a chunk of a streamed transaction), we
+ * simply redirect the change to the file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +657,329 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, restore the subxact info for the
+	 * transaction from the existing file.
+	 *
+	 * XXX Note that the cleanup of leftover files (for the first segment)
+	 * is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		char		path[MAXPGPATH];
+
+		/*
+		 * XXX Maybe this should be an error instead? Can we receive abort for
+		 * a toplevel transaction we haven't received?
+		 */
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (unlink(path) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m", path)));
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+		if (!found)
+		{
+			/* Clean up the subxact-related info. */
+			if (subxacts)
+				pfree(subxacts);
+
+			subxacts = NULL;
+			subxact_last = InvalidTransactionId;
+			nsubxacts = 0;
+			nsubxacts_max = 0;
+
+			return;
+		}
+
+		/* OK, truncate the file at the right offset. */
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+		if (truncate(path, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +993,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1011,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1050,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1168,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1313,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1686,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1827,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1939,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleaning up files for %d streamed transactions", nxids);
+
+	for (i = nxids - 1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1971,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2422,567 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we free the memory allocated for the subxact info. There might be
+	 * one exceptional transaction with many subxacts, and we don't want to
+	 * keep that memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so we can simply skip it (its offset is already tracked
+	 * and this change necessarily comes later in the file).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+			 tempdirpath, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/logical-%u-%u.changes",
+			 tempdirpath, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - don't report an error for a missing file if the flag is true.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a "streaming" block, i.e. when
+ * processing the stream_stop message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length field itself), action code (identifying the message type) and
+ * message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3148,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
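
For reviewers, here is a minimal standalone sketch (not part of the patch,
file name invented) of a reader for the spool-file format produced by
stream_write_change() above: each record is a native-endian int length
(covering the action byte plus the message body), the action byte, and the
body without the subxact XID. The companion .subxacts file is simply a
CRC32C checksum, an item count, and an array of {xid, offset} pairs.

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        /* hypothetical name; real files are logical-<subid>-<xid>.changes */
        FILE   *f = fopen("logical-16394-1234.changes", "rb");
        int     len;

        if (f == NULL)
            return 1;

        /* each record: int length, char action, message body */
        while (fread(&len, sizeof(len), 1, f) == 1)
        {
            char   *buf = malloc(len);

            if (buf == NULL || fread(buf, 1, len, f) != (size_t) len)
            {
                fprintf(stderr, "truncated record\n");
                free(buf);
                break;
            }

            /* buf[0] is the action ('I', 'U', 'D', ...), rest is the body */
            printf("action '%c', %d bytes\n", buf[0], len);
            free(buf);
        }

        fclose(f);
        return 0;
    }
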
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3118..a94b4a0136 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,54 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may be different
+ * from the order in which the transactions are sent. So streamed
+ * transactions are tracked separately, using the streamed_txns list below.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +119,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +145,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +208,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +237,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +260,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +281,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +370,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's the top-level transaction or a subxact (the top-level XID
+	 * was already sent at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those are applied only later (and the regular
+	 * transactions won't see their effects until then), and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+	{
+		/*
+		 * TOCHECK: We have to send the schema after each catalog change, and
+		 * such a change may occur after streaming has already started, so we
+		 * have to track new catalog changes somehow.
+		 */
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	}
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +427,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +469,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +490,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +522,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +542,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +566,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +586,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +611,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +643,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +723,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +844,34 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -753,11 +1002,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
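
From the perspective of a plugin consumer, a large transaction now arrives
as a sequence of callbacks along these lines:

    stream_start(xid = X, first_segment = true)
        stream_change ... stream_change
    stream_stop
    stream_start(xid = X, first_segment = false)
        stream_change ... stream_change
    stream_stop
    ...
    stream_commit(X)    -- or stream_abort(X, subxid)

with each start/stop pair demarcating one chunk of changes evicted from the
reorder buffer.
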
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6fed3cfd23..e1344ab4cc 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -157,6 +157,12 @@ create_logical_replication_slot(char *name, char *plugin,
 											   .segment_close = wal_segment_close),
 									NULL, NULL, NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index b84145ad85..5d23691930 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1020,6 +1020,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ae9a39573c..70826c1cef 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -980,7 +980,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ac1acbb27e..95132062c6 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of a large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of a large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check data from rolled-back subtransactions was discarded');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction containing subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check data from rolled-back subtransactions was discarded');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

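A note for anyone trying the TAP tests above by hand: they trigger streaming
purely by setting a very low logical_decoding_work_mem on the publisher, so
any transaction above 64kB gets streamed. A minimal manual setup looks like
this (just a sketch -- the "streaming" subscription option on the last line
is how I'd expect the substream flag added above to be exposed at the SQL
level; the tests themselves rely only on the publisher-side limit):

    -- publisher
    ALTER SYSTEM SET logical_decoding_work_mem = '64kB';
    SELECT pg_reload_conf();
    CREATE PUBLICATION big_pub FOR TABLE test_tab;

    -- subscriber (assumes test_tab already exists locally)
    CREATE SUBSCRIPTION big_sub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION big_pub
        WITH (streaming = on);

With that in place, any write transaction on test_tab exceeding the limit
should be streamed to the subscriber before it commits.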
v23-0007-Track-statistics-for-streaming.patchapplication/octet-stream; name=v23-0007-Track-statistics-for-streaming.patchDownload
From 0c34d0123927164dfde5b272af939115e687f3d9 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Tue, 19 May 2020 19:08:16 +0530
Subject: [PATCH v23 07/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 33 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 12 +++++++
 src/backend/replication/walsender.c           | 32 +++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 87502a49b6..5b64410e7d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2404,6 +2404,39 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Amount of decoded transaction data spilled to disk.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_txns</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of in-progress transactions streamed to the subscriber after
+       the memory used by logical decoding exceeded
+       <literal>logical_decoding_work_mem</literal>. Streaming only works
+       with toplevel transactions (subtransactions cannot be streamed
+       independently), so the counter is not incremented for subtransactions.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times in-progress transactions were streamed to the
+       subscriber. Transactions may be streamed repeatedly, and this counter
+       is incremented on every such invocation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_bytes</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Amount of decoded in-progress transaction data streamed to the subscriber.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2bd5f5ea14..8f34ce8deb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fe2d0011c4..5c211d0c70 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3475,6 +3479,14 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferFreeSnap(rb, txn->snapshot_now);
 	}
 
+	/* Update the stream statistics. */
+	rb->streamCount += 1;
+	rb->streamBytes += (rbtxn_has_incomplete_tuple(txn)) ?
+						txn->complete_size : txn->total_size;
+
+	/* Don't count the transaction again if it has already been streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	/*
 	 * Access the main routine to decode the changes and send to output plugin.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a4ca8daea7..b84145ad85 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1353,7 +1353,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1374,7 +1374,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that were spilled to disk or
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2423,6 +2424,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3258,7 +3262,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3316,6 +3320,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3341,6 +3348,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3443,6 +3453,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3691,11 +3706,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9edae40ed8..5a8826cc67 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a9b1aacdb1..1ced4caaae 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -541,15 +541,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 8876025aaa..0c4952a1fa 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0

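To watch these counters move, run a large transaction on the publisher and
query pg_stat_replication while it is being decoded. The column names below
are exactly the ones added by this patch; the values will of course depend
on the workload and on logical_decoding_work_mem:

    SELECT application_name,
           spill_txns, spill_count, spill_bytes,
           stream_txns, stream_count, stream_bytes
    FROM pg_stat_replication;

A transaction that is streamed repeatedly bumps stream_count on every
invocation but stream_txns only once, matching the comments in
reorderbuffer.h.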
v23-0006-Bugfix-handling-of-incomplete-toast-spec-insert-.patchapplication/octet-stream; name=v23-0006-Bugfix-handling-of-incomplete-toast-spec-insert-.patchDownload
From 8246bfc47e26ceceac35eba7bcad4eba79f09ad7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Tue, 19 May 2020 18:55:23 +0530
Subject: [PATCH v23 06/12] Bugfix handling of incomplete toast/spec insert
 tuple

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 324 ++++++++++++------
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  39 ++-
 5 files changed, 277 insertions(+), 107 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2d77107c4f..3927448f46 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45ef6..c841687c66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2cdfb348af..fe2d0011c4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -254,6 +269,8 @@ static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
 static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static inline void ReorderBufferTXNDeleteChange(ReorderBufferTXN *txn,
+												ReorderBufferChange *change);
 
 /* ---------------------------------------
  * toast reassembly support
@@ -646,12 +663,71 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 	return txn;
 }
 
+/*
+ * Handle incomplete tuples during streaming.  When streaming is enabled we
+ * may need to stream an in-progress transaction, but some changes are
+ * incomplete and cannot be streamed until the change completing them
+ * arrives, e.g. a toast-table insert without the main-table insert.  So
+ * this function remembers the LSN of the last complete change, and the
+ * transaction size up to that LSN, so that when we do stream we stream
+ * only up to the last complete LSN.
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert)
+{
+	/* If streaming is not enabled then there is nothing to do. */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		txn = txn->toptxn;
+
+	/*
+	 * If this is the first incomplete change, remember the size of the
+	 * complete part of the transaction so far.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(txn)) &&
+		(toast_insert || IsSpecInsert(change->action)))
+		txn->complete_size = txn->total_size;
+
+	/*
+	 * If this is a toast insert then set the corresponding bit.  Both
+	 * inserts and updates may write to the toast table, and as explained
+	 * in the function header we cannot stream toast changes on their own.
+	 * So we set the flag whenever we get a toast insert, and clear it on
+	 * the next insert or update of the main table.
+	 */
+	if (toast_insert)
+		txn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) && IsInsertOrUpdate(change->action))
+		txn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial tuple and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		txn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		txn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If no incomplete change remains after this one, record this LSN as
+	 * the last complete LSN.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(txn)))
+		txn->last_complete_lsn = change->lsn;
+}
+
 /*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
@@ -660,6 +736,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	change->lsn = lsn;
 	change->txn = txn;
 
+	/* Handle the incomplete tuple if it's a toast/spec insert */
+	ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert);
+
 	Assert(InvalidXLogRecPtr != lsn);
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries++;
@@ -697,7 +776,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1412,6 +1491,30 @@ static void
 ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
 	dlist_mutable_iter iter;
+	ReorderBufferTXN *toptxn;
+
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1438,30 +1541,28 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
-		/* remove the change from it's containing list */
-		dlist_delete(&change->node);
+		/* We have truncated up to the last complete LSN, so stop. */
+		if (rbtxn_has_incomplete_tuple(toptxn) &&
+			(change->lsn > toptxn->last_complete_lsn))
+		{
+			/*
+			 * If this is a top-level transaction then we can reset
+			 * last_complete_lsn and complete_size, because by now we have
+			 * streamed all the changes up to last_complete_lsn.
+			 */
+			if (txn->toptxn == NULL)
+			{
+				toptxn->last_complete_lsn = InvalidXLogRecPtr;
+				toptxn->complete_size = 0;
+			}
+			break;
+		}
 
+		/* remove the change from its containing list */
+		ReorderBufferTXNDeleteChange(txn, change);
 		ReorderBufferReturnChange(rb, change);
 	}
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The toplevel transaction, identified by (toptxn==NULL), is marked
-	 * as streamed always, even if it does not contain any changes (that
-	 * is, when all the changes are in subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts
-	 * for XIDs the downstream is not aware of. And of course, it always
-	 * knows about the toplevel xact (we send the XID in all messages),
-	 * but we never stream XIDs of empty subxacts.
-	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
 	 * any memory. We could also keep the hash table and update it with
@@ -1473,9 +1574,15 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
-	/* also reset the number of entries in the transaction */
-	txn->nentries_mem = 0;
-	txn->nentries = 0;
+	/*
+	 * If this txn is serialized and there are no more entries on disk,
+	 * clean up the disk space.
+	 */
+	if (rbtxn_is_serialized(txn) && (txn->nentries == txn->nentries_mem))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
 }
 
 /*
@@ -1732,6 +1839,20 @@ ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					change->data.msg.message);
 }
 
+/*
+ * While streaming a transaction we cannot always stream all the changes,
+ * because of incomplete tuples.  So whenever we delete a change from the
+ * change list we must keep the entry counts up to date.
+ */
+static inline void
+ReorderBufferTXNDeleteChange(ReorderBufferTXN *txn, ReorderBufferChange *change)
+{
+	/* Delete the node and decrement the nentries_mem and nentries count. */
+	dlist_delete(&change->node);
+	change->txn->nentries_mem--;
+	change->txn->nentries--;
+}
+
 /*
  * Function to store the command id and snapshot at the end of the current
  * stream so that we can reuse the same while sending the next stream.
@@ -1955,8 +2076,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						Assert(change->data.tp.newtuple != NULL);
-
-						dlist_delete(&change->node);
+						ReorderBufferTXNDeleteChange(change->txn, change);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
 					}
@@ -2002,8 +2122,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						specinsert = NULL;
 					}
 
-					/* and memorize the pending insertion */
-					dlist_delete(&change->node);
+					/*
+					 * Remove from the change list and memorize the pending
+					 * insertion
+					 */
+					ReorderBufferTXNDeleteChange(change->txn, change);
 					specinsert = change;
 					break;
 
@@ -2118,6 +2241,15 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
 			}
+
+			/*
+			 * If the transaction contains an incomplete tuple and this is
+			 * the last complete change, then stop further processing of
+			 * the transaction.
+			 */
+			if (rbtxn_has_incomplete_tuple(txn) &&
+				prev_lsn == txn->last_complete_lsn)
+				break;
 		}
 
 		/*
@@ -2515,7 +2647,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2564,7 +2696,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2587,6 +2719,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2601,8 +2734,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
-	/* if subxact, and streaming supported, use the toplevel instead */
+	/* If streaming is supported, keep track of the top-level transaction too. */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2610,12 +2748,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2676,7 +2822,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2860,18 +3006,28 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+	Size		largest_size = 0;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
+		Size		size;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/*
+		 * If this transaction has incomplete changes then only consider
+		 * the size up to the last complete LSN.
+		 */
+		if (rbtxn_has_incomplete_tuple(txn))
+			size = txn->complete_size;
+		else
+			size = txn->total_size;
+
+		/* If the current transaction is larger, then remember it. */
+		if ((largest == NULL || size > largest_size) && size > 0)
+		{
 			largest = txn;
+			largest_size = size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2889,66 +3045,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
-	{
-		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
-		 */
-		txn = ReorderBufferLargestTopTXN(rb);
-
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
-
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
+	/* Loop until we get below the memory limit again. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferSerializeTXN(rb, txn);
-	}
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
+			ReorderBufferSerializeTXN(rb, txn);
+
+			/*
+			 * After eviction, the transaction should have no entries in
+			 * memory, and should use 0 bytes for changes.
+			 */
+			Assert(txn->size == 0);
+			Assert(txn->nentries_mem == 0);
+		}
+	}
 
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
@@ -3344,10 +3480,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
-
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
 }
 
 /*
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b3e2b3f64b..a9b1aacdb1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert without a main-table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +221,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -350,6 +368,15 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* Size of the complete changes. */
+	Size		complete_size;
+
+	/* LSN of the last complete change. */
+	XLogRecPtr	last_complete_lsn;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -537,7 +564,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool incomplete_data);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0

v23-0010-Add-TAP-test-for-streaming-vs.-DDL.patchapplication/octet-stream; name=v23-0010-Add-TAP-test-for-streaming-vs.-DDL.patchDownload
From 4ecada99a766acdb480d5540b82b31c6d0376ba4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v23 10/12] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v23-0011-Provide-new-api-to-get-the-streaming-changes.patchapplication/octet-stream; name=v23-0011-Provide-new-api-to-get-the-streaming-changes.patchDownload
From 7c1221ecb0592d78817d1a560709aeb7b2c1b489 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v23 11/12] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 doc/src/sgml/test-decoding.sgml               | 22 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..eed6e9d134 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8f34ce8deb..dd488cb2f8 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1242,6 +1242,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e848..70c28ffa91 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes, disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5a8826cc67..586c9621e2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10115,6 +10115,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0

v23-0012-Add-streaming-option-in-pg_dump.patchapplication/octet-stream; name=v23-0012-Add-streaming-option-in-pg_dump.patchDownload
From f6580f4e534bc06bed24b6e21f00e150bda943bc Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v23 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index f33c2463a7..b6ae988b02 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4209,6 +4209,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4243,8 +4244,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4257,6 +4258,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4273,6 +4275,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4350,6 +4353,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBufferStr(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5f70400b25..3ccb6be953 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0

#318Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#308)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Review comments:
------------------------------
1.
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
TransactionId xid,
}

case REORDER_BUFFER_CHANGE_MESSAGE:
- rb->message(rb, txn, change->lsn, true,
- change->data.msg.prefix,
- change->data.msg.message_size,
- change->data.msg.message);
+ if (streaming)
+ rb->stream_message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);
+ else
+ rb->message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);

Don't we need to set any_data_sent flag while streaming messages as we
do for other types of changes?

I think any_data_sent was added to avoid sending an abort to the
subscriber if we haven't sent any data, but this is incomplete, as
the output plugin can also decide not to send anything. So I think
this should not be done as part of this patch and can be done
separately. I think there is already a thread for handling the
same[1]

Hmm, but prior to this patch, we never used to send (empty) aborts,
but now that will be possible. It is probably okay to deal with that
in the other patch you mention, but I felt at least any_data_sent
would work for some cases. OTOH, it appears to be a half-baked
solution, so we should probably refrain from adding it. BTW, how does
the pgoutput plugin deal with it? I see that apply_handle_stream_abort
will unconditionally try to unlink the file, and it will probably fail.
Have you tested this scenario after your latest changes?

Yeah, I see. I think this is a problem, but it exists without my
latest change as well: if pgoutput ignores some changes because they
are not published, we will see a similar error. Shall we handle the
ENOENT error case from unlink? I think the best idea is that we
track empty transactions.
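
Handling the ENOENT case would look roughly like the sketch below
(stream_abort_filename() here is a hypothetical stand-in for the
actual path-building code):

char	path[MAXPGPATH];

stream_abort_filename(path, subid, xid);

/* The file may not exist if nothing was streamed for this xact. */
if (unlink(path) < 0 && errno != ENOENT)
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not remove file \"%s\": %m", path)));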

4.
In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
the try and catch block. If there is an error after calling it in a
try block, we might call it again via catch. I think that will lead
to sending a stop message twice. Won't that be a problem? See the
usage of iterstate in the catch block, we have made it safe from a
similar problem.

IMHO, we don't need that, because we only call stream_stop in the
catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK. So if
we have already stopped the stream in the TRY block, we should not get
that error. I have added comments to that effect.

I am still slightly nervous about it, as I don't see any solid
guarantee for the same. You are right as the code stands today, but
code added in the future might not keep it true. I feel it is better
to have an Assert here to ensure that stream_stop won't be called a
second time. I don't see any good way of doing it other than
maintaining a flag or some state, but I think it will be good to
ensure this.

Done
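
To make that concrete, the guard is roughly this (a sketch; the flag
name and the callback arguments are abbreviated, not the exact patch
code):

/* Set when we send stream_start, cleared when we send stream_stop. */
static bool stream_started = false;

static void
stream_stop_internal(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	/* We must never send stream_stop twice for the same stream. */
	Assert(stream_started);

	rb->stream_stop(rb, txn);	/* callback args abbreviated */
	stream_started = false;
}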

6.
PG_CATCH();
{
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData  *errdata = CopyErrorData();

I don't understand the usage of memory context in this part of the
code. Basically, you are switching to CurrentMemoryContext here, do
some error handling and then again reset back to some random context
before rethrowing the error. If there is some purpose for it, then it
might be better if you can write a few comments to explain the same.

Basically, ccxt is the CurrentMemoryContext when we started the
streaming, and ecxt is the context when we catch the error. So,
before this change, it would rethrow in the context in which we
caught the error, i.e. ecxt. What we are trying to do is switch back
to the normal context (ccxt) and copy the error data in the normal
context. And, if we are not handling the error gracefully, we put it
back into the context it was in and rethrow.

Okay, but when errorcode is *not* ERRCODE_TRANSACTION_ROLLBACK, don't
we need to clean up the reorderbuffer by calling
ReorderBufferCleanupTXN? If so, then you can try to combine it with
the not-streaming else loop.

Done
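
The catch block now looks roughly like this (condensed sketch; ccxt,
rb and txn come from the surrounding function, and some cleanup is
elided):

PG_CATCH();
{
	/* ccxt is the context we were in when streaming started */
	MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
	ErrorData  *errdata = CopyErrorData();	/* copied into ccxt */

	if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
	{
		/* concurrent abort of the decoded xact; handle it gracefully */
		FlushErrorState();
		FreeErrorData(errdata);
	}
	else
	{
		/* everything else: clean up, restore the context, rethrow */
		ReorderBufferCleanupTXN(rb, txn);
		MemoryContextSwitchTo(ecxt);
		PG_RE_THROW();
	}
}
PG_END_TRY();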

8.
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * TOCHECK: Mark toplevel transaction as having catalog changes too
+ * if one of its children has.
+ */
+ if (txn->toptxn != NULL)
+ txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}

Why are we marking top transaction here?

We need to mark the top transaction to decide whether to build the
tuplecid hash or not. In non-streaming mode, we only send at commit
time, and at commit time we know whether the top transaction has any
catalog changes based on the invalidation messages, so we mark the top
transaction there, in DecodeCommit. Since here we are not waiting
until commit, we need to mark the top transaction as soon as we mark
any of its child transactions.

But how does it help? We use this flag (via
ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn, which is
anyway done in DecodeCommit, and that too after setting this flag for
the top transaction if required. So how will it help to set it while
processing the subxid? Also, even if we have to do it, won't it add
the xid needlessly to the builder->committed.xip array?

In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
to build the tuplecid hash, based on whether the transaction has
catalog changes.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#319Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#312)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 19, 2020 at 4:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, May 19, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3.
+ /*
+ * If streaming is enable and we have serialized this transaction because
+ * it had incomplete tuple.  So if now we have got the complete tuple we
+ * can stream it.
+ */
+ if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+ && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
+ {

This comment just says what you are doing in the if-check. I think
you need to explain the rationale behind it. I don't like the
variable name 'can_stream' because it matches ReorderBufferCanStream
even though it is for a different purpose; how about naming it
'change_complete' or something like that? The check has many
conditions; can we move it to a separate function to make the code
here look clean?

Do we really need this? Immediately after this check, we are calling
ReorderBufferCheckMemoryLimit which will anyway stream the changes if
required.

Actually, ReorderBufferCheckMemoryLimit is only meant for checking
whether we need to stream the changes due to the memory limit. But
suppose the memory limit was exceeded at a time when we could not
stream the transaction because there was only an incomplete toast
insert, so we serialized. Now, when we get the tuple that makes the
changes complete, we are no longer crossing the memory limit, because
the changes were already serialized. So I am not sure whether it is a
good idea to stream the transaction as soon as we get the complete
changes, or whether we should wait until the next time the memory
limit is exceeded and select a suitable candidate then.

I think it is better to wait till next time we exceed the memory threshold.

Okay, done this way.

Ideally, if we are in streaming mode and the transaction got
serialized, it was already a candidate for streaming but could not be
streamed due to the incomplete changes. So shouldn't we stream it
immediately, as soon as its changes are complete, even though we are
now under the memory limit?

The only time we need to stream or spill is when we exceed the memory
threshold. In the above case, it is possible that next time there is
some other candidate transaction that we can stream.

Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:

+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {
+ toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+ can_stream = true;
+ }
..
+#define ChangeIsInsertOrUpdate(action) \
+ (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+ ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+ ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))

How can we clear the RBTXN_HAS_TOAST_INSERT flag on
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?

Partial toast insert means we have inserted in the toast but not in
the main table. So even if it is spec insert we can form the complete
tuple, however, we can still not stream it because we haven't got
spec_confirm but for that, we are marking another flag. So if the
insert is aspect insert the toast insert will also be spec insert and
as part of that toast, spec inserts we are marking partial tuple so
cleaning that flag should happen when the spec insert is done for the
main table right?

Sounds reasonable.

ok
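
To summarize, the flag transitions are roughly as below (a condensed
sketch; IsToastInsert() stands in for the actual toast-relation check
on the change):

/* an insert into a toast relation makes the toplevel txn incomplete */
if (IsToastInsert(change))
	toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
else if (rbtxn_has_toast_insert(toptxn) &&
		 ChangeIsInsertOrUpdate(change->action))
{
	/* the main-table change has arrived, the tuple is complete now */
	toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
}

/* a speculative insert stays incomplete until its confirm record */
if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
	toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM)
	toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;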

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#320Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#313)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 19, 2020 at 5:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

4.
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_start";
+ /* state.report_location = apply_lsn; */

Why can't we supply the report_location here? I think here we need to
report txn->first_lsn if this is the very first stream and
txn->final_lsn if it is any consecutive one.

Done

Now, after your change in stream_start_cb_wrapper, we assign
report_location from the first_lsn passed as input to the function,
but write_location is still txn->first_lsn. Shouldn't we assign the
passed-in first_lsn to write_location? It seems assigning
txn->first_lsn won't be correct for streams other than the first one.

Done
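
That is, inside stream_start_cb_wrapper the relevant assignments are
now (fragment, for illustration):

/*
 * Report the start LSN of this particular stream; txn->first_lsn
 * would only be correct for the very first stream of the transaction.
 */
state.report_location = first_lsn;
ctx->write_location = first_lsn;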

5.
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_stop";
+ /* state.report_location = apply_lsn; */

Can't we report txn->final_lsn here

We are already setting this to txn->final_lsn in the 0006 patch, but I
have moved it into this patch now.

Similar to the previous point, here also I think we need to assign the
report and write locations from the last_lsn passed to this API.

Done

v20-0005-Implement-streaming-mode-in-ReorderBuffer
-----------------------------------------------------------------------------
10.
Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

I don't think this part of the commit message is correct as we
sometimes need to spill even during streaming. Please check the
entire commit message and update according to the latest
implementation.

Done

You seem to have forgotten to remove the other part of the message
("This adds a second iterator for the streaming case ...."), which is
not relevant now.

Done

11.
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
*/
static void
ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
*rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;

- if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
- return;
-

I don't understand this change. Why could "INSERT followed by
TRUNCATE" lead to a tuple that comes up for decoding before its CID?
The patch has made changes based on this assumption in
HeapTupleSatisfiesHistoricMVCC, which appears to be very risky, as the
behavior could depend on whether we are streaming the changes for an
in-progress xact or at the commit of a transaction. We might want
to generate a test to validate this behavior once.

Also, the comment refers to tqual.c which is wrong as this API is now
in heapam_visibility.c.

Done.

+ * INSERT.  So in such cases we assume the CIDs is from the future command
+ * and return as unresolve.
+ */
+ if (tuplecid_data == NULL)
+ return false;
+

Here lets reword the last line of comment as ". So in such cases we
assume the CID is from the future command."

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#321Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#314)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3.
And, during catalog scan we can check the status of the xid and
+ * if it is aborted we will report a specific error that we can ignore.  We
+ * might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the abort we will
+ * stream abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)

In the above comment, I don't think it is right to say that we ignore
the error raised due to the aborted transaction. We need to say that
we discard the already streamed changes on such an error.

Done.

In the same comment, there is a typo (/messageto/message to).

Done

4.
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
/*
- * If this transaction has no snapshot, it didn't make any changes to the
- * database, so there's nothing to decode.  Note that
- * ReorderBufferCommitChild will have transferred any snapshots from
- * subtransactions if there were any.
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.
*/
- if (txn->base_snapshot == NULL)
+ if (!TransactionIdDidCommit(xid))
{
- Assert(txn->ninvalidations == 0);
- ReorderBufferCleanupTXN(rb, txn);
- return;
+ CheckXidAlive = xid;
+ bsysscan = false;
}

I think this function is inline as it needs to be called for each
change. If that is the case, and otherwise also, isn't it better to
check whether the passed xid is the same as CheckXidAlive before
checking TransactionIdDidCommit? TransactionIdDidCommit can be costly,
and calling it for each change might not be a good idea.

Done. Also, I think it is good to check TransactionIdIsInProgress
instead of !TransactionIdDidCommit. I have changed that as well.

What if it is aborted just before this check? I think the decode API
won't be able to detect that, and the sys* APIs won't care to check,
because CheckXidAlive won't be set in that case.

Yeah, that's the problem, I think it should be TransactionIdDidCommit only.
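
So the function ends up shaped roughly like this (sketch):

static inline void
SetupCheckXidLive(TransactionId xid)
{
	/*
	 * Avoid the (comparatively costly) TransactionIdDidCommit() call
	 * for every change when we have already set up this xid.
	 */
	if (TransactionIdEquals(CheckXidAlive, xid))
		return;

	/*
	 * Setup CheckXidAlive if the xid is not committed yet; whether it
	 * is aborted will be checked during catalog access.
	 */
	if (!TransactionIdDidCommit(xid))
	{
		CheckXidAlive = xid;
		bsysscan = false;
	}
}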

5.
setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.

/if the xid aborted/if the xid is aborted. missing comma after Also.

Done

You forgot to change as per the second part of the comment (missing
comma after Also).

Done

8.
@@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
* use as a normal record. It'll be cleaned up at the end
* of INSERT processing.
*/
- if (specinsert == NULL)
- elog(ERROR, "invalid ordering of speculative insertion changes");

You have removed this check, but all other handling of specinsert is
the same as far as this patch is concerned. Why so?

Seems like a merge issue, or a leftover from the old design of the
toast handling, where we were streaming with the partial tuple.
Fixed now.

9.
@@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
* freed/reused while restoring spooled data from
* disk.
*/
- Assert(change->data.tp.newtuple != NULL);
-
dlist_delete(&change->node);

Why is this Assert removed?

Same cause as above, so fixed.

10.
@@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relations[nrelations++] = relation;
}

- rb->apply_truncate(rb, txn, nrelations, relations, change);
+ if (streaming)
+ {
+ rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+ /* Remember that we have sent some data. */
+ change->txn->any_data_sent = true;
+ }
+ else
+ rb->apply_truncate(rb, txn, nrelations, relations, change);

Can we encapsulate this in a separate function like
ReorderBufferApplyTruncate or something like that? Basically, rather
than having the streaming check in this function, let's do it in some
other internal function. And we can likewise do it for all the
streaming checks in this function, or at least wherever it is
feasible. That will make this function look clean.

Done for truncate and change. I think we can create a few more such
functions for start/stop and cleanup handling on error. I will work on
that.

Yeah, I think that would be better.

I have done some refactoring, please look into the latest version.
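
For example, the truncate handling now goes through a helper shaped
roughly like this (sketch):

static void
ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
						   int nrelations, Relation *relations,
						   ReorderBufferChange *change, bool streaming)
{
	if (streaming)
		rb->stream_truncate(rb, txn, nrelations, relations, change);
	else
		rb->apply_truncate(rb, txn, nrelations, relations, change);
}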

One minor comment change suggestion:
/*
+ * start stream or begin the transaction.  If this is the first
+ * change in the current stream.
+ */

We can write the above comment as "Start the stream or begin the
transaction for the first change in the current stream."

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#322Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#315)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have further reviewed v22 and below are my comments:

v22-0005-Implement-streaming-mode-in-ReorderBuffer
--------------------------------------------------------------------------
1.
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)

The above 'Note' is not correct as per the latest implementation.

That is removed in 0010; in the latest version you can see it in 0006.

v22-0006-Add-support-for-streaming-to-built-in-replicatio
----------------------------------------------------------------------------
2.
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
*
*-------------------------------------------------------------------------
*/
-
#include "postgres.h"

Spurious line removal.

Fixed

3.
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+    XLogRecPtr commit_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'c'); /* action STREAM COMMIT */
+
+ Assert(TransactionIdIsValid(txn->xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, txn->xid);

The part of the comment "we're starting to stream, so must be valid"
is not correct, as we are not at the start of the stream here. The
patch has used the same incorrect sentence in a few places; kindly fix
those as well.

I have removed that part of the comment.

4.
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
{
..

For this and other places in the patch, like in function
stream_open_file(), instead of using TopMemoryContext, can we consider
using a new memory context LogicalStreamingContext or something like
that? We can create LogicalStreamingContext under TopMemoryContext. I
don't see any need to use TopMemoryContext here.

But when will we delete/reset the LogicalStreamingContext? We are
planning to keep this memory as long as the worker is alive, so it is
supposed to be in the top memory context. If we create any other
context with the same life span as TopMemoryContext, then what is the
point? Am I missing something?
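
To be concrete, I assume you mean something like this (sketch):

static MemoryContext LogicalStreamingContext = NULL;

/* created lazily, under TopMemoryContext, for the worker's lifetime */
if (LogicalStreamingContext == NULL)
	LogicalStreamingContext = AllocSetContextCreate(TopMemoryContext,
													"LogicalStreamingContext",
													ALLOCSET_DEFAULT_SIZES);

oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
/* allocate the subxact info / stream state here */
MemoryContextSwitchTo(oldctx);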

5.
+static void
+subxact_info_add(TransactionId xid)

This function assumes valid values for global variables like
stream_fd and stream_xid. I think it is better to have Asserts for
those in this function before using them. The Asserts for those are
present in handle_streamed_transaction, but I feel they should be in
subxact_info_add.

Done

6.
+subxact_info_add(TransactionId xid)
/*
+ * In most cases we're checking the same subxact as we've already seen in
+ * the last call, so make ure just ignore it (this change comes later).
+ */
+ if (subxact_last == xid)
+ return;

Typo and minor correction, /ure just/sure to

Done

7.
+subxact_info_write(Oid subid, TransactionId xid)
{
..
+ /*
+ * But we free the memory allocated for subxact info. There might be one
+ * exceptional transaction with many subxacts, and we don't want to keep
+ * the memory allocated forewer.
+ *
+ */

a. Typo, /forewer/forever
b. The extra line at the end of the comment is not required.

Done

8.
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)

Do we really need to have the checksum for temporary files? I have
checked a few other similar cases, like the SharedFileSet stuff for
parallel hash join, but didn't find them using checksums. Can you also
look at other usages of temporary files, and then let us decide if we
see any reason to have checksums for this?

Yeah, I can also see that other places don't use checksums.

Another point is that we don't seem to be doing this for the 'changes'
file; see stream_write_change. So I am not sure there is any sense in
writing a checksum for the subxact file.

I can see there is a comment atop this function:

* XXX The subxact file includes CRC32C of the contents. Maybe we should
* include something like that here too, but doing so will not be as
* straighforward, because we write the file in chunks.

Tomas, do you see any reason for the same?

9.
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+ char tempdirpath[MAXPGPATH];
+
+ TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+ /*
+ * We might need to create the tablespace's tempfile directory, if no
+ * one has yet done so.
+ */
+ if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ tempdirpath)));
+
+ snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+ tempdirpath, subid, xid);
+}

Temporary files created in PGDATA/base/pgsql_tmp follow a certain
naming convention (see docs[1]) which is not followed here. You can
also refer to SharedFileSetPath and OpenTemporaryFile. I think we can
just try to follow that convention and then additionally append subid,
xid and .subxacts. Also, a similar change is required for
changes_filename. I would like to know if there is a reason why we
want to use a different naming convention here?

I have changed it to this: pgsql_tmpPID-subid-xid.subxacts.
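
That is, something like (PG_TEMP_FILE_PREFIX is "pgsql_tmp", from
storage/fd.h):

snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
		 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);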

10.
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_close_file(void)

The comment seems to be wrong. I think this can be only called at
stream end, so it should be "This can only be called at the end of a
"streaming" block, i.e. at stream_stop message from the upstream."

Right, I have fixed it.

11.
+ * the order the transactions are sent in. So streamed trasactions are
+ * handled separately by using schema_sent flag in ReorderBufferTXN.
+ *
* For partitions, 'pubactions' considers not only the table's own
* publications, but also those of all of its ancestors.
*/
typedef struct RelationSyncEntry
{
Oid relid; /* relation oid */
-
+ TransactionId xid; /* transaction that created the record */
/*
* Did we send the schema?  If ancestor relid is set, its schema must also
* have been sent for this to be true.
*/
bool schema_sent;
+ List    *streamed_txns; /* streamed toplevel transactions with this
+ * schema */

The part of the comment "So streamed trasactions are handled
separately by using schema_sent flag in ReorderBufferTXN." doesn't
seem to match what we are doing in the latest version of the patch.

Yeah, it's wrong, I have fixed it.

12.
maybe_send_schema()
{
..
+ if (in_streaming)
+ {
+ /*
+ * TOCHECK: We have to send schema after each catalog change and it may
+ * occur when streaming already started, so we have to track new catalog
+ * changes somehow.
+ */
+ schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
..
..
}

I think it is good to verify/test once what this comment says, but as
per the code we should be sending the schema after each catalog
change, as we invalidate the streamed_txns list in
rel_sync_cache_relation_cb, which must be called during relcache
invalidation. Do we see any problem with that mechanism?

I have tested this; I think we are already sending the schema after
each catalog change.

13.
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * it's subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+    ReorderBufferTXN *txn,
+    XLogRecPtr commit_lsn)

This comment is copied from pgoutput_stream_abort, so it doesn't match
what this function is doing.

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v24.tar (application/x-tar)
[tar contents: v24-0001-Immediately-WAL-log-assignments.patch (not
shown) and v24-0002-Issue-individual-invalidations-with-wal_level-lo.patch,
below]

From ebb6dbc7cd2e8227cd085d62736f95015334c5ad Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v24 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager.
See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulated all the invalidations in memory
and wrote them only once at commit time, which may reduce the
performance impact by amortizing the overhead and deduplicating the
invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c        |  40 +++++++
 src/backend/access/transam/xact.c             |   7 ++
 src/backend/replication/logical/decode.c      |  16 +++
 .../replication/logical/reorderbuffer.c       | 104 +++++++++++++++---
 src/backend/utils/cache/inval.c               |  49 +++++++++
 src/include/access/xact.h                     |  13 ++-
 src/include/replication/reorderbuffer.h       |  11 ++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce75565f..7ab0d11ea9 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3af8e81af1..e576b10055 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX We ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581d0f..69c1f45ef6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9509..b889edf461 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2202,6 +2216,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Queue the invalidation messages as a change in the specified transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2589,6 +2630,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3002,6 +3066,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..cba5b6c64b 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end
+ *	to support decoding of in-progress transactions.  Previously it was
+ *	enough to log invalidations only at commit, because we only decoded a
+ *	transaction at commit time.  We only need to log catalog cache and
+ *	relcache invalidations; there cannot be any active MVCC scan in logical
+ *	decoding, so we need not log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b3816c..b822c5e4b2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..af35287896 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations; /* number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * messages */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void		ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+										 int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.23.0
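
To make the record's life cycle concrete: on the decoding side, each
XLOG_XACT_INVALIDATIONS record is expected to be turned into a queued
REORDER_BUFFER_CHANGE_INVALIDATION change via the new
ReorderBufferAddInvalidation(). A minimal sketch of such a decode-side
handler follows; the function name DecodeInvalidations and its exact call
site in decode.c are assumptions, not part of the hunks above.

/*
 * Hypothetical handler for XLOG_XACT_INVALIDATIONS, sketched from the
 * record layout above; the real function in decode.c may differ.
 */
static void
DecodeInvalidations(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
	XLogReaderState *r = buf->record;
	xl_xact_invalidations *invals;

	invals = (xl_xact_invalidations *) XLogRecGetData(r);

	/* queue the messages so they are replayed at the right point */
	ReorderBufferAddInvalidation(ctx->reorder, XLogRecGetXid(r),
								 buf->origptr,
								 invals->nmsgs, invals->msgs);
}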

Attachment: v24-0003-Extend-the-output-plugin-API-with-stream-methods.patch

From bbd0b76f1ada9c6d5e53aef22f6e96ac53ce4017 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v24 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 213 +++++++++++++
 src/backend/replication/logical/logical.c | 365 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 811 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe620..1b56daa4bb 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and one optional callback
+    (<function>stream_message_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting.  At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may exceed the memory limit before having
+    decoded a complete tuple (e.g. having decoded only the TOAST-table insert
+    but not yet the main-table insert).
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index dc69e5ce5f..0cff1ac393 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the change/commit/abort/start/stop
+	 * callbacks; the message and truncate callbacks are optional, similar to
+	 * regular output plugins. We consider streaming supported when at least
+	 * one of the methods is defined, so that missing methods can be detected.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,321 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287896..65814af9f5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
 struct ReorderBuffer
 {
 	/*
@@ -392,6 +440,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0
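
From the plugin author's point of view, the registration burden is small. A
minimal sketch of a plugin advertising streaming support; the my_* callback
implementations are placeholders (test_decoding above is the authoritative
example):

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	cb->begin_cb = my_begin;
	cb->change_cb = my_change;
	cb->commit_cb = my_commit;

	/* required for streaming: start/stop/change/commit/abort */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_change_cb = my_stream_change;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_abort_cb = my_stream_abort;

	/* optional */
	cb->stream_message_cb = my_stream_message;
	cb->stream_truncate_cb = my_stream_truncate;
}

Leaving all seven stream_*_cb fields NULL keeps ctx->streaming false, so the
plugin behaves exactly as before; defining only some of the required ones
results in an ERROR from the wrappers, as implemented above.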

Attachment: v24-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch

From 8f9b9d1569dfacf20b205c893f98379c9f854662 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v24 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from the system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of this sqlerrcode, the
decoding logic aborts the ongoing decoding of that transaction and
returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 1b56daa4bb..5f7394f3c1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,9 +432,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94eb37d48d..2d77107c4f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * the tableam-level API, but this function is called from many places,
+	 * so we need to ensure it here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that a system
+	 * table scan is in progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted. We can't directly use
+ * TransactionIdDidAbort, because after a crash such a transaction might not
+ * have been marked as aborted.  See detailed comments at snapmgr.c where the
+ * variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..2f52b407c6 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..9f1ecd123f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8c34935c34..9d890d3c4b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0
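
The guards above only fire when the scan is not a registered system-catalog
scan (bsysscan).  For the other half of the mechanism -- detecting that the
transaction being decoded has aborted under us -- the series checks the
status of CheckXidAlive during catalog scans and reports a dedicated error.
A minimal sketch of that check (the function name here is made up for
illustration; only CheckXidAlive and the transam/ereport calls are real):

static void
sketch_check_concurrent_abort(void)
{
	/*
	 * If we are decoding an in-progress transaction and its xid is neither
	 * running nor committed anymore, it must have aborted concurrently.
	 * Report the sqlerrcode that the reorder buffer's error handler treats
	 * as "stop streaming and discard", rather than as a hard failure.
	 */
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}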

v24-0005-Implement-streaming-mode-in-ReorderBuffer.patch:

From d3e05a511d3db8aa849ca385c145c7a676ac77d4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 May 2020 19:56:35 +0530
Subject: [PATCH v24 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes
we have in memory and invoke new stream API methods. This happens
in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().  However, if we have an incomplete toast or
speculative insert, we sometimes spill to disk because we cannot
generate and stream the complete tuple.  As soon as we get the
complete tuple, we stream the transaction including the serialized
changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
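
As an illustration of that last point, a plugin-side stream_abort callback
might look roughly like this (a sketch only: the discard_* helpers stand in
for whatever per-xid buffering scheme the downstream uses, and are not part
of this patch):

static void
sketch_stream_abort(ReorderBuffer *rb, ReorderBufferTXN *txn,
					XLogRecPtr abort_lsn)
{
	/*
	 * txn may be a subtransaction; the new toptxn pointer tells us which
	 * toplevel transaction its buffered changes belong to.
	 */
	if (txn->toptxn == NULL)
		discard_xact(txn->xid);			/* whole transaction aborted */
	else
		discard_subxact(txn->toptxn->xid, txn->xid);	/* just this subxact */
}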
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 758 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  31 +
 3 files changed, 750 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet, so the cmin is
+		 * definitely in the future and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet, so the cmax is
+		 * definitely in the future and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b889edf461..2cdfb348af 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +382,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -772,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -865,6 +911,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -1024,6 +1073,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1090,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1315,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1340,6 +1404,80 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1491,57 +1629,171 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has catalog updates, we might decode the tuple using the
+ * wrong catalog version.  So to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction the current change
+ * belongs to.  During a catalog scan we check the status of that xid, and
+ * if it is aborted we report a specific error, so that we can stop
+ * streaming the current transaction and discard the changes streamed so
+ * far.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine: when we decode the abort we
+ * will send a stream abort message to truncate the changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether the xid aborted; that will happen during catalog access.  Also,
+	 * reset the bsysscan flag.
+	 */
+	if (!TransactionIdDidCommit(xid))
+	{
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferProcessTXN for applying a change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse them while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.
+ */
+static void
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   Snapshot snapshot_now,
+								   CommandId command_id,
+								   XLogRecPtr last_lsn,
+								   ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+	ReorderBufferToastReset(rb, txn);
+	if (specinsert != NULL)
+		ReorderBufferReturnChange(rb, specinsert);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin.  If streaming is true, the data is sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool	stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1564,21 +1816,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
-
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Start stream or begin transaction for the first change in the
+			 * current stream.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+				else
+					rb->begin(rb, txn);
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1655,7 +1932,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1695,7 +1973,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +2031,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +2043,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,7 +2074,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1858,14 +2135,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			rb->stream_stop(rb, txn, prev_lsn);
+			stream_started = false;
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2179,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,17 +2214,118 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If the error code is ERRCODE_TRANSACTION_ROLLBACK, that means we
+		 * have detected a concurrent abort of the (sub)transaction we are
+		 * streaming.  So just do the cleanup and return gracefully.
+		 * Otherwise, re-throw the error.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * We can get this error only in streaming mode, because that is
+			 * the only mode in which we send in-progress transactions.
+			 */
+			Assert(streaming);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/*
+			 * In the TRY block we stop the stream only after we have sent
+			 * all the changes.  So if we have detected a concurrent abort,
+			 * the stream cannot have been stopped yet.
+			 */
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
 
-		PG_RE_THROW();
+			/* Handle the concurrent abort. */
+			ReorderBufferHandleConcurrentAbort(rb, txn, snapshot_now,
+											   command_id, prev_lsn,
+											   specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * and non-streamed transactions.  We iterate over the toplevel transaction
+ * and its subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1946,6 +2350,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2426,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2568,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2586,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2598,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2648,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2733,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2398,6 +2843,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (when streaming, we don't update the
+ * memory accounting for subtransactions, so their size is always 0). But we
+ * can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
@@ -2418,15 +2895,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3254,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check whether the transaction has any changes left to
+ * stream (it might have been streamed right before the commit, which then
+ * attempts to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because we might have
+		 * gotten some new subtransactions after the last streaming run. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3864,6 +4468,16 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples whose CID we
+	 * have not decoded yet.  Think e.g. about an INSERT followed by a
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases we assume the CID is from a future command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 65814af9f5..b3e2b3f64b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -224,6 +243,11 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -254,6 +278,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0
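
To recap what the downstream now sees: a large in-progress transaction
arrives as one or more stream_start ... stream_stop chunks, eventually
terminated by a stream commit or abort.  Conceptually the apply side runs a
loop like the following (a sketch only; the message kinds and helpers are
illustrative, not the actual pgoutput protocol):

for (;;)
{
	StreamMessage *msg = next_message();	/* hypothetical reader */

	switch (msg->kind)
	{
		case MSG_STREAM_START:	/* open (or reopen) buffer for msg->xid */
			open_buffer(msg->xid);
			break;
		case MSG_STREAM_CHANGE:	/* buffer the change, do not apply yet */
			buffer_change(msg);
			break;
		case MSG_STREAM_STOP:	/* chunk done, more chunks may follow */
			close_buffer(msg->xid);
			break;
		case MSG_STREAM_COMMIT:	/* replay everything buffered, in order */
			apply_buffered(msg->xid);
			break;
		case MSG_STREAM_ABORT:	/* drop the buffered (sub)xact changes */
			discard_buffered(msg->xid, msg->subxid);
			break;
	}
}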

v24-0006-Bugfix-handling-of-incomplete-toast-spec-insert-.patch:

From 25ceb8fbf1534abb3bec472a742b22216d32de7c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Tue, 19 May 2020 18:55:23 +0530
Subject: [PATCH v24 06/12] Bugfix handling of incomplete toast/spec insert
 tuple

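In short: toast-table and speculative inserts are incomplete on their own,
so a stream must not end with one.  The SQL and the decoded sequence below
are only an illustration, not actual decoder output.

/*
 * Why a toast insert cannot be streamed on its own: a single
 *
 *     UPDATE t SET big_col = <large value> WHERE id = 1;
 *
 * decodes to a sequence of changes along the lines of
 *
 *     INSERT (toast chunk 1)    -- incomplete, main tuple not seen yet
 *     INSERT (toast chunk 2)    -- incomplete
 *     UPDATE (main table)       -- completes the tuple
 *
 * so streaming may only stop after the main-table change.  The patch
 * tracks that boundary as last_complete_lsn, plus complete_size, the
 * transaction size up to it.
 */
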
---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 324 ++++++++++++------
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  39 ++-
 5 files changed, 277 insertions(+), 107 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2d77107c4f..3927448f46 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45ef6..c841687c66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2cdfb348af..fe2d0011c4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -254,6 +269,8 @@ static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
 static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static inline void ReorderBufferTXNDeleteChange(ReorderBufferTXN *txn,
+												ReorderBufferChange *change);
 
 /* ---------------------------------------
  * toast reassembly support
@@ -646,12 +663,71 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 	return txn;
 }
 
+/*
+ * Handle incomplete tuples during streaming.  If streaming is enabled, we
+ * might need to stream an in-progress transaction.  The problem is that we
+ * sometimes get incomplete changes which we cannot stream until the
+ * completing change arrives, e.g. a toast table insert without the main
+ * table insert.  So this function remembers the lsn of the last complete
+ * change, and the transaction size up to that lsn, so that when we need to
+ * stream we stream only up to the last complete lsn.
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert)
+{
+	/* If streaming is not enabled, there is nothing to do. */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		txn = txn->toptxn;
+
+	/*
+	 * If this is the first incomplete change, remember the current size as
+	 * the size of the complete part of the transaction.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(txn)) &&
+		(toast_insert || IsSpecInsert(change->action)))
+		txn->complete_size = txn->total_size;
+
+	/*
+	 * If this is a toast insert, set the corresponding bit.  Basically,
+	 * both updates and inserts insert into the toast table.  And as
+	 * explained in the function header, we cannot stream toast changes
+	 * alone.  So whenever we get a toast insert we set the flag, and we
+	 * clear it when we get the next insert or update on the main table.
+	 */
+	if (toast_insert)
+		txn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) && IsInsertOrUpdate(change->action))
+		txn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get a speculative insert, to
+	 * indicate a partial tuple, and clear it on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		txn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		txn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If we don't have any incomplete change after this one, record this
+	 * LSN as the last complete lsn.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(txn)))
+		txn->last_complete_lsn = change->lsn;
+}
+
 /*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
@@ -660,6 +736,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	change->lsn = lsn;
 	change->txn = txn;
 
+	/* Handle the incomplete tuple if it's a toast/spec insert */
+	ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert);
+
 	Assert(InvalidXLogRecPtr != lsn);
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries++;
@@ -697,7 +776,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1412,6 +1491,30 @@ static void
 ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
 	dlist_mutable_iter iter;
+	ReorderBufferTXN *toptxn;
+
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1438,30 +1541,28 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
-		/* remove the change from its containing list */
-		dlist_delete(&change->node);
+		/* We have truncated up to the last complete lsn, so stop. */
+		if (rbtxn_has_incomplete_tuple(toptxn) &&
+			(change->lsn > toptxn->last_complete_lsn))
+		{
+			/*
+			 * If this is a top transaction, we can reset last_complete_lsn
+			 * and complete_size, because by now we will have streamed all
+			 * the changes up to last_complete_lsn.
+			 */
+			if (txn->toptxn == NULL)
+			{
+				toptxn->last_complete_lsn = InvalidXLogRecPtr;
+				toptxn->complete_size = 0;
+			}
+			break;
+		}
 
+		/* remove the change from its containing list */
+		ReorderBufferTXNDeleteChange(txn, change);
 		ReorderBufferReturnChange(rb, change);
 	}
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The toplevel transaction, identified by (toptxn==NULL), is marked
-	 * as streamed always, even if it does not contain any changes (that
-	 * is, when all the changes are in subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts
-	 * for XIDs the downstream is not aware of. And of course, it always
-	 * knows about the toplevel xact (we send the XID in all messages),
-	 * but we never stream XIDs of empty subxacts.
-	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
 	 * any memory. We could also keep the hash table and update it with
@@ -1473,9 +1574,15 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
-	/* also reset the number of entries in the transaction */
-	txn->nentries_mem = 0;
-	txn->nentries = 0;
+	/*
+	 * If this txn is serialized and there are no more entries on disk,
+	 * clean up the disk space.
+	 */
+	if (rbtxn_is_serialized(txn) && (txn->nentries == txn->nentries_mem))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
 }
 
 /*
@@ -1732,6 +1839,20 @@ ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					change->data.msg.message);
 }
 
+/*
+ * While streaming a transaction, incomplete tuples mean we cannot always
+ * stream all the changes.  So whenever we delete a change from the
+ * change list, we need to update the entry counts.
+ */
+static inline void
+ReorderBufferTXNDeleteChange(ReorderBufferTXN *txn, ReorderBufferChange *change)
+{
+	/* Delete the node and decrement the nentries_mem and nentries count. */
+	dlist_delete(&change->node);
+	change->txn->nentries_mem--;
+	change->txn->nentries--;
+}
+
 /*
  * Function to store the command id and snapshot at the end of the current
 * stream so that we can reuse them while sending the next stream.
@@ -1955,8 +2076,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						Assert(change->data.tp.newtuple != NULL);
-
-						dlist_delete(&change->node);
+						ReorderBufferTXNDeleteChange(change->txn, change);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
 					}
@@ -2002,8 +2122,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						specinsert = NULL;
 					}
 
-					/* and memorize the pending insertion */
-					dlist_delete(&change->node);
+					/*
+					 * Remove from the change list and memorize the pending
+					 * insertion.
+					 */
+					ReorderBufferTXNDeleteChange(change->txn, change);
 					specinsert = change;
 					break;
 
@@ -2118,6 +2241,15 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
 			}
+
+			/*
+			 * If the transaction contains an incomplete tuple and this is
+			 * the last complete change, stop further processing of the
+			 * transaction.
+			 */
+			if (rbtxn_has_incomplete_tuple(txn) &&
+				prev_lsn == txn->last_complete_lsn)
+				break;
 		}
 
 		/*
@@ -2515,7 +2647,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2564,7 +2696,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2587,6 +2719,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2601,8 +2734,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2610,12 +2748,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2676,7 +2822,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2860,18 +3006,29 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+	Size		largest_size = 0;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
+		Size	size = 0;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/*
+		 * If this transaction has some incomplete changes, only consider
+		 * the size up to the last complete lsn.
+		 */
+		if (rbtxn_has_incomplete_tuple(txn))
+			size = txn->complete_size;
+		else
+			size = txn->total_size;
+
+		/* If the current transaction is larger, remember it. */
+		if ((largest == NULL || size > largest_size) && (size > 0))
+		{
 			largest = txn;
+			largest_size = size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2889,66 +3045,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
-	{
-		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
-		 */
-		txn = ReorderBufferLargestTopTXN(rb);
-
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
-
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
+	/* Loop until we reach under the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* found a streamable toplevel transaction to evict */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferSerializeTXN(rb, txn);
-	}
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
+			ReorderBufferSerializeTXN(rb, txn);
+			/*
+			 * After eviction, the transaction should have no entries in memory, and
+			 * should use 0 bytes for changes.
+			 */
+			Assert(txn->size == 0);
+			Assert(txn->nentries_mem == 0);
+		}
+	}
 
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
@@ -3344,10 +3480,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
-
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
 }
 
 /*
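
To make the control flow of the new eviction loop easier to follow, here is
a toy, self-contained model of the policy (illustrative only, not the patch
code; "can_stream" stands in for ReorderBufferCanStream(), and the real code
additionally falls back to spilling when no streamable toplevel transaction
is found):

#include <stdbool.h>
#include <stdio.h>

#define NTXNS 4

int
main(void)
{
	size_t		txn_size[NTXNS] = {700, 300, 900, 200};	/* bytes in memory */
	size_t		total = 2100;
	const size_t limit = 1000;	/* logical_decoding_work_mem, in bytes */
	bool		can_stream = true;

	while (total >= limit)
	{
		int			best = 0;

		/* pick the largest transaction */
		for (int i = 1; i < NTXNS; i++)
			if (txn_size[i] > txn_size[best])
				best = i;

		/* streaming and spilling both release the in-memory changes */
		printf("%s txn %d (%zu bytes)\n",
			   can_stream ? "stream" : "spill", best, txn_size[best]);
		total -= txn_size[best];
		txn_size[best] = 0;
	}

	printf("now at %zu bytes, below the %zu byte limit\n", total, limit);
	return 0;
}
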
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
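
The new XLH_INSERT_ON_TOAST_RELATION bit is what lets the decoding side mark
a change as incomplete. A hedged sketch of the intended consumption (the
actual decode.c change lives elsewhere in this series, so the call site
below is an assumption):

#include <stdbool.h>
#include <stdint.h>

#define XLH_INSERT_ON_TOAST_RELATION	(1 << 4)

/* derive the new "incomplete data" flag from the WAL insert flags */
static inline bool
insert_is_toast(uint8_t xl_flags)
{
	return (xl_flags & XLH_INSERT_ON_TOAST_RELATION) != 0;
}

/*
 * Hypothetical call site, mirroring the extended signature:
 *
 *     ReorderBufferQueueChange(rb, xid, lsn, change,
 *                              insert_is_toast(xlrec->flags));
 */
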
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b3e2b3f64b..a9b1aacdb1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * This transaction's changes have a toast insert, without the main table
+ * insert.
+ */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +221,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -350,6 +368,15 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of the top transaction including sub-transactions. */
+	Size		total_size;
+
+	/* Size of the complete changes. */
+	Size		complete_size;
+
+	/* LSN of the last complete change. */
+	XLogRecPtr	last_complete_lsn;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -537,7 +564,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool incomplete_data);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0
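
Putting the pieces of this patch together, here is a toy, self-contained
model of the incomplete-change bookkeeping (an assumed simplification of
what ReorderBufferQueueChange does with the new flag, not the exact patch
code): toast-relation inserts mark the transaction incomplete, and the next
main-table insert completes it, at which point complete_size and
last_complete_lsn catch up with the totals.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct Txn
{
	bool		has_toast_insert;
	uint64_t	total_size;
	uint64_t	complete_size;
	uint64_t	last_complete_lsn;
} Txn;

static void
queue_change(Txn *txn, uint64_t lsn, uint64_t sz, bool toast_insert)
{
	txn->total_size += sz;

	if (toast_insert)
		txn->has_toast_insert = true;	/* incomplete from here on */
	else
	{
		txn->has_toast_insert = false;	/* main-table change completes it */
		txn->complete_size = txn->total_size;
		txn->last_complete_lsn = lsn;
	}
}

int
main(void)
{
	Txn			txn = {0};

	queue_change(&txn, 100, 50, false); /* ordinary change */
	queue_change(&txn, 110, 80, true);	/* toast chunk 1 */
	queue_change(&txn, 120, 80, true);	/* toast chunk 2 */

	/* streaming now must stop at LSN 100 and count only 50 bytes */
	printf("total=%llu complete=%llu last_complete_lsn=%llu\n",
		   (unsigned long long) txn.total_size,
		   (unsigned long long) txn.complete_size,
		   (unsigned long long) txn.last_complete_lsn);
	return 0;
}

This is exactly the state that ReorderBufferLargestTopTXN consults via
complete_size, and that the last_complete_lsn check in the processing loop
uses to stop streaming before the incomplete tail.
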

From fa13067af5b05db4d8932f071f3dccab35e602e1 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Tue, 19 May 2020 19:08:16 +0530
Subject: [PATCH v24 07/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 33 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 12 +++++++
 src/backend/replication/walsender.c           | 32 +++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 49d4bb13b9..0fc896ca7e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2453,6 +2453,39 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Amount of decoded transaction data spilled to disk.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_txns</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of in-progress transactions streamed to the subscriber after
+       the memory used by logical decoding exceeds
+       <literal>logical_decoding_work_mem</literal>.  Streaming only works
+       with toplevel transactions (subtransactions can't be streamed
+       independently), so the counter is not incremented for
+       subtransactions.
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times in-progress transactions were streamed to the
+       subscriber.  Transactions may get streamed repeatedly, and this
+       counter is incremented on every such invocation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_bytes</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Amount of decoded in-progress transaction data streamed to the
+       subscriber.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 56420bbc9d..9f509fbc21 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fe2d0011c4..5c211d0c70 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3475,6 +3479,14 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferFreeSnap(rb, txn->snapshot_now);
 	}
 
+	/* Update the stream statistics. */
+	rb->streamCount += 1;
+	rb->streamBytes += (rbtxn_has_incomplete_tuple(txn)) ?
+						txn->complete_size : txn->total_size;
+
+	/* Don't double-count a transaction that was already streamed before. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	/*
 	 * Access the main routine to decode the changes and send to output plugin.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 86847cbb54..adb7d7962e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1353,7 +1353,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1374,7 +1374,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that were spilled to disk or
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2423,6 +2424,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3258,7 +3262,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3316,6 +3320,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3341,6 +3348,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3443,6 +3453,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3691,11 +3706,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 61f2c2f5b4..7869f721da 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a9b1aacdb1..1ced4caaae 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -541,15 +541,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b813e32215..cf22f8a038 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0

From 059bdcce70378c507d1b79f6df34d4c2b5dc7ff2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 14 May 2020 21:27:46 +0530
Subject: [PATCH v24 08/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so there
is nowhere to send the data anyway.
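
For intuition, the wire traffic for one streamed transaction then looks
roughly like this (a schematic trace using the message letters defined in
proto.c below; the exact interleaving depends on when the memory limit is
hit on the publisher):

    S (stream start, first_segment=1)  R I I U ...  E (stream stop)
    S (stream start, first_segment=0)  I D ...      E (stream stop)
    c (stream commit)    -- or A (stream abort) for an aborted (sub)xact
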
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   12 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1046 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 ++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 20 files changed, 2054 insertions(+), 40 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8beadbaa..95b7c24ef9 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..3349cc4bfc 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026187..9065a1be1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d7f99d9944..e843d1e658 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4139,6 +4139,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
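
A subscriber with the option enabled therefore ends up sending a command
along these lines (illustrative; the slot name, protocol version and
publication list are placeholders):

    START_REPLICATION SLOT "mysub" LOGICAL 0/0
        (proto_version '...', streaming 'on', publication_names '"mypub"')
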
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..83d0642cf3 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
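
To make the framing concrete, here is a toy round-trip of the STREAM START
message ('S', 4-byte XID in network byte order, 1-byte first_segment flag),
written as standalone C rather than with the pqformat API:

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static size_t
write_stream_start(uint8_t *buf, uint32_t xid, bool first_segment)
{
	size_t		off = 0;

	buf[off++] = 'S';					/* action STREAM START */
	for (int i = 3; i >= 0; i--)		/* xid, network byte order */
		buf[off++] = (uint8_t) (xid >> (8 * i));
	buf[off++] = first_segment ? 1 : 0;
	return off;
}

static uint32_t
read_stream_start(const uint8_t *buf, bool *first_segment)
{
	uint32_t	xid = 0;

	assert(buf[0] == 'S');
	for (int i = 1; i <= 4; i++)
		xid = (xid << 8) | buf[i];
	*first_segment = (buf[5] == 1);
	return xid;
}

int
main(void)
{
	uint8_t		buf[6];
	bool		first;

	write_stream_start(buf, 12345, true);
	assert(read_stream_start(buf, &first) == 12345 && first);
	return 0;
}
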
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..e4e52f10f8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * also has to deal with aborts of both the toplevel transaction and of
+ * subtransactions. This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;						/* XID of the subxact */
+	off_t           offset;					/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +657,329 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the serialized subxact info.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		/*
+		 * Pass missing_ok as true so that if we haven't received any changes
+		 * for the top transaction (an empty transaction), we don't raise an
+		 * error.
+		 */
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, true);
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		int			fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction, then we will not find the subxid
+		 * here, so just free the memory and return.
+		 */
+		if (!found)
+		{
+			/* Free the subxacts memory */
+			if (subxacts)
+				pfree(subxacts);
+
+			subxacts = NULL;
+			subxact_last = InvalidTransactionId;
+			nsubxacts = 0;
+			nsubxacts_max = 0;
+
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
+		if (fd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+		}
+
+		/* OK, truncate the file at the right offset. */
+		if (ftruncate(fd, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+		CloseTransientFile(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+					 errmsg("could not read file \"%s\": %m", path)));
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +993,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1011,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1050,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1168,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1313,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1686,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1827,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1939,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1971,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2422,570 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main
+ * file. The file is always overwritten as a whole, and we also include a
+ * CRC32C checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd >= 0);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're processing a change for the same subxact as in the
+	 * previous call, so we can simply skip the search in that case.
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (subxacts == NULL)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: length (not including the
+ * length field itself), action code (identifying the message type), and
+ * message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3151,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
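
For reference, stream_write_change() above produces records of the form
<int32 length | action byte | payload>, where the stored length covers the
action byte plus the payload. A minimal sketch of the matching reader (the
name is made up; it assumes the worker.c context, and the patch applies the
spooled changes at stream commit by reading records back along these lines):

    static bool
    stream_read_change(int fd, char *action, StringInfo s)
    {
        int         len;

        /* the stored length covers the action byte plus the payload */
        if (read(fd, &len, sizeof(len)) != sizeof(len))
            return false;       /* EOF - no more changes */

        if (read(fd, action, sizeof(char)) != sizeof(char))
            return false;

        /* read the payload into the caller-provided buffer */
        resetStringInfo(s);
        enlargeStringInfo(s, len);

        if (read(fd, s->data, len - 1) != len - 1)
            return false;

        s->len = len - 1;
        s->data[s->len] = '\0';

        return true;
    }
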
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3118..1509f9b826 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order the transactions are sent in.  Also, the (sub) transactions
+ * might get aborted, so we need to send the schema for each (sub) transaction
+ * so that we don't lose the schema information on abort.  For handling this,
+ * we maintain the list of xids (streamed_txns) for those we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient protocol version, and when the
+		 * output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +373,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied only later (and the regular
+	 * transactions won't see their effects until then), and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +423,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +465,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +486,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +518,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +562,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +582,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +607,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +639,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +719,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +840,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions, so a
+ * simple linear search of the list is fine.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid in the rel sync entry for which we have already sent the schema
+ * of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -753,11 +1002,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
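
To illustrate the extended output plugin API, here is a minimal sketch of a
third-party plugin opting into streaming (the my_* handlers are placeholders
for the plugin's own callbacks, not part of the patch):

    void
    _PG_output_plugin_init(OutputPluginCallbacks *cb)
    {
        cb->startup_cb = my_startup;
        cb->begin_cb = my_begin;
        cb->change_cb = my_change;
        cb->commit_cb = my_commit;
        cb->shutdown_cb = my_shutdown;

        /* new with this patch: streaming of large in-progress transactions */
        cb->stream_start_cb = my_stream_start;
        cb->stream_stop_cb = my_stream_stop;
        cb->stream_change_cb = my_change;   /* may reuse the regular handler */
        cb->stream_truncate_cb = my_truncate;
        cb->stream_abort_cb = my_stream_abort;
        cb->stream_commit_cb = my_stream_commit;
    }

A plugin that leaves the stream_* callbacks unset keeps working unchanged;
streaming is only enabled when the client requests it and the plugin supports
it, which is what the ctx->streaming checks in pgoutput_startup enforce.
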
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6fed3cfd23..e1344ab4cc 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -157,6 +157,12 @@ create_logical_replication_slot(char *name, char *plugin,
 											   .segment_close = wal_segment_close),
 									NULL, NULL, NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index adb7d7962e..9731b86d1f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1020,6 +1020,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..899d7e2013 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -980,7 +980,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
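
To sketch how the apply side consumes these messages (the function name and
the 'S'/'E'/'c'/'A' action bytes below are illustrative assumptions, not the
patch's actual definitions - the real dispatch lives in the worker patch):

    static void
    handle_streamed_message(char action, StringInfo s)
    {
        switch (action)
        {
            case 'S':           /* stream start (assumed byte value) */
                {
                    bool        first_segment;

                    stream_xid = logicalrep_read_stream_start(s, &first_segment);
                    in_streamed_transaction = true;
                    stream_open_file(MyLogicalRepWorker->subid, stream_xid,
                                     first_segment);
                    break;
                }

            case 'E':           /* stream stop (assumed byte value) */
                stream_close_file();
                in_streamed_transaction = false;
                break;

            case 'c':           /* stream commit (assumed byte value) */
                {
                    LogicalRepCommitData commit_data;
                    TransactionId xid;

                    xid = logicalrep_read_stream_commit(s, &commit_data);
                    /* replay the spooled changes file for xid, then clean up */
                    break;
                }

            case 'A':           /* stream abort (assumed byte value) */
                {
                    TransactionId xid;
                    TransactionId subxid;

                    logicalrep_read_stream_abort(s, &xid, &subxid);
                    if (xid == subxid)  /* toplevel abort: discard the files */
                        stream_cleanup_files(MyLogicalRepWorker->subid, xid,
                                             false);
                    break;
                }
        }
    }
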
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ac1acbb27e..95132062c6 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction containing DDL, subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

From 7889a00f341645ebfe3ff6e767031ad5461fc838 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v24 09/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318fc7c..6f7bedc130 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f871a..4ba80869b9 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

From 2901b64dfa1bf8f23bbec43f8537f2f67fe1c1af Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v24 10/12] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

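(Likewise, the counts expected above can be worked out from the script: ROLLBACK TO s10 discards rows 4001-5000 and the intervening DDL, so the committed rows are 1-4000 plus 5001-8000 = 7000; column c is missing only from the 1000 rows inserted before the first ALTER, so count(c) = 6000; d arrives in the tuples for rows 2001-4000, 5001-6000 and 7001-8000 = 4000; e for rows 3001-4000 and 5001-8000 = 4000.)
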
From 1b39aa9c0729326ca5b0539b45774eea50647c7b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v24 11/12] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 doc/src/sgml/test-decoding.sgml               | 22 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..eed6e9d134 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 9f509fbc21..5fe6f28ba2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1243,6 +1243,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e848..70c28ffa91 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes then disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7869f721da..875e0bef28 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10115,6 +10115,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0

From c1c1e920db284580106e270053dc73fb5fe788c7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v24 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index a4e949c636..debb52af49 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb3a9..af64270c55 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char	   *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0

From d3e05a511d3db8aa849ca385c145c7a676ac77d4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 May 2020 19:56:35 +0530
Subject: [PATCH v24 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we
have in memory and invoke the new stream API methods. This happens in
ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().  However, if we have an incomplete toast chunk
or speculative insert, we spill to disk because we cannot generate the
complete tuple and stream it.  As soon as we get the complete tuple,
we stream the transaction including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

It also adds a ReorderBufferTXN pointer in two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
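
To make the last point concrete, here is a minimal sketch (an
illustration under assumed names, not part of the patch) of how an
output plugin that buffers streamed changes per XID might use these
pointers in its stream-abort callback. The buffer_discard_* helpers
are hypothetical; only txn->xid and txn->toptxn come from this series:

#include "postgres.h"

#include "replication/reorderbuffer.h"

/* Hypothetical helpers for the plugin's local per-XID change buffer. */
extern void buffer_discard_toplevel(TransactionId xid);
extern void buffer_discard_subxact(TransactionId top_xid,
								   TransactionId sub_xid);

/*
 * Sketch of a stream-abort callback.  For a subtransaction abort we
 * follow txn->toptxn to the owning toplevel transaction and discard
 * only the changes streamed for the aborted subxact; for a toplevel
 * abort we drop everything buffered for that transaction.
 */
static void
my_stream_abort(ReorderBuffer *rb, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	if (txn->toptxn == NULL)
		buffer_discard_toplevel(txn->xid);	/* toplevel abort */
	else
		buffer_discard_subxact(txn->toptxn->xid, txn->xid); /* subxact */
}

Note that the reorder buffer only invokes stream_abort for
(sub)transactions it has actually streamed, so the plugin never sees
aborts for XIDs the downstream is not aware of.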
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 758 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  31 +
 3 files changed, 750 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b889edf461..2cdfb348af 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +382,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -772,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -865,6 +911,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -1024,6 +1073,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1090,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1315,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Clean up the snapshot from the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1340,6 +1404,80 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn == NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1491,57 +1629,171 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that
+ * the (sub)transaction might get aborted concurrently.  In such a case, if
+ * the (sub)transaction has catalog updates, we might decode tuples using the
+ * wrong catalog version.  So to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the current (sub)transaction this change
+ * belongs to.  During catalog scans we can then check the status of that
+ * xid, and if it has aborted we report a specific error so that we can stop
+ * streaming the current transaction and discard the changes already
+ * streamed.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine: when we decode the abort, we
+ * will send a stream-abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive then
+	 * there is nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether the xid aborted; that will happen during catalog access.  Also,
+	 * reset the bsysscan flag.
+	 */
+	if (!TransactionIdDidCommit(xid))
+	{
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse them while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.
+ */
+static void
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   Snapshot snapshot_now,
+								   CommandId command_id,
+								   XLogRecPtr last_lsn,
+								   ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+	ReorderBufferToastReset(rb, txn);
+	if (specinsert != NULL)
+		ReorderBufferReturnChange(rb, specinsert);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true then data will be sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool	stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1564,21 +1816,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
-
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Start stream or begin transaction for the first change in the
+			 * current stream.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+				else
+					rb->begin(rb, txn);
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1655,7 +1932,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1695,7 +1973,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +2031,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +2043,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,7 +2074,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1858,14 +2135,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			rb->stream_stop(rb, txn, prev_lsn);
+			stream_started = false;
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if transaction is streaming
+		 * otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2179,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,17 +2214,118 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If the error code is ERRCODE_TRANSACTION_ROLLBACK, that means we
+		 * have detected a concurrent abort of the (sub)transaction we are
+		 * streaming.  So just do the cleanup and return gracefully.
+		 * Otherwise, re-throw the error.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * We can get this error only in streaming mode, because only in
+			 * streaming mode do we send an in-progress transaction.
+			 */
+			Assert(streaming);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/*
+			 * In the PG_TRY block we only stop the stream after we have sent
+			 * all the changes.  So if we have detected the concurrent abort,
+			 * the stream should not have been stopped yet.
+			 */
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
 
-		PG_RE_THROW();
+			/* Handle the concurrent abort. */
+			ReorderBufferHandleConcurrentAbort(rb, txn, snapshot_now,
+											   command_id, prev_lsn,
+											   specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.  We iterate over the top and
+ * subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send to output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1946,6 +2350,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2426,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2568,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't track subtransactions separately, as we
+ * can't stream them individually anyway, and we only ever pick toplevel
+ * transactions for eviction.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2586,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2598,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2648,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2733,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2398,6 +2843,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (since with streaming we don't account
+ * memory to subtransactions, their size is always 0). Here we can simply
+ * iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
@@ -2418,15 +2895,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3254,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check whether the transaction has any changes left to
+ * stream (it may have been streamed right before the commit, and the
+ * commit would then attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there's no need to
+		 * walk through subxacts again). In fact, we must not do that, as we
+		 * may be using a snapshot from half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because new
+		 * sub-transactions may have appeared since the last streaming run,
+		 * and we need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3864,6 +4468,16 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples whose CID was
+	 * set by a command not decoded yet.  Think e.g. of INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases we assume the CID is from a future command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 65814af9f5..b3e2b3f64b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -224,6 +243,11 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Toplevel transaction for this subxact (NULL for toplevel transactions).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -254,6 +278,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0

From 1b39aa9c0729326ca5b0539b45774eea50647c7b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v24 11/12] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 doc/src/sgml/test-decoding.sgml               | 22 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..eed6e9d134 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 9f509fbc21..5fe6f28ba2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1243,6 +1243,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e848..70c28ffa91 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes, disable streaming. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7869f721da..875e0bef28 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10115,6 +10115,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0

From fa13067af5b05db4d8932f071f3dccab35e602e1 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Tue, 19 May 2020 19:08:16 +0530
Subject: [PATCH v24 07/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 33 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 12 +++++++
 src/backend/replication/walsender.c           | 32 +++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 49d4bb13b9..0fc896ca7e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2453,6 +2453,39 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Amount of decoded transaction data spilled to disk.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_txns</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of in-progress transactions streamed to the subscriber after the
+       memory used by logical decoding exceeds
+       <literal>logical_decoding_work_mem</literal>.  Streaming only works with
+       toplevel transactions (subtransactions can't be streamed independently),
+       so the counter does not get incremented for subtransactions.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times in-progress transactions were streamed to the
+       subscriber.  Transactions may get streamed repeatedly, and this counter
+       gets incremented on every such invocation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_bytes</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Amount of decoded in-progress transaction data streamed to the subscriber.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 56420bbc9d..9f509fbc21 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fe2d0011c4..5c211d0c70 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3475,6 +3479,14 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferFreeSnap(rb, txn->snapshot_now);
 	}
 
+	/* Update the stream statistics. */
+	rb->streamCount += 1;
+	rb->streamBytes += (rbtxn_has_incomplete_tuple(txn)) ?
+						txn->complete_size : txn->total_size;
+
+	/* Don't count the transaction again if it was already streamed before. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	/*
 	 * Access the main routine to decode the changes and send to output plugin.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 86847cbb54..adb7d7962e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1353,7 +1353,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1374,7 +1374,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that were spilled to disk or
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2423,6 +2424,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3258,7 +3262,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3316,6 +3320,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3341,6 +3348,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3443,6 +3453,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3691,11 +3706,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 61f2c2f5b4..7869f721da 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a9b1aacdb1..1ced4caaae 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -541,15 +541,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b813e32215..cf22f8a038 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0

From 7889a00f341645ebfe3ff6e767031ad5461fc838 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v24 09/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318fc7c..6f7bedc130 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f871a..4ba80869b9 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

From 8f9b9d1569dfacf20b205c893f98379c9f854662 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v24 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction.  On receipt of such a sqlerrcode, the
decoding logic aborts the ongoing decoding and returns
gracefully.
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 1b56daa4bb..5f7394f3c1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,9 +432,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94eb37d48d..2d77107c4f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * the tableam API level, but heap_getnext is called from many places, so
+	 * we need to ensure it here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set, set a flag to indicate that a system table
+	 * scan is in progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted.  We can't use TransactionIdDidAbort
+ * directly, as after a crash such a transaction might not have been
+ * marked as aborted.  See detailed comments at snapmgr.c where the variable
+ * is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..2f52b407c6 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..9f1ecd123f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching a tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8c34935c34..9d890d3c4b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0

From 25ceb8fbf1534abb3bec472a742b22216d32de7c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Tue, 19 May 2020 18:55:23 +0530
Subject: [PATCH v24 06/12] Bugfix handling of incomplete toast/spec insert
 tuple

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 324 ++++++++++++------
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  39 ++-
 5 files changed, 277 insertions(+), 107 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2d77107c4f..3927448f46 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45ef6..c841687c66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2cdfb348af..fe2d0011c4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -254,6 +269,8 @@ static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
 static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static inline void ReorderBufferTXNDeleteChange(ReorderBufferTXN *txn,
+												ReorderBufferChange *change);
 
 /* ---------------------------------------
  * toast reassembly support
@@ -646,12 +663,71 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 	return txn;
 }
 
+/*
+ * Handle incomplete tuples during streaming.  If streaming is enabled, we
+ * might need to stream an in-progress transaction, but sometimes we get
+ * incomplete changes that cannot be streamed until the completing change
+ * arrives, e.g. a toast table insert without the main table insert.  So
+ * this function remembers the lsn of the last complete change and the
+ * total size up to that lsn, so that when we need to stream we stream
+ * only up to the last complete lsn.
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert)
+{
+	/* If streaming is not enabled, there is nothing to do. */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		txn = txn->toptxn;
+
+	/*
+	 * If this is the first incomplete change, remember the size of the
+	 * complete changes accumulated so far.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(txn)) &&
+		(toast_insert || IsSpecInsert(change->action)))
+		txn->complete_size = txn->total_size;
+
+	/*
+	 * If this is a toast insert, set the corresponding bit.  Both inserts
+	 * and updates on the main table perform inserts into the toast table,
+	 * and as explained in the function header we cannot stream toast-only
+	 * changes.  So we set the flag on a toast insert and clear it again on
+	 * the next insert or update on the main table.
+	 */
+	if (toast_insert)
+		txn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) && IsInsertOrUpdate(change->action))
+		txn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial tuple and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		txn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		txn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If there is no incomplete change pending after this change, record
+	 * this LSN as the last complete lsn.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(txn)))
+		txn->last_complete_lsn = change->lsn;
+}
+
 /*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
@@ -660,6 +736,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	change->lsn = lsn;
 	change->txn = txn;
 
+	/* Handle the incomplete tuple if it's a toast/spec insert */
+	ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert);
+
 	Assert(InvalidXLogRecPtr != lsn);
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries++;
@@ -697,7 +776,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1412,6 +1491,30 @@ static void
 ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
 	dlist_mutable_iter iter;
+	ReorderBufferTXN *toptxn;
+
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1438,30 +1541,28 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
-		/* remove the change from it's containing list */
-		dlist_delete(&change->node);
+		/* We have truncated up to the last complete lsn, so stop. */
+		if (rbtxn_has_incomplete_tuple(toptxn) &&
+			(change->lsn > toptxn->last_complete_lsn))
+		{
+			/*
+			 * If this is a top-level transaction, we can reset
+			 * last_complete_lsn and complete_size, because by now we have
+			 * streamed all the changes up to last_complete_lsn.
+			 */
+			if (txn->toptxn == NULL)
+			{
+				toptxn->last_complete_lsn = InvalidXLogRecPtr;
+				toptxn->complete_size = 0;
+			}
+			break;
+		}
 
+		/* remove the change from its containing list */
+		ReorderBufferTXNDeleteChange(txn, change);
 		ReorderBufferReturnChange(rb, change);
 	}
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The toplevel transaction, identified by (toptxn==NULL), is marked
-	 * as streamed always, even if it does not contain any changes (that
-	 * is, when all the changes are in subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts
-	 * for XIDs the downstream is not aware of. And of course, it always
-	 * knows about the toplevel xact (we send the XID in all messages),
-	 * but we never stream XIDs of empty subxacts.
-	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
 	 * any memory. We could also keep the hash table and update it with
@@ -1473,9 +1574,15 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
-	/* also reset the number of entries in the transaction */
-	txn->nentries_mem = 0;
-	txn->nentries = 0;
+	/*
+	 * If this txn is serialized and there are no more entries on disk,
+	 * clean up the disk space.
+	 */
+	 */
+	if (rbtxn_is_serialized(txn) && (txn->nentries == txn->nentries_mem))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
 }
 
 /*
@@ -1732,6 +1839,20 @@ ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					change->data.msg.message);
 }
 
+/*
+ * While streaming a transaction, incomplete tuples may prevent us from
+ * streaming all the changes.  So whenever we delete a change from the
+ * change list, we must also update the entry counts.
+ */
+static inline void
+ReorderBufferTXNDeleteChange(ReorderBufferTXN *txn, ReorderBufferChange *change)
+{
+	/* Delete the node and decrement the nentries_mem and nentries count. */
+	dlist_delete(&change->node);
+	change->txn->nentries_mem--;
+	change->txn->nentries--;
+}
+
 /*
  * Function to store the command id and snapshot at the end of the current
  * stream so that we can reuse the same while sending the next stream.
@@ -1955,8 +2076,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						Assert(change->data.tp.newtuple != NULL);
-
-						dlist_delete(&change->node);
+						ReorderBufferTXNDeleteChange(change->txn, change);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
 					}
@@ -2002,8 +2122,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						specinsert = NULL;
 					}
 
-					/* and memorize the pending insertion */
-					dlist_delete(&change->node);
+					/*
+					 * Remove from the change list and memorize the pending
+					 * insertion
+					 */
+					ReorderBufferTXNDeleteChange(change->txn, change);
 					specinsert = change;
 					break;
 
@@ -2118,6 +2241,15 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
 			}
+
+			/*
+			 * If the transaction contains an incomplete tuple and this is
+			 * the last complete change, stop further processing of the
+			 * transaction.
+			 */
+			if (rbtxn_has_incomplete_tuple(txn) &&
+				prev_lsn == txn->last_complete_lsn)
+				break;
 		}
 
 		/*
@@ -2515,7 +2647,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2564,7 +2696,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2587,6 +2719,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2601,8 +2734,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2610,12 +2748,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2676,7 +2822,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2860,18 +3006,28 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
+		Size	size;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/*
+		 * If this transaction has some incomplete changes, only consider
+		 * the size up to the last complete lsn.
+		 */
+		if (rbtxn_has_incomplete_tuple(txn))
+			size = txn->complete_size;
+		else
+			size = txn->total_size;
+
+		/* If the current transaction is larger, remember it. */
+		if (size > 0 &&
+			(largest == NULL ||
+			 size > (rbtxn_has_incomplete_tuple(largest) ?
+					 largest->complete_size : largest->total_size)))
 			largest = txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2889,66 +3045,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
-	{
-		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
-		 */
-		txn = ReorderBufferLargestTopTXN(rb);
-
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
-
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
+	/* Loop until we are under the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* the transaction we picked must be a non-empty toplevel one */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferSerializeTXN(rb, txn);
-	}
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
+			ReorderBufferSerializeTXN(rb, txn);
+
+			/*
+			 * After eviction, the transaction should have no entries in
+			 * memory, and should use 0 bytes for changes.
+			 */
+			Assert(txn->size == 0);
+			Assert(txn->nentries_mem == 0);
+		}
+	}
 
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
@@ -3344,10 +3480,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
-
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
 }
 
 /*
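
To see the effect of the incomplete-tuple bookkeeping above, here is a
self-contained toy model. It mirrors only the toast-insert flag handling of
ReorderBufferHandleIncompleteTuple (all names are made up; the real code
also handles speculative inserts and tracks complete_size):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_HAS_TOAST_INSERT	0x01
#define TOY_HAS_SPEC_INSERT		0x02

typedef struct ToyTxn
{
	uint32_t	flags;
	uint64_t	last_complete_lsn;
} ToyTxn;

static bool
toy_has_incomplete(const ToyTxn *txn)
{
	return (txn->flags & (TOY_HAS_TOAST_INSERT | TOY_HAS_SPEC_INSERT)) != 0;
}

/* Queue one insert; toast inserts leave the transaction "incomplete". */
static void
toy_queue_change(ToyTxn *txn, uint64_t lsn, bool toast_insert)
{
	if (toast_insert)
		txn->flags |= TOY_HAS_TOAST_INSERT;
	else
		txn->flags &= ~TOY_HAS_TOAST_INSERT;

	/* no incomplete change pending -> safe to stream up to this lsn */
	if (!toy_has_incomplete(txn))
		txn->last_complete_lsn = lsn;
}

int
main(void)
{
	ToyTxn		txn = {0, 0};

	toy_queue_change(&txn, 100, false);	/* main-table insert */
	toy_queue_change(&txn, 110, true);	/* toast chunk */
	toy_queue_change(&txn, 120, true);	/* toast chunk */
	printf("streamable up to lsn %llu\n",
		   (unsigned long long) txn.last_complete_lsn);	/* prints 100 */

	toy_queue_change(&txn, 130, false);	/* main-table insert completes it */
	printf("streamable up to lsn %llu\n",
		   (unsigned long long) txn.last_complete_lsn);	/* prints 130 */
	return 0;
}
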
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b3e2b3f64b..a9b1aacdb1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +221,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -350,6 +368,15 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of the top transaction including sub-transactions. */
+	Size		total_size;
+
+	/* Size of the complete changes. */
+	Size		complete_size;
+
+	/* LSN of the last complete change. */
+	XLogRecPtr	last_complete_lsn;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -537,7 +564,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0
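
To make the eviction choice above concrete with made-up numbers: suppose the
memory limit is exceeded while two toplevel transactions are buffered, A with
total_size 40MB but a pending toast insert (complete_size 5MB), and B with
30MB of complete changes. A's streamable size is only 5MB, so
ReorderBufferLargestTopTXN picks B and streams it first; once no transaction
has any streamable changes left, it returns NULL and the loop falls back to
serializing the largest (sub)transaction to disk.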

From 059bdcce70378c507d1b79f6df34d4c2b5dc7ff2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 14 May 2020 21:27:46 +0530
Subject: [PATCH v24 08/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
XIDs of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transaction by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere
to send the data anyway.
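
To sketch the offset bookkeeping the abort handling relies on (the names,
the raw file descriptor, and the caller-supplied array below are simplifying
assumptions; the real logic is in the worker.c changes that follow):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct ToySubXact
{
	uint32_t	xid;			/* XID of the subxact */
	off_t		offset;			/* file offset where its changes begin */
} ToySubXact;

/* On the first change of a new subxact, remember where it starts. */
static void
toy_subxact_add(ToySubXact *arr, uint32_t *n, uint32_t xid, int fd)
{
	if (*n > 0 && arr[*n - 1].xid == xid)
		return;					/* already tracked */
	arr[*n].xid = xid;
	arr[*n].offset = lseek(fd, 0, SEEK_CUR);
	(*n)++;
}

/* On subxact abort, drop its changes and those of any later subxacts. */
static int
toy_subxact_abort(ToySubXact *arr, uint32_t *n, uint32_t xid, int fd)
{
	uint32_t	i;

	for (i = *n; i > 0; i--)
	{
		if (arr[i - 1].xid == xid)
		{
			if (ftruncate(fd, arr[i - 1].offset) != 0)
				return -1;		/* caller reports the error */
			*n = i - 1;
			return 0;
		}
	}
	return 0;					/* empty subxact: nothing was spooled */
}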
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   12 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1046 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 ++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 20 files changed, 2054 insertions(+), 40 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8beadbaa..95b7c24ef9 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..3349cc4bfc 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026187..9065a1be1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d7f99d9944..e843d1e658 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4139,6 +4139,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..83d0642cf3 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
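
For reference, the wire format added above, as read off the pq_send* calls
in proto.c (a reading aid, not a spec):

  Message         Byte   Payload
  --------------  -----  ----------------------------------------------
  STREAM START    'S'    xid (int32), first_segment (int8: 1 or 0)
  STREAM STOP     'E'    (none)
  STREAM COMMIT   'c'    xid (int32), flags (int8, must be 0),
                         commit_lsn (int64), end_lsn (int64),
                         commit_time (int64)
  STREAM ABORT    'A'    xid (int32), subxid (int32)

In addition, the existing per-change messages ('R', 'Y', 'I', 'U', 'D', 'T')
are prefixed with an int32 xid, but only when sent inside a stream.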
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..e4e52f10f8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * also requires dealing with aborts of both the toplevel transaction and its
+ * subtransactions. This is achieved by tracking per-subtransaction offsets,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing a
+ * remote transaction
+ * with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId	xid;		/* XID of the subxact */
+	off_t			offset;		/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +657,329 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the existing subxact info.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+	{
+		/*
+		 * Pass the missing_ok as true so that if we haven't got any changes
+		 * Pass missing_ok as true so that we don't raise an error if we
+		 * haven't received any changes for the top transaction (i.e. it
+		 * was empty).
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, true);
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		int			fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction, we will not find the subxid
+		 * here, so just free the memory and return.
+		 */
+		if (!found)
+		{
+			/* Free the subxacts memory */
+			if (subxacts)
+				pfree(subxacts);
+
+			subxacts = NULL;
+			subxact_last = InvalidTransactionId;
+			nsubxacts = 0;
+			nsubxacts_max = 0;
+
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
+		if (fd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+		}
+
+		/* OK, truncate the file at the right offset. */
+		if (ftruncate(fd, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+		CloseTransientFile(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Make sure the change is applied in the per-message context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +993,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
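+	/*
+	 * If we're inside a streamed chunk of a large transaction, the message
+	 * is only spooled to a file here (see handle_streamed_transaction and
+	 * stream_write_change); it will be applied when the STREAM COMMIT
+	 * arrives.
+	 */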
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1011,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1050,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1168,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1313,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1686,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1827,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1939,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
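+	/*
+	 * Walk the array backwards: stream_cleanup_files() removes an XID by
+	 * moving the last array element into its place, so a backward scan
+	 * visits each entry exactly once.
+	 */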
+	for (i = nxids - 1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1971,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2422,570 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
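+ *
+ * On-disk layout (matching the writes below): a uint32 CRC32C checksum,
+ * then the number of subxacts, then the SubXactInfo array itself.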
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * We do, however, free the memory allocated for the subxact info. There
+	 * might be one exceptional transaction with many subxacts, and we don't
+	 * want to keep that memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
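+	/* (my_log2 rounds up, so this always covers all nsubxacts items) */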
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd >= 0);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen
+	 * in the last call, so make sure to ignore it (we only need to record
+	 * the offset of a subxact's first change; later changes are irrelevant).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
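+	/* e.g. "pgsql_tmp<pid>-<subid>-<xid>.subxacts" in the temp tablespace */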
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
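+	/* e.g. "pgsql_tmp<pid>-<subid>-<xid>.changes", mirroring subxact_filename */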
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - don't report an error for a missing file if the flag is true.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by replacing it with the last element. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions), so simply loop
+	 * through it to find the index of the XID.
+	 *
+	 * Notice we also call this from stream_open_file for the first
+	 * segment of each transaction, to deal with possible left-overs
+	 * after a crash, so it's entirely possible not to find the XID in
+	 * the array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry from the array into the place of the removed
+	 * XID. We don't keep the streamed transactions sorted or anything -
+	 * we only expect a few of them in progress (max_connections +
+	 * max_prepared_xacts), so a linear search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by the given xid. If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
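+ *
+ * On-disk record layout (as produced below): an int32 length, a single
+ * action byte, then the message payload; the length covers the action
+ * byte and payload but not the length field itself.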
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3151,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3118..1509f9b826 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The downstream schema cache is, however, updated only at commit time,
+ * and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each
+ * (sub)transaction so that we don't lose the schema information on abort.
+ * To handle this, we maintain a list of xids (streamed_txns) for which we
+ * have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version,
+		 * and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +373,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may not be applied until later (and the
+	 * regular transactions won't see their effects until then), and in an
+	 * order that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +423,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +465,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +486,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +518,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +562,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +582,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +607,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +639,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +719,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
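+/*
+ * Notify downstream to open a block of streamed changes for this transaction.
+ */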
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
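+	/*
+	 * The last argument marks this as the first chunk for the transaction
+	 * (it's not marked as streamed yet), so the subscriber knows to
+	 * discard any stale spool files before creating new ones.
+	 */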
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
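+/*
+ * Notify downstream to close the current block of streamed changes.
+ */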
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +840,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * Check if we already sent the schema for the relation in the given streamed
+ * transaction. We expect a relatively small number of streamed transactions,
+ * so the linear search of the per-entry list is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid in the rel sync entry for which we have already sent the schema
+ * of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -753,11 +1002,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6fed3cfd23..e1344ab4cc 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -157,6 +157,12 @@ create_logical_replication_slot(char *name, char *plugin,
 											   .segment_close = wal_segment_close),
 									NULL, NULL, NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index adb7d7962e..9731b86d1f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1020,6 +1020,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..899d7e2013 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -980,7 +980,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ac1acbb27e..95132062c6 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
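+# Use a low decoding work-mem limit, so that the test transactions exceed
+# it and get streamed to the subscriber.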
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction containing DDL, subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rollback to savepoint was reflected on subscriber');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

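A note on these tests: logical_decoding_work_mem = 64kB is deliberately
tiny, so that even a few thousand small rows push the reorder buffer over
the limit and force eviction. A minimal sketch of the check the tests
exercise, assuming the accounting added by 0001 (rb->size in bytes, the
GUC in kilobytes); pick_largest_txn() is a stand-in for the real
selection logic, not a function in the patch:

/*
 * Sketch only: evict transactions until we are back under the memory
 * limit. With the streaming patches applied, the eviction step may
 * stream the transaction downstream instead of serializing it to disk.
 */
static void
sketch_enforce_memory_limit(ReorderBuffer *rb)
{
	while (rb->size >= logical_decoding_work_mem * 1024L)
	{
		/* pick the largest toplevel transaction ... */
		ReorderBufferTXN *txn = pick_largest_txn(rb);	/* hypothetical */

		/* ... and spill its changes to disk, freeing the memory */
		ReorderBufferSerializeTXN(rb, txn);
	}
}

(Also, the wait_for_caught_up helper is copied into each of these test
files; PostgresNode already provides wait_for_catchup, which could
probably replace it.)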
v24/v24-0002-Issue-individual-invalidations-with-wal_level-lo.patch:

From ebb6dbc7cd2e8227cd085d62736f95015334c5ad Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v24 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in memory
and writes them out only once, at commit time, which may reduce the
performance impact by amortizing the overhead and deduplicating the
invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c        |  40 +++++++
 src/backend/access/transam/xact.c             |   7 ++
 src/backend/replication/logical/decode.c      |  16 +++
 .../replication/logical/reorderbuffer.c       | 104 +++++++++++++++---
 src/backend/utils/cache/inval.c               |  49 +++++++++
 src/include/access/xact.h                     |  13 ++-
 src/include/replication/reorderbuffer.h       |  11 ++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce75565f..7ab0d11ea9 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3af8e81af1..e576b10055 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581d0f..69c1f45ef6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9509..b889edf461 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2202,6 +2216,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Queue invalidation messages as a change in the specified transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2589,6 +2630,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3002,6 +3066,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..cba5b6c64b 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end
+ *	to support decoding of in-progress transactions.  Previously it was
+ *	enough to log invalidations only at commit time, because transactions
+ *	were only decoded once they committed.  We only need to log catalog
+ *	cache and relcache invalidations; there cannot be any active MVCC scan
+ *	in logical decoding, so snapshot invalidations need not be logged.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b3816c..b822c5e4b2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..af35287896 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.23.0

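To make the new record format concrete, here is a minimal sketch (not
part of the patch) of how a consumer could walk an XLOG_XACT_INVALIDATIONS
payload. It relies only on the xl_xact_invalidations layout and the
shared-inval message ids used in xact_desc_invalidations above:

/*
 * Sketch only: interpreting the payload of an XLOG_XACT_INVALIDATIONS
 * record, given the xl_xact_invalidations layout introduced above.
 */
static void
sketch_walk_invalidations(XLogReaderState *record)
{
	xl_xact_invalidations *xlrec = (xl_xact_invalidations *) XLogRecGetData(record);
	TransactionId xid = XLogRecGetXid(record);
	int			i;

	for (i = 0; i < xlrec->nmsgs; i++)
	{
		SharedInvalidationMessage *msg = &xlrec->msgs[i];

		/* non-negative ids are catcache messages, negative ids are special */
		if (msg->id >= 0)
			elog(DEBUG1, "xid %u: catcache inval %d", xid, msg->id);
		else if (msg->id == SHAREDINVALRELCACHE_ID)
			elog(DEBUG1, "xid %u: relcache inval for relation %u",
				 xid, msg->rc.relId);
	}
}

The decoding path in this patch does essentially this, except that it
queues the messages as a REORDER_BUFFER_CHANGE_INVALIDATION change
instead of printing them.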
v24/v24-0010-Add-TAP-test-for-streaming-vs.-DDL.patch:

From 2901b64dfa1bf8f23bbec43f8537f2f67fe1c1af Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v24 10/12] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of a large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check data was replicated through DDL and rollbacks');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

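Why this test depends on the per-command invalidations from 0002: when a
streamed block is replayed, DDL executed earlier in the still-in-progress
transaction must already be visible to the changes that follow it. A
sketch of that interleaving, under the assumption that the replay loop
looks roughly like the commit path (next_change() and relation_of() are
hypothetical placeholders, not functions in the patch):

/*
 * Sketch only: streamed replay must execute queued invalidations
 * in-line, so that an INSERT decoded after ALTER TABLE sees the new
 * catalog contents.
 */
static void
sketch_stream_block(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	ReorderBufferChange *change;

	rb->stream_start(rb, txn, txn->first_lsn);

	while ((change = next_change(txn)) != NULL)		/* hypothetical */
	{
		if (change->action == REORDER_BUFFER_CHANGE_INVALIDATION)
			ReorderBufferExecuteInvalidations(change->data.inval.ninvalidations,
											  change->data.inval.invalidations);
		else
			rb->stream_change(rb, txn, relation_of(change),	/* hypothetical */
							  change);
	}

	rb->stream_stop(rb, txn, txn->final_lsn);
}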
v24/v24-0012-Add-streaming-option-in-pg_dump.patch:

From c1c1e920db284580106e270053dc73fb5fe788c7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v24 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index a4e949c636..debb52af49 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb3a9..af64270c55 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0

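For reference: with this patch, a subscription that has streaming enabled
should round-trip through pg_dump with the option included in the WITH
clause, i.e. the emitted CREATE SUBSCRIPTION statement gains
", streaming = on" alongside the other options (the exact placement
follows pg_dump's usual output, so treat this rendering as approximate).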
v24/v24-0003-Extend-the-output-plugin-API-with-stream-methods.patch:

From bbd0b76f1ada9c6d5e53aef22f6e96ac53ce4017 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v24 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 213 +++++++++++++
 src/backend/replication/logical/logical.c | 365 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 811 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe620..1b56daa4bb 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and one optional callback
+    (<function>stream_message_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to the spill-to-disk behavior, streaming is triggered when the
+    total amount of changes decoded from the WAL (for all in-progress
+    transactions) exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting. At that point the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.  However, in some
+    cases we still have to spill to disk even if streaming is enabled,
+    because we may exceed the memory limit before having decoded a complete
+    tuple (e.g. the TOAST-table insert but not the main-table insert).
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index dc69e5ce5f..0cff1ac393 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the change/commit/abort callbacks. The
+	 * message callback is optional, similar to regular output plugins.
+	 * However, we consider streaming supported when at least one of the
+	 * methods is defined, so that we can easily identify missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,321 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287896..65814af9f5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
 struct ReorderBuffer
 {
 	/*
@@ -392,6 +440,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0
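
To make the new callback surface concrete, here is a minimal sketch of how an
output plugin might register the streaming callbacks. The struct fields and
callback signatures are the ones added by the patch; the plugin itself and
its no-op bodies are hypothetical placeholders.

#include "postgres.h"

#include "fmgr.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

/* Hypothetical no-op callbacks; a real plugin would serialize the data. */
static void
demo_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
}

static void
demo_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
}

static void
demo_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				   Relation relation, ReorderBufferChange *change)
{
}

static void
demo_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				  XLogRecPtr abort_lsn)
{
}

static void
demo_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				   XLogRecPtr commit_lsn)
{
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* ... the usual begin_cb/change_cb/commit_cb assignments ... */

	/*
	 * Per the wrappers above, start/stop/change/abort/commit are mandatory
	 * once streaming is in use; stream_truncate_cb and stream_message_cb
	 * remain optional.
	 */
	cb->stream_start_cb = demo_stream_start;
	cb->stream_stop_cb = demo_stream_stop;
	cb->stream_change_cb = demo_stream_change;
	cb->stream_abort_cb = demo_stream_abort;
	cb->stream_commit_cb = demo_stream_commit;
}

Whether streaming is actually exercised is then governed by the new
ctx->streaming flag in LogicalDecodingContext, which the wrappers assert.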

#323Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#316)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 22, 2020 at 4:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

v22-0006-Add-support-for-streaming-to-built-in-replicatio
----------------------------------------------------------------------------

Few more comments on v22-0006 patch:

1.
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+ int i;
+ char path[MAXPGPATH];
+ bool found = false;
+
+ subxact_filename(path, subid, xid);
+
+ if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));

Here, we have unlinked the files containing information of subxacts
but don't we need to free the corresponding memory (memory for
subxacts) as well?

Basically, stream_cleanup_files is used for:
1) cleaning up the files on worker exit;
2) while writing the first segment for an xid, cleaning up to ensure
there are no orphaned files with the same xid;
3) cleaning up the file after apply commit.

The subxacts memory, on the other hand, is only used between stream
start and stream stop: as soon as the stream stops, we write the
subxacts changes to the file and free the memory. So there is no case
where subxact memory can still be allocated at stream_cleanup_files,
except on worker exit, and there we are already exiting the worker
anyway. IMHO we don't need to free the memory there.

2.
apply_handle_stream_abort()
{
..
+ subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+
+ return;
..
}

Like the previous comment, it seems here also we need to free subxacts
memory and additionally we forgot to adjust the xids array as well.

Here, we allocate memory in subxact_info_read, but we then call
subxact_info_write again, which frees that memory.

3.
apply_handle_stream_abort()
{
..
+ /* XXX optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ return;
..
}

Is it possible that we don't find the xid in the subxacts array? If
so, I think we should mention that in the comments; otherwise, we
should have an assert on found.

We may not find it if the subtransaction was empty; I have changed the comments.
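
A standalone sketch of the suggested bsearch, assuming the array is kept
sorted by xid; SubXactInfo and its fields are illustrative stand-ins for the
patch's bookkeeping struct, not its actual definition:

#include <stdlib.h>
#include <sys/types.h>

typedef unsigned int TransactionId;	/* stand-in for the PostgreSQL typedef */

typedef struct SubXactInfo
{
	TransactionId xid;			/* subtransaction XID, the sort key */
	off_t		offset;			/* offset into the changes file */
} SubXactInfo;

static int
subxact_cmp(const void *a, const void *b)
{
	TransactionId xa = ((const SubXactInfo *) a)->xid;
	TransactionId xb = ((const SubXactInfo *) b)->xid;

	/* avoid subtraction, which could overflow for unsigned values */
	return (xa > xb) - (xa < xb);
}

/* Returns the matching entry, or NULL for the empty-transaction case above. */
static SubXactInfo *
subxact_find(SubXactInfo *subxacts, size_t nsubxacts, TransactionId subxid)
{
	SubXactInfo key = {.xid = subxid};

	return bsearch(&key, subxacts, nsubxacts,
				   sizeof(SubXactInfo), subxact_cmp);
}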

4.
apply_handle_stream_abort()
{
..
+ changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+ if (truncate(path, subxacts[subidx].offset))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not truncate file \"%s\": %m", path)));
..
}

Will truncate work on Windows? I see in the code we have ftruncate,
which is defined as chsize in win32.h and win32_port.h. I have not
tested this, so I am not very sure about it. I got the below warning
when I tried to compile this code on Windows. I think it is better to
use ftruncate, as it is used in other places in the code as well.

worker.c(798): warning C4013: 'truncate' undefined; assuming extern
returning int

I have changed it to ftruncate.
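
The portable pattern is to truncate through a file descriptor rather than a
path, since win32_port.h maps ftruncate to chsize but provides no truncate().
A sketch using PostgreSQL's transient-file helpers; the function name and its
arguments are illustrative, not the patch's actual code:

#include "postgres.h"

#include <fcntl.h>
#include <unistd.h>

#include "storage/fd.h"

/* Truncate the changes file back to 'keep_bytes', discarding the rest. */
static void
changes_file_truncate(const char *path, off_t keep_bytes)
{
	int			fd;

	fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
	if (fd < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", path)));

	/* works on Windows too, where ftruncate is defined as chsize */
	if (ftruncate(fd, keep_bytes) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not truncate file \"%s\": %m", path)));

	if (CloseTransientFile(fd) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not close file \"%s\": %m", path)));
}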

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#324Erik Rijkers
er@xs4all.nl
In reply to: Dilip Kumar (#322)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On 2020-05-25 16:37, Dilip Kumar wrote:

On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have further reviewed v22 and below are my comments:

[v24.tar]

Hi,

I am not able to extract all files correctly from this tar.

The first file v24-0001-* seems to have some 'binary' junk at the top.

(The other 11 files seem normally readable.)

Erik Rijkers

#325Dilip Kumar
dilipbalaut@gmail.com
In reply to: Erik Rijkers (#324)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, May 25, 2020 at 8:48 PM Erik Rijkers <er@xs4all.nl> wrote:

Hi,

I am not able to extract all files correctly from this tar.

The first file v24-0001-* seems to have some 'binary' junk at the top.

(The other 11 files seem normally readably)

Okay, sending again.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v24.tar (application/x-tar)
v24/v24-0001-Immediately-WAL-log-assignments.patch:

From 60f0c359fb811a8f6607c942a8fc025c2f0a53cb Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v24 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features that require
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is still
required to avoid overflow of the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 39 ++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd30b62d36..3af8e81af1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it needs to have 'assigned' */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09e..53be2b3059 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5995798b58..560ec27fa0 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1195,6 +1195,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1233,6 +1234,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..122c581d0f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel xid is valid, we need to assign the subxact to the
+	 * toplevel xact. We need to do this for all records, hence we do it before
+	 * the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04babc2..8645b3816c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d..26426cc779 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index c21b0ba972..83170a663c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -308,6 +310,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0
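
Condensed from the decode.c hunk above, the consumer-side pattern this patch
establishes is: check each record for a toplevel XID and, if present,
associate the record's (sub)xact with it before any further decoding.
XLogRecGetTopXid, XLogRecGetXid and ReorderBufferAssignChild come from the
tree (or this patch); the helper wrapping them is illustrative:

#include "postgres.h"

#include "access/transam.h"
#include "access/xlogreader.h"
#include "replication/reorderbuffer.h"

/*
 * Associate a subtransaction with its toplevel transaction as soon as the
 * first WAL record carrying the assignment is decoded.
 */
static void
maybe_assign_subxact(ReorderBuffer *rb, XLogReaderState *record,
					 XLogRecPtr origptr)
{
	TransactionId top_xid = XLogRecGetTopXid(record);

	/* InvalidTransactionId means this record carries no assignment */
	if (TransactionIdIsValid(top_xid))
		ReorderBufferAssignChild(rb, top_xid,
								 XLogRecGetXid(record), origptr);
}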

v24/v24-0005-Implement-streaming-mode-in-ReorderBuffer.patch:

From d3e05a511d3db8aa849ca385c145c7a676ac77d4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 May 2020 19:56:35 +0530
Subject: [PATCH v24 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes
we have in memory and invoke new stream API methods. This happens
in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().  However, sometimes, if we have an incomplete
TOAST chunk or a speculative insert, we spill to disk because we
cannot assemble the complete tuple to stream.  As soon as we get the
complete tuple, we stream the transaction including the serialized
changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 758 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  31 +
 3 files changed, 750 insertions(+), 77 deletions(-)
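
The streaming mode added below has to cope with concurrent aborts:
SetupCheckXidLive() arms a check, and catalog access is then expected to
raise ERRCODE_TRANSACTION_ROLLBACK, which the PG_CATCH block in
ReorderBufferProcessTXN handles. A schematic sketch of that catalog-side
check, which is not part of this particular diff:

#include "postgres.h"

#include "access/transam.h"
#include "storage/procarray.h"

extern TransactionId CheckXidAlive;	/* armed by SetupCheckXidLive() */

/*
 * If the xid we are streaming is neither running nor committed, it must
 * have aborted concurrently, so bail out with a distinguishable error.
 */
static void
check_concurrent_abort(void)
{
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}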

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b889edf461..2cdfb348af 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +382,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -772,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -865,6 +911,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -1024,6 +1073,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1090,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1315,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1340,6 +1404,80 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from it's containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1491,57 +1629,171 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such case if the
+ * (sub)transaction has catalog update then we might decode the tuple using
+ * wrong catalog version.  So for detecting the concurrent abort we set
+ * CheckXidAlive to the current (sub)transaction's xid for which this change
+ * belongs to.  And, during catalog scan we can check the status of the xid and
+ * if it is aborted we will report a specific error so that we can stop
+ * streaming current transaction and discard the already streamed changes on
+ * such an error.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine because when we decode the abort
+ * we will stream abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as a CheckXidAlive then
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid was aborted; that will happen during catalog access.
+	 * Also, reset the bsysscan flag.
+	 */
+	if (!TransactionIdDidCommit(xid))
+	{
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse the same while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.
+ */
+static void
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   Snapshot snapshot_now,
+								   CommandId command_id,
+								   XLogRecPtr last_lsn,
+								   ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+	ReorderBufferToastReset(rb, txn);
+	if (specinsert != NULL)
+		ReorderBufferReturnChange(rb, specinsert);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool	stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1564,21 +1816,46 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
-
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Start stream or begin transaction for the first change in the
+			 * current stream.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+				else
+					rb->begin(rb, txn);
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1655,7 +1932,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1695,7 +1973,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +2031,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +2043,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,7 +2074,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1858,14 +2135,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, call stream_stop callback for streaming
+		 * transaction, commit callback otherwise.
+		 */
+		if (streaming)
+		{
+			rb->stream_stop(rb, txn, prev_lsn);
+			stream_started = false;
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if transaction is streaming
+		 * otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2179,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,17 +2214,118 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If the error code is ERRCODE_TRANSACTION_ROLLBACK, that means we
+		 * have detected a concurrent abort of the (sub)transaction we are
+		 * streaming.  So just do the cleanup and return gracefully.
+		 * Otherwise, re-throw the error.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * We can get this error only in streaming mode, because only in
+			 * streaming mode do we send in-progress transactions.
+			 */
+			Assert(streaming);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/*
+			 * In the TRY block we only stop the stream after we have sent
+			 * all the changes.  So if we have detected the concurrent abort,
+			 * the stream should not have been stopped yet.
+			 */
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
 
-		PG_RE_THROW();
+			/* Handle the concurrent abort. */
+			ReorderBufferHandleConcurrentAbort(rb, txn, snapshot_now,
+											   command_id, prev_lsn,
+											   specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.  We iterate over the top and
+ * subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Access the main routine to decode the changes and send to output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1946,6 +2350,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2426,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2568,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we update the toplevel transaction counters
+ * instead - we can't stream subtransactions individually anyway, and we
+ * only pick toplevel transactions for eviction, so only the toplevel
+ * counters matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2586,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2598,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2295,6 +2733,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2398,6 +2843,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (with streaming we don't update the
+ * memory accounting for subtransactions, so their size is always 0), but it
+ * simply iterates over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
@@ -2418,15 +2895,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3254,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has any changes to stream
+ * (it might have been streamed just before the commit, in which case the
+ * commit would attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there's no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because we might have
+		 * gotten some new subtransactions after the last streaming run, and
+		 * we need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3864,6 +4468,16 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases we assume the CID is from the future command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 65814af9f5..b3e2b3f64b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -224,6 +243,11 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Toplevel transaction for this subxact (NULL for toplevel transactions).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -254,6 +278,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0

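To make the eviction path above easier to exercise, here is a minimal
SQL sketch using the test_decoding plugin. The table name (t) and the
64kB value are illustrative assumptions, not part of the patch:

    -- create a slot, then generate a transaction larger than the limit
    SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', 'test_decoding');
    CREATE TABLE t (a int);
    INSERT INTO t SELECT generate_series(1, 100000);

    -- decode with a low limit, so the largest transaction gets evicted
    SET logical_decoding_work_mem = '64kB';
    SELECT count(*) FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);

With pg_logical_slot_get_changes() streaming stays disabled (see the
0011 patch below), so the eviction here falls back to serializing the
largest transaction to disk rather than streaming it.
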
From 1b39aa9c0729326ca5b0539b45774eea50647c7b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v24 11/12] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 doc/src/sgml/test-decoding.sgml               | 22 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..eed6e9d134 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 9f509fbc21..5fe6f28ba2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1243,6 +1243,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e848..70c28ffa91 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes, disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7869f721da..875e0bef28 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10115,6 +10115,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0

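To try the new function by hand, a sketch along the lines of the
documentation example above (the table name, row count and memory limit
are illustrative assumptions; the slot uses test_decoding as in that
example):

    -- session 1: leave a large transaction open, uncommitted
    BEGIN;
    INSERT INTO stream_test SELECT md5(g::text) FROM generate_series(1, 20000) g;

    -- session 2: decode with a low memory limit
    SET logical_decoding_work_mem = '64kB';
    SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);

Unlike pg_logical_slot_get_changes(), this variant leaves ctx->streaming
enabled, so once the memory limit is exceeded the in-progress transaction
is sent out in stream blocks instead of being spilled to disk.
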
From fa13067af5b05db4d8932f071f3dccab35e602e1 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Tue, 19 May 2020 19:08:16 +0530
Subject: [PATCH v24 07/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 33 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 12 +++++++
 src/backend/replication/walsender.c           | 32 +++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 49d4bb13b9..0fc896ca7e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2453,6 +2453,39 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Amount of decoded transaction data spilled to disk.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_txns</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of in-progress transactions streamed to the subscriber after
+       the memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+       Streaming only works with toplevel transactions (subtransactions can't
+       be streamed independently), so the counter does not get incremented for
+       subtransactions.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times in-progress transactions were streamed to the subscriber.
+       Transactions may get streamed repeatedly, and this counter gets incremented
+       on every such invocation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_bytes</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Amount of decoded in-progress transaction data streamed to the subscriber.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 56420bbc9d..9f509fbc21 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fe2d0011c4..5c211d0c70 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,6 +349,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3475,6 +3479,14 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferFreeSnap(rb, txn->snapshot_now);
 	}
 
+	/* Update the stream statistics. */
+	rb->streamCount += 1;
+	rb->streamBytes += (rbtxn_has_incomplete_tuple(txn)) ?
+						txn->complete_size : txn->total_size;
+
+	/* Don't count the transaction again if it was already streamed before. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	/*
 	 * Call the main routine to decode the changes and send them to the output plugin.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 86847cbb54..adb7d7962e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1353,7 +1353,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1374,7 +1374,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that have spilled to disk or been
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2423,6 +2424,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3258,7 +3262,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3316,6 +3320,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3341,6 +3348,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3443,6 +3453,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3691,11 +3706,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 61f2c2f5b4..7869f721da 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a9b1aacdb1..1ced4caaae 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -541,15 +541,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b813e32215..cf22f8a038 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0

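With the stats patch applied, the new counters sit next to the existing
spill counters in pg_stat_replication and can be sampled the same way,
for example:

    SELECT application_name,
           spill_txns, spill_count, spill_bytes,
           stream_txns, stream_count, stream_bytes
    FROM pg_stat_replication;

A transaction streamed in several runs bumps stream_count on every run,
but stream_txns only on the first one, mirroring the spill counters.
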
From 7889a00f341645ebfe3ff6e767031ad5461fc838 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v24 09/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318fc7c..6f7bedc130 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f871a..4ba80869b9 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

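For reference, the change made to each test is the same one a user would
make: enable the streaming option on the subscription. The connection
string below is a placeholder (the TAP tests build it from the publisher
node):

    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=publisher dbname=postgres application_name=tap_sub'
        PUBLICATION tap_pub
        WITH (streaming = on);

Without streaming = on, the apply worker keeps the old behavior and only
receives complete transactions at commit time.
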
From 8f9b9d1569dfacf20b205c893f98379c9f854662 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v24 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such a sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 1b56daa4bb..5f7394f3c1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,9 +432,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94eb37d48d..2d77107c4f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * the tableam API level, but this function is called from many places,
+	 * so we need to check here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set, set a flag to indicate that a system table
+	 * scan is in progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive is aborted.  We can't directly use
+ * TransactionIdDidAbort, because after a crash such a transaction might
+ * not have been marked as aborted.  See detailed comments at snapmgr.c
+ * where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..2f52b407c6 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..9f1ecd123f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure this,
+ * we check whether CheckXidAlive has aborted after fetching a tuple from a
+ * system table.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for concurrent
+ * aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8c34935c34..9d890d3c4b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0

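The concurrent-abort protection in the patch above boils down to a single
check: after every catalog fetch through the systable_* APIs, verify that
CheckXidAlive has not aborted in the meantime. A minimal sketch of that
check, consistent with the patch (the exact error wording here is
illustrative, not the patch text):

	/*
	 * Sketch of the check run after each systable_* fetch.  If the
	 * transaction being decoded is no longer in progress and did not
	 * commit, it must have aborted, so error out; the decoding code can
	 * catch ERRCODE_TRANSACTION_ROLLBACK and skip the transaction.
	 */
	static inline void
	HandleConcurrentAbort(void)
	{
		if (TransactionIdIsValid(CheckXidAlive) &&
			!TransactionIdIsInProgress(CheckXidAlive) &&
			!TransactionIdDidCommit(CheckXidAlive))
			ereport(ERROR,
					(errcode(ERRCODE_TRANSACTION_ROLLBACK),
					 errmsg("transaction aborted during system catalog scan")));
	}

This is also why the tableam entry points above error out when called with
a valid CheckXidAlive but bsysscan unset: any catalog access that bypasses
the systable_* wrappers would silently skip this check.
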
v24/v24-0006-Bugfix-handling-of-incomplete-toast-spec-insert-.patch:

From 25ceb8fbf1534abb3bec472a742b22216d32de7c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Tue, 19 May 2020 18:55:23 +0530
Subject: [PATCH v24 06/12] Bugfix handling of incomplete toast/spec insert
 tuple

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 324 ++++++++++++------
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  39 ++-
 5 files changed, 277 insertions(+), 107 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2d77107c4f..3927448f46 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45ef6..c841687c66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2cdfb348af..fe2d0011c4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -254,6 +269,8 @@ static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
 static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static inline void ReorderBufferTXNDeleteChange(ReorderBufferTXN *txn,
+												ReorderBufferChange *change);
 
 /* ---------------------------------------
  * toast reassembly support
@@ -646,12 +663,71 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 	return txn;
 }
 
+/*
+ * Handle incomplete tuples during streaming.  If streaming is enabled then
+ * we might need to stream an in-progress transaction, but sometimes we get
+ * incomplete changes that we cannot stream until the completing change
+ * arrives, e.g. a toast table insert without the main table insert.  So
+ * this function remembers the LSN of the last complete change, and the
+ * size of the changes up to that LSN, so that if we need to stream we can
+ * stream only up to the last complete LSN.
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert)
+{
+	/* If streaming is not enabled then there is nothing to do. */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		txn = txn->toptxn;
+
+	/*
+	 * If this is the first incomplete change then remember the size of the
+	 * complete changes accumulated so far.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(txn)) &&
+		(toast_insert || IsSpecInsert(change->action)))
+		txn->complete_size = txn->total_size;
+
+	/*
+	 * If this is a toast insert then set the corresponding bit.  Basically,
+	 * both updates and inserts may insert into the toast table, and as
+	 * explained in the function header we cannot stream toast changes on
+	 * their own.  So whenever we get a toast insert we set the flag, and we
+	 * clear it again on the next insert or update on the main table.
+	 */
+	if (toast_insert)
+		txn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(txn) && IsInsertOrUpdate(change->action))
+		txn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial tuple and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		txn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		txn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If we don't have any incomplete change after this one then record
+	 * this LSN as the last complete LSN.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(txn)))
+		txn->last_complete_lsn = change->lsn;
+}
+
 /*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
@@ -660,6 +736,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	change->lsn = lsn;
 	change->txn = txn;
 
+	/* Handle the incomplete tuple if it's a toast/spec insert */
+	ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert);
+
 	Assert(InvalidXLogRecPtr != lsn);
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries++;
@@ -697,7 +776,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1412,6 +1491,30 @@ static void
 ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
 	dlist_mutable_iter iter;
+	ReorderBufferTXN *toptxn;
+
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1438,30 +1541,28 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
-		/* remove the change from it's containing list */
-		dlist_delete(&change->node);
+		/* We have truncated up to the last complete LSN, so stop. */
+		if (rbtxn_has_incomplete_tuple(toptxn) &&
+			(change->lsn > toptxn->last_complete_lsn))
+		{
+			/*
+			 * If this is a top transaction then we can reset
+			 * last_complete_lsn and complete_size, because by now we would
+			 * have streamed all the changes up to last_complete_lsn.
+			 */
+			if (txn->toptxn == NULL)
+			{
+				toptxn->last_complete_lsn = InvalidXLogRecPtr;
+				toptxn->complete_size = 0;
+			}
+			break;
+		}
 
+		/* remove the change from it's containing list */
+		ReorderBufferTXNDeleteChange(txn, change);
 		ReorderBufferReturnChange(rb, change);
 	}
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The toplevel transaction, identified by (toptxn==NULL), is marked
-	 * as streamed always, even if it does not contain any changes (that
-	 * is, when all the changes are in subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts
-	 * for XIDs the downstream is not aware of. And of course, it always
-	 * knows about the toplevel xact (we send the XID in all messages),
-	 * but we never stream XIDs of empty subxacts.
-	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
 	 * any memory. We could also keep the hash table and update it with
@@ -1473,9 +1574,15 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
-	/* also reset the number of entries in the transaction */
-	txn->nentries_mem = 0;
-	txn->nentries = 0;
+	/*
+	 * If this txn is serialized and there are no more entries on disk then
+	 * free the disk space.
+	 */
+	if (rbtxn_is_serialized(txn) && (txn->nentries == txn->nentries_mem))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
 }
 
 /*
@@ -1732,6 +1839,20 @@ ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					change->data.msg.message);
 }
 
+/*
+ * While streaming a transaction, due to incomplete tuples we cannot always
+ * stream all the changes.  So whenever we delete a change from the change
+ * list we need to update the entry counts.
+ */
+static inline void
+ReorderBufferTXNDeleteChange(ReorderBufferTXN *txn, ReorderBufferChange *change)
+{
+	/* Delete the node and decrement the nentries_mem and nentries count. */
+	dlist_delete(&change->node);
+	change->txn->nentries_mem--;
+	change->txn->nentries--;
+}
+
 /*
  * Function to store the command id and snapshot at the end of the current
  * stream so that we can reuse the same while sending the next stream.
@@ -1955,8 +2076,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						 * disk.
 						 */
 						Assert(change->data.tp.newtuple != NULL);
-
-						dlist_delete(&change->node);
+						ReorderBufferTXNDeleteChange(change->txn, change);
 						ReorderBufferToastAppendChunk(rb, txn, relation,
 													  change);
 					}
@@ -2002,8 +2122,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						specinsert = NULL;
 					}
 
-					/* and memorize the pending insertion */
-					dlist_delete(&change->node);
+					/*
+					 * Remove from the change list and memorize the pending
+					 * insertion
+					 */
+					ReorderBufferTXNDeleteChange(change->txn, change);
 					specinsert = change;
 					break;
 
@@ -2118,6 +2241,15 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
 			}
+
+			/*
+			 * If the transaction contains an incomplete tuple and this is
+			 * last complete change then stop further processing of the
+			 * transaction.
+			 */
+			if (rbtxn_has_incomplete_tuple(txn) &&
+				prev_lsn == txn->last_complete_lsn)
+				break;
 		}
 
 		/*
@@ -2515,7 +2647,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2564,7 +2696,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2587,6 +2719,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2601,8 +2734,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2610,12 +2748,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2676,7 +2822,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2860,18 +3006,29 @@
+	Size		largest_size = 0;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
+		Size		size = 0;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/*
+		 * If this transaction has some incomplete changes then consider only
+		 * the size up to the last complete LSN.
+		 */
+		if (rbtxn_has_incomplete_tuple(txn))
+			size = txn->complete_size;
+		else
+			size = txn->total_size;
+
+		/* If the current transaction is larger, remember it. */
+		if ((largest == NULL || size > largest_size) && size > 0)
+		{
 			largest = txn;
+			largest_size = size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2889,66 +3045,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
-	{
-		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
-		 */
-		txn = ReorderBufferLargestTopTXN(rb);
-
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
-
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
+	/* Loop until we reach under the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferSerializeTXN(rb, txn);
-	}
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
+			ReorderBufferSerializeTXN(rb, txn);
+			/*
+			 * After eviction, the transaction should have no entries in memory, and
+			 * should use 0 bytes for changes.
+			 */
+			Assert(txn->size == 0);
+			Assert(txn->nentries_mem == 0);
+		}
+	}
 
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
@@ -3344,10 +3480,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
-
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
 }
 
 /*
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b3e2b3f64b..a9b1aacdb1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes contain a toast insert without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes contain a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +221,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -350,6 +368,15 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of the top transaction including sub-transactions. */
+	Size		total_size;
+
+	/* Size of the complete changes. */
+	Size		complete_size;
+
+	/* LSN of the last complete change. */
+	XLogRecPtr	last_complete_lsn;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -537,7 +564,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0

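Before moving on: the incomplete-tuple bookkeeping in 0006 is easiest to
follow as a small state machine. The standalone toy below (hypothetical
types and names, not PostgreSQL code) mirrors the transitions in
ReorderBufferHandleIncompleteTuple: a toast insert or a speculative insert
makes the transaction "incomplete", the next main-table insert/update or
spec-confirm makes it complete again, and only the prefix of changes up to
last_complete_lsn is eligible for streaming.

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	/* Toy model of the incomplete-tuple flags added in 0006. */
	#define HAS_TOAST_INSERT	0x01
	#define HAS_SPEC_INSERT		0x02

	typedef struct ToyTxn
	{
		int			flags;
		uint64_t	last_complete_lsn;	/* may stream only up to here */
		uint64_t	complete_size;		/* size of changes up to that LSN */
		uint64_t	total_size;
	} ToyTxn;

	static bool
	is_incomplete(const ToyTxn *txn)
	{
		return (txn->flags & (HAS_TOAST_INSERT | HAS_SPEC_INSERT)) != 0;
	}

	/* Mirrors the transitions in ReorderBufferHandleIncompleteTuple. */
	static void
	queue_change(ToyTxn *txn, uint64_t lsn, uint64_t sz,
				 bool toast_insert, bool spec_insert, bool spec_confirm)
	{
		bool		main_change = !toast_insert && !spec_insert && !spec_confirm;

		/* First incomplete change: remember the size of the complete prefix. */
		if (!is_incomplete(txn) && (toast_insert || spec_insert))
			txn->complete_size = txn->total_size;

		if (toast_insert)
			txn->flags |= HAS_TOAST_INSERT;
		else if ((txn->flags & HAS_TOAST_INSERT) && main_change)
			txn->flags &= ~HAS_TOAST_INSERT;

		if (spec_insert)
			txn->flags |= HAS_SPEC_INSERT;
		else if (spec_confirm)
			txn->flags &= ~HAS_SPEC_INSERT;

		txn->total_size += sz;

		/* No incomplete change pending: this LSN is safe to stream up to. */
		if (!is_incomplete(txn))
			txn->last_complete_lsn = lsn;
	}

	int
	main(void)
	{
		ToyTxn		txn = {0};

		queue_change(&txn, 100, 10, false, false, false);	/* main-table insert */
		queue_change(&txn, 110, 50, true, false, false);	/* dangling toast chunk */
		printf("incomplete=%d streamable_lsn=%llu\n",
			   is_incomplete(&txn), (unsigned long long) txn.last_complete_lsn);
		queue_change(&txn, 120, 10, false, false, false);	/* main-table insert */
		printf("incomplete=%d streamable_lsn=%llu\n",
			   is_incomplete(&txn), (unsigned long long) txn.last_complete_lsn);
		return 0;
	}

Running it prints incomplete=1 streamable_lsn=100 after the dangling toast
chunk, and incomplete=0 streamable_lsn=120 once the main-table insert
arrives, which is exactly why ReorderBufferTruncateTXN stops truncating
(and streaming stops processing) at last_complete_lsn.
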
v24/v24-0008-Add-support-for-streaming-to-built-in-replicatio.patch:

From 059bdcce70378c507d1b79f6df34d4c2b5dc7ff2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 14 May 2020 21:27:46 +0530
Subject: [PATCH v24 08/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, so it can identify in-progress
transactions, and allow adding additional bits of information (e.g.
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We however must explicitly disable streaming during replication
slot creation, even if the plugin supports it.  We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere
to send the data anyway.
---
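As a quick reference for the wire format introduced below (assuming the
layout written by logicalrep_write_stream_start: a 4-byte network-order
XID followed by a one-byte first_segment flag), here is a standalone
sketch of how a consumer might parse a STREAM START ('S') message:

	#include <stdint.h>
	#include <stdio.h>

	/* Read a 4-byte big-endian integer, as pq_sendint32 writes it. */
	static uint32_t
	read_be32(const unsigned char *p)
	{
		return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
			   ((uint32_t) p[2] << 8) | (uint32_t) p[3];
	}

	int
	main(void)
	{
		/* 'S', xid = 1234, first_segment = 1, as pgoutput would send it */
		const unsigned char msg[] = {'S', 0x00, 0x00, 0x04, 0xD2, 0x01};

		if (msg[0] == 'S')
		{
			uint32_t	xid = read_be32(&msg[1]);
			int			first_segment = (msg[5] == 1);

			printf("STREAM START: xid=%u first_segment=%d\n",
				   xid, first_segment);
		}
		return 0;
	}

STREAM STOP ('E') carries no payload, STREAM COMMIT ('c') adds a flags
byte plus the commit LSN, end LSN and commit time, and STREAM ABORT ('A')
carries the toplevel XID and the subtransaction XID, matching the
read/write pairs added to proto.c below.
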
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   12 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1046 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 ++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 20 files changed, 2054 insertions(+), 40 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

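The apply-worker side spools the streamed changes of each transaction to a
file keyed by (subscription OID, toplevel XID), and records for every
subtransaction the file offset at which its changes begin; a subxact abort
then truncates the spool file back to that offset. A standalone toy of
that mechanism (hypothetical names and plain POSIX I/O, not the patch
code):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	typedef struct SubXact
	{
		uint32_t	xid;
		off_t		offset;		/* spool-file offset where this subxact starts */
	} SubXact;

	static SubXact subxacts[64];
	static int	nsubxacts = 0;

	/* Record where a new subxact's changes begin in the spool file. */
	static void
	subxact_start(uint32_t xid, int fd)
	{
		subxacts[nsubxacts].xid = xid;
		subxacts[nsubxacts].offset = lseek(fd, 0, SEEK_END);
		nsubxacts++;
	}

	/*
	 * On subxact abort, scan from the tail (aborts usually hit the most
	 * recent subxact), truncate the spool file back to its start offset,
	 * and forget it and everything queued after it.
	 */
	static void
	subxact_abort(uint32_t xid, int fd)
	{
		for (int i = nsubxacts - 1; i >= 0; i--)
		{
			if (subxacts[i].xid == xid)
			{
				if (ftruncate(fd, subxacts[i].offset) != 0)
					perror("ftruncate");
				nsubxacts = i;
				break;
			}
		}
	}

	int
	main(void)
	{
		int			fd = open("/tmp/spool.demo", O_CREAT | O_RDWR | O_TRUNC, 0600);

		(void) write(fd, "change-1;", 9);
		subxact_start(1001, fd);
		(void) write(fd, "change-2;", 9);
		subxact_abort(1001, fd);	/* spool again contains only "change-1;" */
		printf("spool length: %lld\n", (long long) lseek(fd, 0, SEEK_END));
		close(fd);
		return 0;
	}

The real code additionally persists this offset array across streamed
segments (subxact_info_write/subxact_info_read below), since each block of
the stream is processed independently.
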
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8beadbaa..95b7c24ef9 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..3349cc4bfc 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026187..9065a1be1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d7f99d9944..e843d1e658 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4139,6 +4139,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..83d0642cf3 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..e4e52f10f8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * also has to deal with aborts of both the toplevel transaction and of
+ * subtransactions. This is achieved by tracking file offsets for the
+ * subtransactions, which are then used to truncate the file of changes.
+ *
+ * The files are placed in the temporary-files directory by default, and
+ * the filenames include both the XID of the toplevel transaction and the
+ * OID of the subscription, so that different workers processing a remote
+ * transaction with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId	xid;			/* XID of the subxact */
+	off_t			offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +657,329 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the existing subxact info.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		/*
+		 * Pass missing_ok as true so that if we haven't got any changes for
+		 * the top transaction (an empty transaction) we don't raise an error.
+		 */
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, true);
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're most
+		 * likely aborting changes for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		int			fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here so just free the memory and return.
+		 */
+		if (!found)
+		{
+			/* Free the subxacts memory */
+			if (subxacts)
+				pfree(subxacts);
+
+			subxacts = NULL;
+			subxact_last = InvalidTransactionId;
+			nsubxacts = 0;
+			nsubxacts_max = 0;
+
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
+		if (fd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+		}
+
+		/* OK, truncate the file at the right offset. */
+		if (ftruncate(fd, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+		CloseTransientFile(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m",
+							path)));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +993,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1011,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1050,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1168,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1313,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1686,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1827,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1939,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleaning up files for %d streamed transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1971,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2422,570 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole, and we also include CRC32C
+ * checksum of the information.
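+ *
+ * The resulting on-disk layout is the CRC32C checksum, followed by the
+ * number of subxacts (nsubxacts), followed by the SubXactInfo array itself:
+ *
+ *   checksum | nsubxacts | SubXactInfo subxacts[nsubxacts]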
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
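+	/* (my_log2 rounds up, so nsubxacts_max is guaranteed to be >= nsubxacts) */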
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd >= 0);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're processing the same subxact as the previous
+	 * change, in which case it's already recorded, so simply ignore it.
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
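+	/*
+	 * Record the XID and the current end of the changes file, which is where
+	 * this subxact's first change will be written.
+	 */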
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs full (or not allocated) */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ *
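+ * Each on-disk record therefore looks like:
+ *
+ *   int32 len | char action | (len - 1) bytes of message payload
+ *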
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3151,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3118..1509f9b826 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit time,
+ * and with streamed transactions the commit order may be different from
+ * the order the transactions are sent in.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this,
+ * we maintain a list of XIDs (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
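+
+			/*
+			 * The subscriber requests this through the option list of the
+			 * START_REPLICATION command, e.g.
+			 *
+			 *   (proto_version '2', publication_names '"mypub"', streaming 'on')
+			 */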
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +373,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +423,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +465,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +486,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +518,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +562,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +582,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +607,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +639,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +719,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
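+	/*
+	 * The first_segment flag tells the downstream whether this is the first
+	 * chunk streamed for this transaction, so the apply side knows to create
+	 * the spool file instead of appending to an existing one.
+	 */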
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +840,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * Check whether the schema of the relation was already sent in the given
+ * streamed transaction. We expect a relatively small number of streamed
+ * transactions, so a linear search of the list is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid in the rel sync entry for which we have already sent the schema
+ * of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -753,11 +1002,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6fed3cfd23..e1344ab4cc 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -157,6 +157,12 @@ create_logical_replication_slot(char *name, char *plugin,
 											   .segment_close = wal_segment_close),
 									NULL, NULL, NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index adb7d7962e..9731b86d1f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1020,6 +1020,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..899d7e2013 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -980,7 +980,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ac1acbb27e..95132062c6 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
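+# Use a low decoding work-mem limit so that the transactions below exceed it
+# and get streamed to the subscriber.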
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rollback to savepoint was reflected on subscriber');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL, DML and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rollback to savepoint was reflected on subscriber');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0
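
A note on the TAP tests above: each script open-codes the same
wait_for_caught_up helper around poll_query_until(). If the tree's
PostgresNode module already provides an equivalent (recent versions have a
wait_for_catch_up method; the exact signature should be verified before
relying on it), the boilerplate could shrink to a sketch like:

    # hypothetical simplification; assumes PostgresNode::wait_for_catch_up
    # polls pg_stat_replication by application_name in 'replay' mode
    $node_publisher->wait_for_catch_up($appname);

which runs the same pg_current_wal_lsn() <= replay_lsn poll internally.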

v24/v24-0002-Issue-individual-invalidations-with-wal_level-lo.patch:

From ebb6dbc7cd2e8227cd085d62736f95015334c5ad Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v24 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in
memory and writes them only once, at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c        |  40 +++++++
 src/backend/access/transam/xact.c             |   7 ++
 src/backend/replication/logical/decode.c      |  16 +++
 .../replication/logical/reorderbuffer.c       | 104 +++++++++++++++---
 src/backend/utils/cache/inval.c               |  49 +++++++++
 src/include/access/xact.h                     |  13 ++-
 src/include/replication/reorderbuffer.h       |  11 ++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce75565f..7ab0d11ea9 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3af8e81af1..e576b10055 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581d0f..69c1f45ef6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,21 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+				ReorderBufferXidSetCatalogChanges(reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9509..b889edf461 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2202,6 +2216,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Queue invalidation messages as a change in the given transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2589,6 +2630,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3002,6 +3066,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..cba5b6c64b 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end,
+ *	to support decoding of in-progress transactions.  Until now it was enough
+ *	to log invalidations only at commit time, because we only decoded the
+ *	transaction once its commit record was seen.  We only need to log catalog
+ *	cache and relcache invalidations; there cannot be any active MVCC scan in
+ *	logical decoding, so snapshot invalidations need not be logged.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
 #include "storage/smgr.h"
+#include "storage/standby.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages = NULL;
+	int			nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b3816c..b822c5e4b2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..af35287896 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations; /* number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * messages */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void		ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+										 int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.23.0
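
The 0002 patch above serializes a variable number of SharedInvalidationMessage
entries after a fixed-size header, sized via MinSizeOfXactInvalidations (an
offsetof over the flexible array member). A minimal stand-alone C sketch of
that sizing scheme, using simplified stand-in types rather than the actual
backend definitions:

    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* simplified stand-in for the backend type defined in sinval.h */
    typedef struct SharedInvalidationMessage
    {
        char        payload[16];
    } SharedInvalidationMessage;

    /* mirrors xl_xact_invalidations; FLEXIBLE_ARRAY_MEMBER is [] in C99 */
    typedef struct xl_xact_invalidations
    {
        int         nmsgs;          /* number of shared inval msgs */
        SharedInvalidationMessage msgs[];
    } xl_xact_invalidations;

    #define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)

    int
    main(void)
    {
        int         nmsgs = 3;
        /* record size = fixed header + nmsgs trailing messages, matching
         * the two XLogRegisterData calls in LogLogicalInvalidations */
        size_t      sz = MinSizeOfXactInvalidations +
                         nmsgs * sizeof(SharedInvalidationMessage);
        xl_xact_invalidations *rec = malloc(sz);

        rec->nmsgs = nmsgs;
        memset(rec->msgs, 0, nmsgs * sizeof(SharedInvalidationMessage));
        printf("header %zu bytes, total %zu bytes for %d messages\n",
               (size_t) MinSizeOfXactInvalidations, sz, rec->nmsgs);
        free(rec);
        return 0;
    }

The same arithmetic is what ReorderBufferSerializeChange and
ReorderBufferRestoreChange use to write and re-read the invalidation array
from the spill files.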

v24/v24-0010-Add-TAP-test-for-streaming-vs.-DDL.patch:

From 2901b64dfa1bf8f23bbec43f8537f2f67fe1c1af Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v24 10/12] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check data was replicated after streamed transaction with DDL');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0
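
The subscription in the test above opts in explicitly with WITH
(streaming=true). For ad-hoc experiments it should also be possible to flip
the option on an existing subscription (assuming the subscription-option
patch earlier in this series is applied); a hedged SQL sketch:

    ALTER SUBSCRIPTION tap_sub SET (streaming = on);
    ALTER SUBSCRIPTION tap_sub SET (streaming = off);  -- back to spilling

With streaming off, transactions exceeding logical_decoding_work_mem are
spilled to disk on the publisher and sent only at commit.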

v24/v24-0012-Add-streaming-option-in-pg_dump.patch:

From c1c1e920db284580106e270053dc73fb5fe788c7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v24 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index a4e949c636..debb52af49 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb3a9..af64270c55 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char	   *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0
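
Given the dumpSubscription() change above, a subscription whose
pg_subscription.substream is 't' should round-trip through pg_dump with the
new clause, roughly like this sketch (connection string elided, surrounding
WITH options depending on the dump options in effect):

    CREATE SUBSCRIPTION tap_sub CONNECTION '...' PUBLICATION tap_pub WITH (connect = false, slot_name = 'tap_sub', streaming = on);

The clause is omitted when substream is 'f', so dumps of non-streaming
subscriptions are unchanged.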

v24/v24-0003-Extend-the-output-plugin-API-with-stream-methods.patch:

From bbd0b76f1ada9c6d5e53aef22f6e96ac53ce4017 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v24 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a
chunk of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 213 +++++++++++++
 src/backend/replication/logical/logical.c | 365 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 811 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe620..1b56daa4bb 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and one optional callback
+    (<function>stream_message_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds limit defined by <varname>logical_decoding_work_mem</varname> setting.
+    At that point the largest toplevel transaction (measured by amount of memory
+    currently used for decoded changes) is selected and streamed.  However, in
+    some cases we still have to spill to the disk even if streaming is enabled
+    because if we cross the memory limit but we still have not decoded the
+    complete tuple e.g. only decoded toast table insert but not the main table
+    insert.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index dc69e5ce5f..0cff1ac393 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the change/commit/abort/start/stop
+	 * callbacks; the message and truncate callbacks are optional, similar
+	 * to regular output plugins. We enable streaming when at least one of
+	 * the methods is defined, so that missing required methods are detected.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,321 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287896..65814af9f5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
 struct ReorderBuffer
 {
 	/*
@@ -392,6 +440,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0

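As a quick illustration of the API extension above: an output plugin opting
into streaming registers the new callbacks next to the existing ones in its
_PG_output_plugin_init. A minimal sketch (the my_* handlers are hypothetical;
stream_truncate_cb and stream_message_cb remain optional):

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	cb->startup_cb = my_startup;
	cb->begin_cb = my_begin;
	cb->change_cb = my_change;
	cb->commit_cb = my_commit;
	cb->shutdown_cb = my_shutdown;

	/* streaming of large in-progress transactions */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_change_cb = my_stream_change;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	/* optional: cb->stream_truncate_cb, cb->stream_message_cb */
}

The wrappers above enforce exactly this contract: stream_start_cb,
stream_stop_cb, stream_change_cb, stream_abort_cb and stream_commit_cb are
required once streaming is enabled, while the truncate and message callbacks
may be left unset.
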
#326Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#317)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 22, 2020 at 6:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
1.
+ /*
+ * If this is a toast insert then set the corresponding bit.  Otherwise, if
+ * we have toast insert bit set and this is insert/update then clear the
+ * bit.
+ */
+ if (toast_insert)
+ toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {

Here, it might be better to add a comment on why we expect only
Insert/Update. Also, it might be better to add an assert for
other operations.

I have added comments on why we clear the flag on Insert/Update.
But I don't think we only expect insert/update; we might get a
toast delete too, right? Because in a toast update we will do a toast
delete + toast insert. So when we get a toast delete we just don't
want to do anything.

Okay, that makes sense.
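
(For reference, the ChangeIsInsertOrUpdate test used above would be a small
macro along these lines; a sketch only, and the speculative-insert case is an
assumption on my part:)

#define ChangeIsInsertOrUpdate(action) \
	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
	 ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
	 ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))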

2.
@@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
* disk.
*/
dlist_delete(&change->node);
- ReorderBufferToastAppendChunk(rb, txn, relation,
-   change);
+ ReorderBufferToastAppendChunk(rb, txn, relation,
+   change);
}

This seems to be a spurious change.

Done

2. There is a bug fix in handling the stream abort in 0008 (earlier it
was 0006).

The code changes look fine but it is not clear what was the exact
issue. Can you explain?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#327Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#318)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 22, 2020 at 6:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Review comments:
------------------------------
1.
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
TransactionId xid,
}

case REORDER_BUFFER_CHANGE_MESSAGE:
- rb->message(rb, txn, change->lsn, true,
- change->data.msg.prefix,
- change->data.msg.message_size,
- change->data.msg.message);
+ if (streaming)
+ rb->stream_message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);
+ else
+ rb->message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);

Don't we need to set any_data_sent flag while streaming messages as we
do for other types of changes?

I think any_data_sent was added to avoid sending an abort to the
subscriber if we haven't sent any data, but this is not complete, as
the output plugin can also decide not to send anything. So I think
this should not be done as part of this patch and can be done
separately; I think there is already a thread for handling the
same [1]

Hmm, but prior to this patch, we never used to send (empty) aborts, but
now that will be possible. It is probably okay to deal with that in
the other patch you mentioned, but I felt at least any_data_sent would
work for some cases. OTOH, it appears to be a half-baked solution, so
we should probably refrain from adding it. BTW, how does the pgoutput
plugin deal with it? I see that apply_handle_stream_abort will
unconditionally try to unlink the file and it will probably fail.
Have you tested this scenario after your latest changes?

Yeah, I see. I think this is a problem, but it exists without my
latest change as well: if pgoutput ignores some changes because they
are not published, then we will see a similar error. Shall we handle
the ENOENT error case from unlink?

Isn't this problem only for the subxact file, as we anyway create the
changes file as part of the start stream message, which should have come
after the abort? If so, can't we detect whether the subxact file exists,
probably by using nsubxacts or something like that? Can you please try
once to reproduce this scenario, to ensure that we are not missing
anything?
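
(If we do go the ENOENT route, the usual idiom would be something like the
following; a sketch only, with the path construction assumed:)

	if (unlink(path) < 0 && errno != ENOENT)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not remove file \"%s\": %m", path)));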

8.
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * TOCHECK: Mark toplevel transaction as having catalog changes too
+ * if one of its children has.
+ */
+ if (txn->toptxn != NULL)
+ txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}

Why are we marking top transaction here?

We need to mark the top transaction to decide whether to build the
tuplecid hash or not. In non-streaming mode, we only send at commit
time, and at commit time we know whether the top transaction has any
catalog changes based on the invalidation messages, so we mark the top
transaction there in DecodeCommit. Since here we are not waiting until
commit, we need to mark the top transaction as soon as we mark any of
its child transactions.

But how does it help? We use this flag (via
ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
anyway done in DecodeCommit and that too after setting this flag for
the top transaction if required. So, how will it help in setting it
while processing for subxid. Also, even if we have to do it won't it
add the xid needlessly in builder->committed.xip array?

In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
to build the tuplecid hash or not based on whether it has catalog
changes or not.

Okay, but you haven't answered the second part of the question: "won't
it add the xid of top transaction needlessly in builder->committed.xip
array, see function SnapBuildCommitTxn?" IIUC, this can happen
without the patch as well, because DecodeCommit also sets the flags just
based on invalidation messages, irrespective of whether the messages
are generated by the top transaction or not, is that right? If this is
correct, please explain why we are doing so in the comments.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#328Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#326)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 22, 2020 at 6:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
1.
+ /*
+ * If this is a toast insert then set the corresponding bit.  Otherwise, if
+ * we have toast insert bit set and this is insert/update then clear the
+ * bit.
+ */
+ if (toast_insert)
+ toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {

Here, it might be better to add a comment on why we expect only
Insert/Update. Also, it might be better to add an assert for
other operations.

I have added comments on why we clear the flag on Insert/Update.
But I don't think we only expect insert/update; we might get a
toast delete too, right? Because in a toast update we will do a toast
delete + toast insert. So when we get a toast delete we just don't
want to do anything.

Okay, that makes sense.

2.
@@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
* disk.
*/
dlist_delete(&change->node);
- ReorderBufferToastAppendChunk(rb, txn, relation,
-   change);
+ ReorderBufferToastAppendChunk(rb, txn, relation,
+   change);
}

This seems to be a spurious change.

Done

2. There is a bug fix in handling the stream abort in 0008 (earlier it
was 0006).

The code changes look fine but it is not clear what was the exact
issue. Can you explain?

Basically, in the case of an empty subtransaction, we were reading the
subxacts info, but when we could not find the subxid in it we were not
releasing the memory. So the next subxact_info_read expects subxacts to
have been freed already, but we had not freed it in that !found case.
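
(In code terms, the fix amounts to something like this in the !found path;
a sketch using the patch's subxact bookkeeping variables, exact names may
differ:)

	if (!found)
	{
		/*
		 * Release the array read by subxact_info_read, so the next
		 * read starts from a clean slate.
		 */
		if (subxacts)
			pfree(subxacts);
		subxacts = NULL;
		nsubxacts = 0;
		nsubxacts_max = 0;
		return;
	}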

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#329Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#322)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, May 25, 2020 at 8:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

4.
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
{
..

For this and other places in a patch like in function
stream_open_file(), instead of using TopMemoryContext, can we consider
using a new memory context LogicalStreamingContext or something like
that. We can create LogicalStreamingContext under TopMemoryContext. I
don't see any need of using TopMemoryContext here.

But when will we delete/reset the LogicalStreamingContext?

Why can't we reset it at each stream stop message?

because we are planning to keep this memory for as long as the worker is
alive, so it is supposed to be in the top memory context.

Which part of the allocation do we want to keep for as long as the
worker is alive? Why do we need the subxact-related memory for the
worker's whole lifetime? As it stands, after reading the subxact info
(subxact_info_read), we need to ensure it is freed after use, which is
why we have to remember to pfree it at various places.

I think we should explore the possibility of switching to this new
context in the start stream message and resetting it in the stop stream
message. That might help avoid MemoryContextSwitchTo(TopMemoryContext)
at various places.

If we create any other context
with the same life span as TopMemoryContext then what is the point?

It is helpful for debugging. It is recommended that we don't use the
top memory context unless it is really required. Read about it in
src/backend/utils/mmgr/README.
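
(A sketch of the suggestion; the context name is as proposed above, the
helper names are hypothetical:)

	static MemoryContext LogicalStreamingContext = NULL;

	static void
	streaming_context_start(void)
	{
		/*
		 * Created lazily; lives under TopMemoryContext but shows up as
		 * a separate context when debugging memory usage.
		 */
		if (LogicalStreamingContext == NULL)
			LogicalStreamingContext =
				AllocSetContextCreate(TopMemoryContext,
									  "LogicalStreamingContext",
									  ALLOCSET_DEFAULT_SIZES);
	}

	static void
	streaming_context_stop(void)
	{
		/*
		 * Frees the subxact info and any other per-stream allocations
		 * in one go, instead of individual pfree calls.
		 */
		MemoryContextReset(LogicalStreamingContext);
	}

Allocations made between stream start and stop would then simply be done
under MemoryContextSwitchTo(LogicalStreamingContext).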

8.
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)

Do we really need to have the checksum for temporary files? I have
checked a few other similar cases like SharedFileSet stuff for
parallel hash join but didn't find them using checksums. Can you also
once see other usages of temporary files and then let us decide if we
see any reason to have checksums for this?

Yeah, even I can see other places checksum is not used.

So, unless someone speaks up before you are ready for the next version
of the patch, can we remove it?

Another point is that we don't seem to be doing this for the 'changes'
file (see stream_write_change), so I am not sure there is any sense in
writing a checksum for the subxact file.

I can see there is a comment atop this function:

* XXX The subxact file includes CRC32C of the contents. Maybe we should
* include something like that here too, but doing so will not be as
* straightforward, because we write the file in chunks.

You can remove this comment as well. I don't know how advantageous it
is to checksum temporary files. We can anyway add it later if there
is a reason for doing so.

12.
maybe_send_schema()
{
..
+ if (in_streaming)
+ {
+ /*
+ * TOCHECK: We have to send schema after each catalog change and it may
+ * occur when streaming already started, so we have to track new catalog
+ * changes somehow.
+ */
+ schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
..
..
}

I think it is good to verify/test once what this comment says, but as
per the code we should be sending the schema after each catalog change,
as we invalidate the streamed_txns list in rel_sync_cache_relation_cb,
which must be called during relcache invalidation. Do we see any
problem with that mechanism?

I have tested this, I think we are already sending the schema after
each catalog change.

Then remove "TOCHECK" in the above comment.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#330Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#328)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

2. There is a bug fix in handling the stream abort in 0008 (earlier it
was 0006).

The code changes look fine but it is not clear what was the exact
issue. Can you explain?

Basically, in case of an empty subtransaction, we were reading the
subxacts info but when we could not find the subxid in the subxacts
info we were not releasing the memory. So on next subxact_info_read
it will expect that subxacts should be freed but we did not free it in
that !found case.

Okay, on looking at it again, the same code exists in
subxact_info_write as well. It is better to have a function for it.
Can we have a structure like SubXactContext for all the variables used
for subxact? As mentioned earlier I find the allocation/deallocation
of subxacts a bit ad-hoc, so there will always be a chance that we can
forget to free it. Having it allocated in memory context which we can
reset later might reduce that risk. One idea could be that we have a
special memory context for start and stop messages which can be used
to allocate the subxacts there. In case of commit/abort, we can allow
subxacts information to be allocated in ApplyMessageContext which is
reset at the end of each protocol message.
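
(A sketch of such a structure; the field names are assumptions, taken from
the variables the patch currently passes around, and SubXactInfo is assumed
to be the per-subxact entry holding the xid and its file offset:)

	typedef struct SubXactContext
	{
		SubXactInfo *subxacts;		/* array read from the subxact file */
		uint32		nsubxacts;		/* number of valid entries */
		uint32		nsubxacts_max;	/* allocated length of subxacts */
	} SubXactContext;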

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#331Mahendra Singh Thalor
mahi6run@gmail.com
In reply to: Amit Kapila (#330)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, 26 May 2020 at 16:46, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

2. There is a bug fix in handling the stream abort in 0008 (earlier it
was 0006).

The code changes look fine but it is not clear what was the exact
issue. Can you explain?

Basically, in case of an empty subtransaction, we were reading the
subxacts info but when we could not find the subxid in the subxacts
info we were not releasing the memory. So on next subxact_info_read
it will expect that subxacts should be freed but we did not free it in
that !found case.

Okay, on looking at it again, the same code exists in
subxact_info_write as well. It is better to have a function for it.
Can we have a structure like SubXactContext for all the variables used
for subxact? As mentioned earlier I find the allocation/deallocation
of subxacts a bit ad-hoc, so there will always be a chance that we can
forget to free it. Having it allocated in memory context which we can
reset later might reduce that risk. One idea could be that we have a
special memory context for start and stop messages which can be used
to allocate the subxacts there. In case of commit/abort, we can allow
subxacts information to be allocated in ApplyMessageContext which is
reset at the end of each protocol message.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Hi all,

On top of the v16 patch set [1], I did some testing with DDLs and DMLs to
measure WAL size and performance. Below is the testing summary.

*Test parameters:*
wal_level = 'logical'
max_connections = '150'
wal_receiver_timeout = '600s'
max_wal_size = '2GB'
min_wal_size = '2GB'
autovacuum = 'off'
checkpoint_timeout = '1d'
*Test results:*

SN  Operation       Patch    CREATE index                      Add col int/date                  Add col text
                             LSN diff  time       %change      LSN diff  time       %change      LSN diff  time       %change
1   1 DDL           without  17728     0.89116    1.624548     976       0.764393   11.475409    33904     0.80044    2.80792
                    with     18016     0.804868                1088      0.763602                34856     0.787108
2   2 DDL           without  19872     0.860348   2.73752      1632      0.763199   13.7254902   34560     0.806086   3.078703
                    with     20416     0.839065                1856      0.733147                35624     0.829281
3   3 DDL           without  22016     0.894891   3.63372093   2288      0.776871   14.685314    35216     0.803493   3.339391186
                    with     22816     0.828028                2624      0.737177                36392     0.800194
4   4 DDL           without  24160     0.901686   4.4701986    2944      0.768445   15.217391    35872     0.77489    3.590544
                    with     25240     0.887143                3392      0.768382                37160     0.82777
5   5 DDL           without  26328     0.901686   4.9832877    3600      0.751879   15.555555    36528     0.817928   3.832676
                    with     27640     0.914078                4160      0.74709                 37928     0.820621
6   6 DDL           without  28472     0.936385   5.5071649    4256      0.745179   15.78947368  37184     0.797043   4.066265
                    with     30040     0.958226                4928      0.725321                38696     0.814535
7   8 DDL           without  32760     1.0022203  6.422466     5568      0.757468   16.091954    38496     0.83207    4.509559
                    with     34864     0.966777                6464      0.769072                40232     0.903604
8   11 DDL          without  50296     1.0022203  5.662478     7536      0.748332   16.666666    40464     0.822266   5.179913
                    with     53144     0.966777                8792      0.750553                42560     0.797133
9   15 DDL          without  58896     1.267253   5.662478     10184     0.776875   16.496465    43112     0.821916   5.84524
                    with     62768     1.27234                 11864     0.746844                45632     0.812567
10  1 DDL & 3 DML   without  18240     0.812551   1.6228       1192      0.771993   10.067114    34120     0.849467   2.8113599
                    with     18536     0.819089                1312      0.785117                35080     0.855456
11  3 DDL & 5 DML   without  23656     0.926616   3.4832606    2656      0.758029   13.55421687  35584     0.829377   3.372302
                    with     24480     0.915517                3016      0.797206                36784     0.839176
12  10 DDL & 5 DML  without  52760     1.101005   4.958301744  7288      0.763065   16.02634468  40216     0.837843   4.993037
                    with     55376     1.105241                8456      0.779257                42224     0.835206
13  10 DML          without  1008      0.791091   6.349206     1008      0.81105    6.349206     1008      0.78817    6.349206
                    with     1072      0.807875                1072      0.771113                1072      0.759789

(LSN diff is in bytes, time in seconds; %change is the WAL increase of the
patched run relative to the unpatched run.)
To see all operations, please see the test_results spreadsheet [2].

*Summary:*
Basically, we are writing a per-command invalidation message, and to test
that I tried different combinations of DDL and DML operations. I have not
observed any performance degradation with the patch. For "create index"
DDLs, the WAL increase is 1-7% for 1-15 DDLs. For "add col int/date" DDLs,
it is 11-17% for 1-15 DDLs, and for "add col text" DDLs, it is 2-6% for
1-15 DDLs. For mixed (DDL & DML) runs, it is 2-10%.

As for why we are seeing 11-17% extra WAL in the int/date case: the amount
of extra WAL is not actually very high, but the total WAL generated by an
"add column int/date" is only ~1000 bytes, so an additional ~100 bytes comes
to around 10%; for "add column text" the total is ~35000 bytes, so the
percentage is lower. For text, those ~35000 bytes are due to TOAST.

[1] /messages/by-id/CAFiTN-vnnrk580ucZVYnub_UQ-ayROew8fQ2Yn5aFYMeF0U03w@mail.gmail.com
[2] https://docs.google.com/spreadsheets/d/1g11MrSd_I39505OnGoLFVslz3ykbZ1nmfR_gUiE_O9k/edit?usp=sharing

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

#332Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#325)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 26, 2020 at 7:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 25, 2020 at 8:48 PM Erik Rijkers <er@xs4all.nl> wrote:

Hi,

I am not able to extract all files correctly from this tar.

The first file v24-0001-* seems to have some 'binary' junk at the top.

(The other 11 files seem normally readably)

Okay, sending again.

While reviewing/testing I have found a couple of problems in 0005 and
0006 which I have fixed in the attached version.

In 0005: Basically, in the latest version we start a stream or begin the
txn only if there are any changes (since we do this inside the while
loop), so we must also send stream_stop/commit only when we have
actually started the stream.

In 0006: If we are streaming the serialized changes and there are still
a few incomplete changes, then currently we do not delete the spilled
file; but the spill file contains all the changes of the transaction,
because there is no way to partially truncate it, so the next stream
would try to resend those. I have fixed this by sending the spilled
transaction as soon as its changes are complete, so ideally we can
always delete the spilled file. It is also a better solution because
this transaction was already spilled once, and that happened because we
could not stream it; so we had better stream it at the first
opportunity, which reduces the replay lag, which is our whole purpose
here.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v25.tar (application/x-tar)

v25/v25-0009-Enable-streaming-for-all-subscription-TAP-tests.patch:

From 88a9d5019be509492b2593f2939fe0213da9c40b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v25 09/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318fc7c..6f7bedc130 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f871a..4ba80869b9 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

v25/v25-0007-Track-statistics-for-streaming.patch:

From 20ee0f934961866643fe0e4c65d874323ea604a8 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Tue, 19 May 2020 19:08:16 +0530
Subject: [PATCH v25 07/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 33 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 12 +++++++
 src/backend/replication/walsender.c           | 32 +++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 49d4bb13b9..0fc896ca7e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2453,6 +2453,39 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Amount of decoded transaction data spilled to disk.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_txns</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of in-progress transactions streamed to subscriber after
+       memory used by logical decoding exceeds <literal>logical_work_mem</literal>.
+       Streaming only works with toplevel transactions (subtransactions can't
+       be streamed independently), so the counter does not get incremented for
+       subtransactions.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times in-progress transactions were streamed to subscriber.
+       Transactions may get streamed repeatedly, and this counter gets incremented
+       on every such invocation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_bytes</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Amount of decoded in-progress transaction data streamed to subscriber.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 56420bbc9d..9f509fbc21 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b79752cae4..ee922e2271 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -348,6 +348,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3526,6 +3530,14 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferFreeSnap(rb, txn->snapshot_now);
 	}
 
+	/* Update the stream statistics. */
+	rb->streamCount += 1;
+	rb->streamBytes += (rbtxn_has_incomplete_tuple(txn)) ?
+						txn->complete_size : txn->total_size;
+
+	/* Don't consider already streamed transaction. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	/*
 	 * Access the main routine to decode the changes and send to output plugin.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 86847cbb54..adb7d7962e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1353,7 +1353,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1374,7 +1374,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or streamed to
+	 * subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2423,6 +2424,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3258,7 +3262,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3316,6 +3320,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3341,6 +3348,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3443,6 +3453,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3691,11 +3706,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 61f2c2f5b4..7869f721da 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4004fd6684..fa8c077b02 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -551,15 +551,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b813e32215..cf22f8a038 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0
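
For anyone exercising this while reviewing: the three new stream_* columns
can be polled from any client. A minimal standalone libpq program (not part
of the patch; file name is made up, build with "cc stream_stats.c -lpq";
connection parameters are taken from the usual PG* environment variables)
might look like this:

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("");
        PGresult   *res;
        int         i;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        res = PQexec(conn,
                     "SELECT application_name, stream_txns, stream_count, "
                     "stream_bytes FROM pg_stat_replication");
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
            PQclear(res);
            PQfinish(conn);
            return 1;
        }

        /* one row per walsender, mirroring the view definition above */
        for (i = 0; i < PQntuples(res); i++)
            printf("%s: txns=%s count=%s bytes=%s\n",
                   PQgetvalue(res, i, 0), PQgetvalue(res, i, 1),
                   PQgetvalue(res, i, 2), PQgetvalue(res, i, 3));

        PQclear(res);
        PQfinish(conn);
        return 0;
    }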

v25/v25-0001-Immediately-WAL-log-assignments.patch

From 975234c54873bf67f34488809b9d42a8eaf02af3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v25 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction a subxact belongs to, in order to decode all the
changes. Until now that assignment might be delayed until commit,
due to the caching of subxids (PGPROC_MAX_CACHED_SUBXIDS),
preventing features that require incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is
still required to avoid overflow in the hot standby snapshot.
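
To make the record layout concrete: the patch appends a tiny optional chunk
to the record header data, in the same way the replication origin is
attached. The following standalone sketch (not part of the patch; constants
copied from the diff below) shows the assumed one-byte-id-plus-XID encoding
and its round trip:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define XLR_BLOCK_ID_TOPLEVEL_XID	252

    typedef uint32_t TransactionId;

    int
    main(void)
    {
        char        scratch[8];
        char       *p = scratch;
        TransactionId xid = 12345;	/* toplevel xact of some subxact */
        TransactionId decoded = 0;

        /* encode side, as in XLogRecordAssemble() below */
        *(p++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
        memcpy(p, &xid, sizeof(TransactionId));

        /* decode side, as in DecodeXLogRecord() below */
        p = scratch;
        if ((unsigned char) *(p++) == XLR_BLOCK_ID_TOPLEVEL_XID)
            memcpy(&decoded, p, sizeof(TransactionId));

        printf("toplevel xid = %u\n", (unsigned) decoded);
        return 0;
    }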
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 39 ++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd30b62d36..3af8e81af1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and its assignment must not yet have been written to WAL */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09e..53be2b3059 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5995798b58..560ec27fa0 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1195,6 +1195,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1233,6 +1234,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..122c581d0f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel xid is valid, we need to assign the subxact to the
+	 * toplevel xact. This must be done for all records, hence we do it before
+	 * the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04babc2..8645b3816c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d..26426cc779 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index c21b0ba972..83170a663c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -308,6 +310,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0
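
A side note on the XLogSetRecordFlags() hunk above: switching from plain
assignment to OR-ing matters because a single record may now carry both the
origin flag and the toplevel-XID flag. A trivial standalone illustration
(flag values copied from the xlog.h hunk) of why the flags must accumulate:

    #include <stdio.h>
    #include <stdint.h>

    #define XLOG_INCLUDE_ORIGIN		0x01
    #define XLOG_MARK_UNIMPORTANT	0x02
    #define XLOG_INCLUDE_XID		0x04

    int
    main(void)
    {
        uint8_t     flags = 0;

        /* with plain '=', the second call would clobber the first flag */
        flags |= XLOG_INCLUDE_ORIGIN;
        flags |= XLOG_INCLUDE_XID;

        printf("origin set: %d, xid set: %d\n",
               (flags & XLOG_INCLUDE_ORIGIN) != 0,
               (flags & XLOG_INCLUDE_XID) != 0);	/* both print 1 */
        return 0;
    }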

v25/v25-0008-Add-support-for-streaming-to-built-in-replicatio.patch

From e165cbebf432d3156173937b42ee25533a9e098c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 14 May 2020 21:27:46 +0530
Subject: [PATCH v25 08/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, so as to identify
in-progress transactions, and to allow adding additional bits of
information (e.g. XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere to
send the data anyway.
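
On the apply side (worker.c below) a streamed transaction is spooled into a
per-(subscription, XID) changes file; for each subxact we remember the
offset of its first change, so a later stream abort of that subxact boils
down to an ftruncate() at the remembered offset. To try the feature end to
end, the subscription must be created with the new option, e.g.
WITH (streaming = on). A self-contained sketch of just the spool/truncate
technique (file name and payloads are made up, error handling trimmed):

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int
    main(void)
    {
        const char *path = "stream-demo.changes";
        int         fd;
        off_t       sub_offset;
        char        buf[64];
        ssize_t     n;

        /* spool some changes belonging to the toplevel xact */
        fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
        if (fd < 0)
            return 1;
        n = write(fd, "top-change-1;", 13);

        /* remember where the subxact's first change starts */
        sub_offset = lseek(fd, 0, SEEK_END);
        n = write(fd, "sub-change-1;sub-change-2;", 26);

        /* subxact aborts: discard its changes by truncating the file */
        if (ftruncate(fd, sub_offset) != 0)
            return 1;

        /* "replay" at commit: read back whatever survived */
        lseek(fd, 0, SEEK_SET);
        n = read(fd, buf, sizeof(buf) - 1);
        buf[n > 0 ? n : 0] = '\0';
        printf("replayed: %s\n", buf);	/* prints: replayed: top-change-1; */

        close(fd);
        unlink(path);
        return 0;
    }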
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   12 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1046 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 ++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 20 files changed, 2054 insertions(+), 40 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8beadbaa..95b7c24ef9 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..3349cc4bfc 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026187..9065a1be1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d7f99d9944..e843d1e658 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4139,6 +4139,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..83d0642cf3 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
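
For readers decoding the protocol by hand: a STREAM START message as written
above is just the action byte 'S', the 4-byte toplevel XID, and a one-byte
first-segment flag, with integers in network byte order (which is what
pq_sendint32 emits). A standalone round-trip sketch (not part of the patch),
assuming that layout:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <arpa/inet.h>	/* htonl/ntohl */

    int
    main(void)
    {
        unsigned char msg[6];
        uint32_t    xid = 4242;
        uint32_t    net;

        /* encode, mirroring logicalrep_write_stream_start() */
        msg[0] = 'S';			/* action STREAM START */
        net = htonl(xid);
        memcpy(&msg[1], &net, 4);	/* transaction ID */
        msg[5] = 1;			/* first_segment flag */

        /* decode, mirroring logicalrep_read_stream_start() */
        memcpy(&net, &msg[1], 4);
        printf("action=%c xid=%u first=%d\n",
               msg[0], (unsigned) ntohl(net), msg[5]);
        return 0;
    }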
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..e4e52f10f8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, applying a streamed transaction
+ * has to handle aborts of both the toplevel transaction and its subxacts.
+ * This is achieved by tracking the offset of each subxact's first change,
+ * which is then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID
+ * of the subscription. This is necessary so that different workers
+ * processing a remote transaction with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -110,12 +134,57 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing a streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;						/* XID of the subxact */
+	off_t           offset;					/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo * subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +256,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +657,329 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify the apply handlers that we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, restore the subxact info serialized
+	 * at the previous stream_stop message.
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+	{
+		/*
+		 * Pass missing_ok as true so that we don't raise an error if we got
+		 * no changes for the top transaction (i.e. it was empty).
+		 */
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, true);
+
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		int			fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here so just free the memory and return.
+		 */
+		if (!found)
+		{
+			/* Free the subxacts memory */
+			if (subxacts)
+				pfree(subxacts);
+
+			subxacts = NULL;
+			subxact_last = InvalidTransactionId;
+			nsubxacts = 0;
+			nsubxacts_max = 0;
+
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
+		if (fd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+		}
+
+		/* OK, truncate the file at the right offset. */
+		if (ftruncate(fd, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+		CloseTransientFile(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the handlers called via apply_dispatch are aware we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +993,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1011,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1050,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1168,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1313,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1686,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1827,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1939,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1971,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2422,570 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole, and we also include a CRC32C
+ * checksum of the information.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ *
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* compute the checksum */
+	INIT_CRC32C(checksum);
+	COMP_CRC32C(checksum, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum, (char *) subxacts, len);
+	FIN_CRC32C(checksum);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables, and while
+ * reading the information verify the checksum.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	uint32		checksum;
+	uint32		checksum_new;
+	Size		len;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read the checksum */
+	if (read(fd, &checksum, sizeof(checksum)) != sizeof(checksum))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/* subxacts are long-lived */
+	oldctx = MemoryContextSwitchTo(TopMemoryContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/* recompute the checksum */
+	INIT_CRC32C(checksum_new);
+	COMP_CRC32C(checksum_new, (char *) &nsubxacts, sizeof(nsubxacts));
+	COMP_CRC32C(checksum_new, (char *) subxacts, len);
+	FIN_CRC32C(checksum_new);
+
+	if (checksum_new != checksum)
+		ereport(ERROR,
+				(errmsg("checksum failure when reading subxacts")));
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ *
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd >= 0);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're processing the same subxact as in the previous
+	 * call, so just ignore it (we only record the first change's offset).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+		oldctx = MemoryContextSwitchTo(TopMemoryContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
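
For orientation, this is roughly what the resulting paths look like. The
sketch assumes the usual "pgsql_tmp" temp-file prefix and a
default-tablespace temp directory of "base/pgsql_tmp"; the pid, OID and
XID values are made up:

#include <stdio.h>

int
main(void)
{
	char		path[1024];
	int			pid = 12345;	/* MyProcPid */
	unsigned	subid = 16394;	/* subscription OID */
	unsigned	xid = 512;		/* toplevel transaction id */

	snprintf(path, sizeof(path), "%s/%s%d-%u-%u.changes",
			 "base/pgsql_tmp", "pgsql_tmp", pid, subid, xid);
	printf("%s\n", path);	/* base/pgsql_tmp/pgsql_tmp12345-16394-512.changes */

	snprintf(path, sizeof(path), "%s/%s%d-%u-%u.subxacts",
			 "base/pgsql_tmp", "pgsql_tmp", pid, subid, xid);
	printf("%s\n", path);	/* base/pgsql_tmp/pgsql_tmp12345-16394-512.subxacts */
	return 0;
}
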
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - don't report an error for a missing file if the flag is true.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Clean up the XID from the array - find the XID in the array and
+	 * remove it by moving the last element into its place. The array is
+	 * bound to be fairly small (the maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions), so simply loop
+	 * through the array to find the index of the XID.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible leftovers after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry in the array into the removed entry's place. We
+	 * don't keep the streamed transactions sorted or anything - we only
+	 * expect a few of them in progress (max_connections +
+	 * max_prepared_xacts), so a linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
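
The XID bookkeeping at the end is the classic unordered-array delete:
linear search, then overwrite the hole with the last element. A minimal
standalone sketch of just that part:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t xids[8] = {100, 200, 300, 400};
static int	nxids = 4;

static bool
remove_xid(uint32_t xid)
{
	for (int i = 0; i < nxids; i++)
	{
		if (xids[i] == xid)
		{
			xids[i] = xids[nxids - 1];	/* order is not preserved */
			nxids--;
			return true;
		}
	}
	return false;				/* not found: fine after a crash */
}

int
main(void)
{
	remove_xid(200);
	for (int i = 0; i < nxids; i++)
		printf("%u ", xids[i]);	/* prints: 100 400 300 */
	printf("\n");
	return 0;
}
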
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. a sorted array,
+		 * to speed up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (not counting the
+ * length field itself), an action code (identifying the message type) and
+ * the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
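
The record format is simple enough to round-trip in a standalone program:
a 4-byte length (action byte plus payload, not counting the length field),
one action byte, then the payload. A sketch, with tmpfile() standing in
for the transient file and a made-up payload:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void
write_change(FILE *f, char action, const char *buf, int buflen)
{
	int			len = buflen + (int) sizeof(char);	/* payload + action byte */

	fwrite(&len, sizeof(len), 1, f);
	fwrite(&action, sizeof(action), 1, f);
	fwrite(buf, 1, buflen, f);
}

int
main(void)
{
	const char	payload_in[] = "fake-insert-payload";
	FILE	   *f = tmpfile();
	int			len;
	char		action;
	char	   *payload;
	int			paylen;

	if (!f)
		return 1;

	write_change(f, 'I', payload_in, (int) strlen(payload_in));
	rewind(f);

	if (fread(&len, sizeof(len), 1, f) != 1 ||
		fread(&action, sizeof(action), 1, f) != 1)
		return 1;

	/* len includes the action byte, so the payload is one byte shorter */
	paylen = len - (int) sizeof(char);
	payload = malloc(paylen);
	if (fread(payload, 1, paylen, f) != (size_t) paylen)
		return 1;

	printf("action=%c payload=%.*s\n", action, paylen, payload);
	free(payload);
	fclose(f);
	return 0;
}
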
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3151,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3118..1509f9b826 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit time,
+ * and with streamed transactions the commit order may be different from
+ * the order the transactions are sent in.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this,
+ * we maintain a list of xids (streamed_txns) to which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficiently new protocol
+		 * version, and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +373,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may not be applied at all later (on abort),
+	 * the regular transactions won't see their effects until then, and they
+	 * may be applied in an order we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +423,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +465,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +486,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +518,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +562,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +582,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +607,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +639,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +719,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
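
The start/stop callbacks amount to a two-state guard around each streamed
chunk; nesting is not allowed. A standalone sketch of just the invariant,
with the logicalrep_write_* calls reduced to printfs:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool in_streaming = false;

static void
stream_start(uint32_t xid, bool first_segment)
{
	assert(!in_streaming);		/* no nested streaming blocks */
	/* ... send the stream-start message with xid/first_segment ... */
	printf("start xid=%u first=%d\n", xid, first_segment);
	in_streaming = true;
}

static void
stream_stop(void)
{
	assert(in_streaming);		/* must be inside a streaming block */
	/* ... send the stream-stop message ... */
	printf("stop\n");
	in_streaming = false;
}

int
main(void)
{
	stream_start(1234, true);	/* first chunk of xact 1234 */
	/* ... stream some changes ... */
	stream_stop();

	stream_start(1234, false);	/* later chunk of the same xact */
	stream_stop();
	return 0;
}
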
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +840,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions, so a
+ * linear search of the list is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record in the rel sync entry that we have already sent the schema of the
+ * relation within the given streamed transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -753,11 +1002,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
+/*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
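
In other words, commit flips schema_sent to true for every entry and the
XID is dropped from each entry's streamed_txns list; abort only drops the
XID. A standalone sketch, with a fixed-size array standing in for the List
and a single made-up entry:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct RelSyncEntry
{
	unsigned	relid;
	bool		schema_sent;
	uint32_t	streamed_txns[4];	/* stand-in for the List of xids */
	int			nstreamed;
} RelSyncEntry;

static void
cleanup_entry(RelSyncEntry *e, uint32_t xid, bool is_commit)
{
	if (is_commit)
		e->schema_sent = true;	/* downstream now has the schema */

	/* remove the xid from the streamed-txns list, if present */
	for (int i = 0; i < e->nstreamed; i++)
	{
		if (e->streamed_txns[i] == xid)
		{
			e->streamed_txns[i] = e->streamed_txns[--e->nstreamed];
			break;
		}
	}
}

int
main(void)
{
	RelSyncEntry e = {16394, false, {700, 701}, 2};

	cleanup_entry(&e, 700, true);	/* xact 700 committed */
	/* prints: schema_sent=1 nstreamed=1 */
	printf("schema_sent=%d nstreamed=%d\n", e.schema_sent, e.nstreamed);
	return 0;
}
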
 /*
  * Relcache invalidation callback
  */
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6fed3cfd23..e1344ab4cc 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -157,6 +157,12 @@ create_logical_replication_slot(char *name, char *plugin,
 											   .segment_close = wal_segment_close),
 									NULL, NULL, NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index adb7d7962e..9731b86d1f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1020,6 +1020,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..899d7e2013 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -980,7 +980,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
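
On the wire this reduces to a version gate: streaming can only be requested
with proto_version >= 2 and a plugin that fills in the stream callbacks. A
standalone sketch of the decision mirroring the pgoutput_startup logic,
with the ereports reduced to return codes:

#include <stdbool.h>
#include <stdio.h>

#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2

/* returns 0 on success, -1 if the option combination is invalid */
static int
decide_streaming(int proto_version, bool requested,
				 bool plugin_supports, bool *streaming)
{
	if (!requested)
	{
		*streaming = false;		/* disabled by default */
		return 0;
	}
	if (proto_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
		return -1;				/* protocol version too old */
	if (!plugin_supports)
		return -1;				/* plugin lacks the stream methods */
	*streaming = true;
	return 0;
}

int
main(void)
{
	bool		s;

	printf("%d\n", decide_streaming(1, true, true, &s));	/* -1 */
	printf("%d\n", decide_streaming(2, true, true, &s));	/*  0, s = true */
	return 0;
}
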
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ac1acbb27e..95132062c6 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction data is not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v25/v25-0006-Bugfix-handling-of-incomplete-toast-spec-insert-.patch

From 19eb8d18a62b4a07fe817d0ef522688742ed972c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Tue, 19 May 2020 18:55:23 +0530
Subject: [PATCH v25 06/12] Bugfix handling of incomplete toast/spec insert
 tuple

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 379 +++++++++++++-----
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  49 ++-
 5 files changed, 340 insertions(+), 109 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2d77107c4f..3927448f46 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45ef6..c841687c66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 366a8e6386..b79752cae4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -237,7 +252,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool partial_truncate);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -646,14 +662,89 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 	return txn;
 }
 
+/*
+ * Handle an incomplete tuple during streaming.  If streaming is enabled we
+ * may need to stream an in-progress transaction, but sometimes we receive
+ * changes that are incomplete on their own and cannot be streamed until the
+ * rest arrives, e.g. a toast-table insert without the matching main-table
+ * insert.  This function therefore remembers the LSN of the last complete
+ * change, and the size of the changes up to that LSN, so that when we do
+ * stream we stream only up to the last complete change.
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert, Size total_size)
+{
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * If this is the first incomplete change then remember the size of the
+	 * complete changes accumulated so far.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		(toast_insert || IsSpecInsert(change->action)))
+		toptxn->complete_size = total_size;
+
+	/*
+	 * If this is a toast insert then set the corresponding bit.  Both update
+	 * and insert can write to the toast table, and as explained in the
+	 * function header we cannot stream toast changes on their own.  So
+	 * whenever we see a toast insert we set the flag, and clear it again on
+	 * the next insert or update on the main table.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we see a speculative insert, to
+	 * indicate a partial tuple, and clear it on the speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If there is no incomplete change left after this change, record this
+	 * LSN as the last complete LSN.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)))
+	{
+		toptxn->last_complete_lsn = change->lsn;
+
+		/*
+		 * If the transaction is serialized and the changes in the top-level
+		 * transaction are now complete, stream the transaction immediately.
+		 * We don't wait for the memory limit to be reached again because, in
+		 * streaming mode, the transaction having been serialized means we
+		 * already hit the memory limit but could not stream at that time due
+		 * to an incomplete tuple; so stream it as soon as the tuple is
+		 * complete.
+		 */
+		if (rbtxn_is_serialized(txn))
+			ReorderBufferStreamTXN(rb, toptxn);
+	}
+}
+
 /*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
+	Size	total_size = 0;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
@@ -665,9 +756,28 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * Get the total size of the top transaction before accounting for the
+	 * current change, so that if this change starts an incomplete tuple we
+	 * know the size prior to it.  That is used to track the size of the
+	 * complete changes in the top transaction for streaming.
+	 */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			total_size = txn->toptxn->total_size;
+		else
+			total_size = txn->total_size;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* Handle the incomplete tuple, if streaming is enabled. */
+	if (ReorderBufferCanStream(rb))
+		ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert,
+										   total_size);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -697,7 +807,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1407,11 +1517,40 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 /*
  * Discard changes from a transaction (and subtransactions), after streaming
  * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ * If partial_truncate is set and the transaction has incomplete changes, we
+ * truncate only up to last_complete_lsn; otherwise we truncate the
+ * transaction completely.  A full truncate is used, in particular, when a
+ * concurrent abort is detected while processing the TXN.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 bool partial_truncate)
 {
 	dlist_mutable_iter iter;
+	ReorderBufferTXN *toptxn;
+
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1428,7 +1567,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, partial_truncate);
 	}
 
 	/* cleanup changes in the toplevel txn */
@@ -1438,30 +1577,28 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
+		/* Stop once we have truncated up to the last complete LSN. */
+		if (partial_truncate && rbtxn_has_incomplete_tuple(toptxn) &&
+			(change->lsn > toptxn->last_complete_lsn))
+		{
+			/*
+			 * If this is a top transaction then we can reset
+			 * last_complete_lsn and complete_size, because by now we have
+			 * streamed all the changes up to last_complete_lsn.
+			 */
+			if (txn->toptxn == NULL)
+			{
+				toptxn->last_complete_lsn = InvalidXLogRecPtr;
+				toptxn->complete_size = 0;
+			}
+			break;
+		}
+
 		/* remove the change from its containing list */
 		dlist_delete(&change->node);
-
 		ReorderBufferReturnChange(rb, change);
 	}
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The toplevel transaction, identified by (toptxn==NULL), is marked
-	 * as streamed always, even if it does not contain any changes (that
-	 * is, when all the changes are in subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts
-	 * for XIDs the downstream is not aware of. And of course, it always
-	 * knows about the toplevel xact (we send the XID in all messages),
-	 * but we never stream XIDs of empty subxacts.
-	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
 	 * any memory. We could also keep the hash table and update it with
@@ -1473,9 +1610,30 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
-	/* also reset the number of entries in the transaction */
-	txn->nentries_mem = 0;
-	txn->nentries = 0;
+	/*
+	 * Subtract the processed changes from nentries/nentries_mem; see the
+	 * detailed comment on nprocessed in the ReorderBufferTXN structure.  We
+	 * do this only if we are truncating the partial changes; otherwise we
+	 * reset these values directly to 0.
+	 */
+	if (partial_truncate)
+	{
+		txn->nentries -= txn->nprocessed;
+		txn->nentries_mem -= txn->nprocessed;
+	}
+	else
+	{
+		txn->nentries = 0;
+		txn->nentries_mem = 0;
+	}
+	txn->nprocessed = 0;
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
 }
 
 /*
@@ -1762,7 +1920,7 @@ ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
 								   ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed. */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Stop the stream. */
 	rb->stream_stop(rb, txn, last_lsn);
@@ -1794,6 +1952,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 	ReorderBufferChange *volatile specinsert = NULL;
 	volatile bool	stream_started = false;
+	volatile bool	partial_truncate = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1850,10 +2010,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			prev_lsn = change->lsn;
 
-			/* Set the xid for concurrent abort check. */
+			/* Per-change pre-processing for streaming mode. */
 			if (streaming)
+			{
+				/* Set the xid for concurrent abort check. */
 				SetupCheckXidLive(change->txn->xid);
 
+				/*
+				 * Increment the nprocessed count.  See the detailed comment
+				 * on its usage in the ReorderBufferTXN structure.
+				 */
+				change->txn->nprocessed++;
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -2116,6 +2285,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
 			}
+
+			/*
+			 * If the transaction contains an incomplete tuple and this is
+			 * the last complete change, stop further processing of the
+			 * transaction and set the partial truncate flag.
+			 */
+			if (rbtxn_has_incomplete_tuple(txn) &&
+				prev_lsn == txn->last_complete_lsn)
+			{
+				/* We should get here only in streaming mode. */
+				Assert(streaming);
+				partial_truncate = true;
+				break;
+			}
 		}
 
 		/*
@@ -2135,7 +2318,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * Done with current changes, call stream_stop callback for streaming
-		 * transaction, commit callback otherwise.  If we have sent
+		 * transaction, commit callback otherwise.  Only if we have sent
 		 * start/begin.
 		 */
 		if (stream_started)
@@ -2186,7 +2369,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * deallocate the ReorderBufferTXN.
 		 */
 		if (streaming)
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, partial_truncate);
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2515,7 +2698,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2564,7 +2747,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2587,6 +2770,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2601,8 +2785,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
-	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	/*
+	 * If streaming is supported, also track the size in the toplevel
+	 * transaction (the subxact's toptxn, or the transaction itself).
+	 */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2610,12 +2799,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2676,7 +2873,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2860,18 +3057,28 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+	Size	largest_size = 0;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
+		Size	size = 0;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/*
+		 * If this transaction has incomplete changes then consider only the
+		 * size up to the last complete LSN.
+		 */
+		if (rbtxn_has_incomplete_tuple(txn))
+			size = txn->complete_size;
+		else
+			size = txn->total_size;
+
+		/* If the current transaction is larger, remember it. */
+		if ((largest == NULL || size > largest_size) && size > 0)
+		{
 			largest = txn;
+			largest_size = size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2889,66 +3096,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
-	{
-		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
-		 */
-		txn = ReorderBufferLargestTopTXN(rb);
-
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
-
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
+	/* Loop until we are under the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* found a toplevel transaction with some complete changes */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict
+			 * it from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferSerializeTXN(rb, txn);
-	}
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
+			ReorderBufferSerializeTXN(rb, txn);
+
+			/*
+			 * After eviction, the transaction should have no entries in
+			 * memory, and should use 0 bytes for changes.
+			 */
+			Assert(txn->size == 0);
+			Assert(txn->nentries_mem == 0);
+		}
+	}
 
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
@@ -3344,10 +3531,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
-
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
 }
 
 /*
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b3e2b3f64b..4004fd6684 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* Does this transaction have a toast insert without the main table insert? */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * Does this transaction have a speculative insert without the speculative
+ * confirm?
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +221,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -350,6 +368,25 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of the top-level transaction including subtransactions. */
+	Size		total_size;
+
+	/* Size of the complete changes. */
+	Size		complete_size;
+
+	/* LSN of the last complete change. */
+	XLogRecPtr	last_complete_lsn;
+
+	/*
+	 * In streaming mode, we sometimes cannot stream all the changes because
+	 * of incomplete changes.  So we cannot simply reset nentries and
+	 * nentries_mem to 0 after one stream is sent, as we do in non-streaming
+	 * mode.  Instead, while sending one stream we count the changes
+	 * processed in that stream and decrement only that many from
+	 * nentries/nentries_mem.
+	 */
+	uint64		nprocessed;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -537,7 +574,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0
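
To make the incomplete-tuple bookkeeping in ReorderBufferHandleIncompleteTuple()
above easier to follow, here is a minimal standalone sketch of the same flag
logic. The types and names (SketchTXN, HAS_TOAST_INSERT, track_change) are
illustrative stand-ins only; the real code operates on ReorderBufferTXN and
ReorderBufferChange:

#include <stdbool.h>
#include <stdint.h>

#define HAS_TOAST_INSERT 0x01	/* toast insert seen, main-table change pending */
#define HAS_SPEC_INSERT  0x02	/* speculative insert seen, confirm pending */

typedef struct SketchTXN
{
	uint32_t	flags;
	uint64_t	last_complete_lsn;	/* LSN up to which we may stream */
	uint64_t	complete_size;		/* size of changes up to that LSN */
} SketchTXN;

static bool
has_incomplete_tuple(SketchTXN *txn)
{
	return (txn->flags & (HAS_TOAST_INSERT | HAS_SPEC_INSERT)) != 0;
}

/* Called for every queued change, in queue order. */
static void
track_change(SketchTXN *txn, bool toast_insert, bool spec_insert,
			 bool spec_confirm, bool main_insert_or_update,
			 uint64_t lsn, uint64_t size_before_change)
{
	/* Entering an incomplete run: remember the size of the complete part. */
	if (!has_incomplete_tuple(txn) && (toast_insert || spec_insert))
		txn->complete_size = size_before_change;

	if (toast_insert)
		txn->flags |= HAS_TOAST_INSERT;
	else if ((txn->flags & HAS_TOAST_INSERT) && main_insert_or_update)
		txn->flags &= ~HAS_TOAST_INSERT;	/* main-table change completes it */

	if (spec_insert)
		txn->flags |= HAS_SPEC_INSERT;
	else if (spec_confirm)
		txn->flags &= ~HAS_SPEC_INSERT;

	/* Everything queued so far is complete: this LSN is safe to stream to. */
	if (!has_incomplete_tuple(txn))
		txn->last_complete_lsn = lsn;
}
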

v25/v25-0002-Issue-individual-invalidations-with-wal_level-lo.patch

From 39038dc312020cc1f01dc25281e802366a0cf756 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v25 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in
memory and writes them only once, at commit time, which may reduce
the performance impact by amortizing the overhead and
deduplicating the invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c        |  40 +++++++
 src/backend/access/transam/xact.c             |   7 ++
 src/backend/replication/logical/decode.c      |  16 +++
 .../replication/logical/reorderbuffer.c       | 104 +++++++++++++++---
 src/backend/utils/cache/inval.c               |  49 +++++++++
 src/include/access/xact.h                     |  13 ++-
 src/include/replication/reorderbuffer.h       |  11 ++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce75565f..7ab0d11ea9 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3af8e81af1..e576b10055 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581d0f..69c1f45ef6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9509..b889edf461 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2202,6 +2216,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Queue invalidation messages as a change in the specified transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2589,6 +2630,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3002,6 +3066,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..cba5b6c64b 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end
+ *	to support decoding of in-progress transactions.  Previously it was
+ *	enough to log invalidations only at commit, because we only decoded the
+ *	transaction at commit time.  We need to log only the catalog cache and
+ *	relcache invalidations; there cannot be any active MVCC scan in logical
+ *	decoding, so we don't need to log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	SharedInvalidationMessage	*invalMessages = NULL;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs  = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b3816c..b822c5e4b2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..af35287896 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.23.0
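
For reference, a simplified sketch of how the new record body is built and
then replayed. The stand-in types and helpers (InvalMsg, XactInvalidations,
build_record, execute_invalidations) only mirror the shape of the real
structs; the real code registers the data with XLogRegisterData()/XLogInsert()
and applies each message via LocalExecuteInvalidationMessage():

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for SharedInvalidationMessage (really a union of message types). */
typedef struct InvalMsg
{
	int			id;
	unsigned	payload;
} InvalMsg;

/* Mirrors xl_xact_invalidations: fixed header + flexible array of messages. */
typedef struct XactInvalidations
{
	int			nmsgs;
	InvalMsg	msgs[];		/* FLEXIBLE_ARRAY_MEMBER in PostgreSQL */
} XactInvalidations;

#define MinSizeOfXactInvalidations offsetof(XactInvalidations, msgs)

/* Build the record body the way LogLogicalInvalidations() lays it out. */
static XactInvalidations *
build_record(const InvalMsg *msgs, int nmsgs)
{
	size_t		sz = MinSizeOfXactInvalidations + nmsgs * sizeof(InvalMsg);
	XactInvalidations *rec = malloc(sz);	/* error handling omitted */

	rec->nmsgs = nmsgs;
	memcpy(rec->msgs, msgs, nmsgs * sizeof(InvalMsg));
	return rec;
}

/* Replay side: apply each message, as ReorderBufferExecuteInvalidations does. */
static void
execute_invalidations(const XactInvalidations *rec,
					  void (*apply) (const InvalMsg *))
{
	for (int i = 0; i < rec->nmsgs; i++)
		apply(&rec->msgs[i]);
}
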

v25/v25-0011-Provide-new-api-to-get-the-streaming-changes.patch

From dfde89390e5f5a9432336eca0168024014790abb Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v25 11/12] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 doc/src/sgml/test-decoding.sgml               | 22 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..eed6e9d134 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 9f509fbc21..5fe6f28ba2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1243,6 +1243,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e848..70c28ffa91 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes then disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7869f721da..875e0bef28 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10115,6 +10115,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0
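
The decoding-side change in this patch is just the gating of ctx->streaming.
A condensed sketch of the flag plumbing, with stand-in types (Ctx,
get_changes_guts and the wrappers are simplified stand-ins for the real
functions), might look like this:

#include <stdbool.h>

typedef struct Ctx
{
	bool		streaming;	/* set when the plugin provides stream_* callbacks */
} Ctx;

/*
 * Common helper: 'confirm' advances the slot, 'binary' selects the output
 * mode, and 'streaming' says whether the caller wants in-progress changes.
 */
static void
get_changes_guts(Ctx *ctx, bool confirm, bool binary, bool streaming)
{
	/*
	 * If the caller has not asked for streaming changes, disable streaming
	 * even when the plugin provides the stream_* callbacks.
	 */
	ctx->streaming &= streaming;

	/* ... decode WAL and emit one row per change ... */
	(void) confirm;
	(void) binary;
}

/* The SQL-callable wrappers then differ only in the flags they pass. */
static void get_changes(Ctx *c)           { get_changes_guts(c, true, false, false); }
static void get_streaming_changes(Ctx *c) { get_changes_guts(c, true, false, true); }
static void peek_changes(Ctx *c)          { get_changes_guts(c, false, false, false); }
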

v25/v25-0012-Add-streaming-option-in-pg_dump.patch

From bda2de9a426814a541caa3a50a725441ee88c387 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v25 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index dfe43968b8..8ca4a05822 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBufferStr(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb3a9..af64270c55 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0

v25/v25-0003-Extend-the-output-plugin-API-with-stream-methods.patch

From 38b053bbc4294ab2b68f38545ad0283c1b39b218 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v25 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 213 +++++++++++++
 src/backend/replication/logical/logical.c | 365 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 811 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe620..1b56daa4bb 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and
+    <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
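+
+   <para>
+    For example, blocks for two concurrently streamed transactions may
+    interleave, and the second transaction may eventually abort
+    (illustrative sketch):
+<programlisting>
+stream_start_cb(...);   &lt;-- start of a block for transaction #1
+  stream_change_cb(...);
+  ...
+stream_stop_cb(...);    &lt;-- end of a block for transaction #1
+
+stream_start_cb(...);   &lt;-- start of a block for transaction #2
+  stream_change_cb(...);
+  ...
+stream_stop_cb(...);    &lt;-- end of a block for transaction #2
+
+stream_abort_cb(...);   &lt;-- abort of streamed transaction #2
+</programlisting>
+   </para>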
+
+   <para>
+    Similar to the spill-to-disk behavior, streaming is triggered when the
+    total amount of changes decoded from the WAL (for all in-progress
+    transactions) exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting. At that point, the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.  However, in some
+    cases we still have to spill to disk even if streaming is enabled,
+    because we may exceed the memory limit before decoding a complete tuple,
+    e.g. when we have decoded only the TOAST table insert but not yet the
+    corresponding main table insert.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
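
To make the wiring concrete, here is a minimal sketch of an output plugin init
function that opts into streaming; the pg_decode_stream_* names stand for
plugin-defined callbacks (such as the test_decoding ones above), and the
regular callbacks are omitted:

/*
 * Sketch only: registering the streaming callbacks described in the docs.
 */
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* regular callbacks (begin_cb, change_cb, commit_cb, ...) omitted */
	cb->stream_start_cb = pg_decode_stream_start;		/* required */
	cb->stream_stop_cb = pg_decode_stream_stop;			/* required */
	cb->stream_change_cb = pg_decode_stream_change;		/* required */
	cb->stream_commit_cb = pg_decode_stream_commit;		/* required */
	cb->stream_abort_cb = pg_decode_stream_abort;		/* required */
	cb->stream_message_cb = pg_decode_stream_message;	/* optional */
	cb->stream_truncate_cb = pg_decode_stream_truncate; /* optional */
}

Defining any one stream_* callback enables streaming for the decoding context;
the wrappers below then raise an ERROR at the first streamed action if one of
the five required callbacks is missing.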
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index dc69e5ce5f..0cff1ac393 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the stream start/stop/change/commit/
+	 * abort callbacks. The message and truncate callbacks are optional,
+	 * similar to regular output plugins. However, we enable streaming when
+	 * at least one of the methods is defined, so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional, so we
+	 * do not fail with an ERROR when they are missing; the wrappers simply
+	 * do nothing. We must still set the ReorderBuffer callbacks to
+	 * something, otherwise the calls from there would crash (we don't want
+	 * to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,321 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287896..65814af9f5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
 struct ReorderBuffer
 {
 	/*
@@ -392,6 +440,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0

v25/v25-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch

From 72d4c79f2b04b04efbebd1831951d4cbbf7a5b2e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v25 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such a sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)
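As a sketch of what "returns gracefully" means on the decoding side (the
actual handling is added to the reorderbuffer later in this series; ccxt
stands for the memory context saved before entering PG_TRY):

PG_TRY();
{
	/* ... apply the changes; catalog access goes through systable_* ... */
}
PG_CATCH();
{
	MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
	ErrorData  *errdata = CopyErrorData();

	if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
	{
		/* concurrent abort: discard this transaction and carry on */
		FlushErrorState();
		FreeErrorData(errdata);
	}
	else
	{
		MemoryContextSwitchTo(ecxt);
		PG_RE_THROW();
	}
}
PG_END_TRY();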

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 1b56daa4bb..5f7394f3c1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,9 +432,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94eb37d48d..2d77107c4f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * the tableam API level, but heap_getnext is called from many places, so
+	 * we need to ensure it here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out, if CheckXidAlive is aborted. We can't directly use
+ * TransactionIdDidAbort as after crash such transaction might not have been
+ * marked as aborted.  See detailed comments at snapmgr.c where the variable
+ * is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
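
For output plugin authors, the rule boils down to scanning catalogs only
through the systable_* APIs, so that a concurrent abort surfaces as an error.
A minimal sketch (the relation and the lack of scan keys are illustrative):

/*
 * Plain sequential scan of a (user) catalog table via the systable_* APIs.
 * A concurrent abort of the transaction being decoded raises
 * ERRCODE_TRANSACTION_ROLLBACK from within systable_getnext().
 */
static void
scan_user_catalog(Relation catrel)
{
	SysScanDesc scan;
	HeapTuple	tup;

	scan = systable_beginscan(catrel, InvalidOid, false, NULL, 0, NULL);

	while ((tup = systable_getnext(scan)) != NULL)
	{
		/* ... examine the tuple ... */
	}

	systable_endscan(scan);
}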
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..2f52b407c6 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..9f1ecd123f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for concurrent
+ * aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8c34935c34..9d890d3c4b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0

v25/v25-0005-Implement-streaming-mode-in-ReorderBuffer.patch

From 755b26495ec8ff55a81c34232046668d61244a2d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 May 2020 19:56:35 +0530
Subject: [PATCH v25 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes
we have in memory and invoke new stream API methods. This happens
in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().  However, if we have an incomplete TOAST or
speculative insert, we sometimes have to spill to disk, because we
cannot generate and stream the complete tuple.  As soon as we get the
complete tuple, we stream the transaction including the serialized
changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in the WAL right away, and
thanks to logging the invalidation messages.

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 758 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  31 +
 3 files changed, 750 insertions(+), 77 deletions(-)
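In pseudo-code, the decision made when the memory limit is exceeded looks
roughly like this (rbtxn_has_incomplete_tuple is an illustrative name for the
"incomplete toast/speculative insert" check, not necessarily the exact API in
the patch):

static void
ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
	ReorderBufferTXN *txn;

	if (rb->size < logical_decoding_work_mem * 1024L)
		return;

	/* pick the largest toplevel transaction by memory currently used */
	txn = ReorderBufferLargestTXN(rb);

	if (ReorderBufferCanStream(rb) && !rbtxn_has_incomplete_tuple(txn))
		ReorderBufferStreamTXN(rb, txn);	/* send changes to the plugin */
	else
		ReorderBufferSerializeTXN(rb, txn); /* spill changes to disk */
}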

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b889edf461..366a8e6386 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +382,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -772,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -865,6 +911,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -1024,6 +1073,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1090,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1315,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1340,6 +1404,80 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1491,57 +1629,171 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has catalog updates, we might decode a tuple using the
+ * wrong catalog version.  So to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction to which the current
+ * change belongs.  During a catalog scan we then check the status of that
+ * xid, and if it has aborted we report a specific error so that we can stop
+ * streaming the current transaction and discard the changes already
+ * streamed.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine, because when we decode the
+ * abort we will stream an abort message to truncate the changes in the
+ * subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+	 * aborted. That will happen during catalog access.  Also, reset the
+	 * bsysscan flag.
+	 */
+	if (!TransactionIdDidCommit(xid))
+	{
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse them while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.
+ */
+static void
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   Snapshot snapshot_now,
+								   CommandId command_id,
+								   XLogRecPtr last_lsn,
+								   ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+	ReorderBufferToastReset(rb, txn);
+	if (specinsert != NULL)
+		ReorderBufferReturnChange(rb, specinsert);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool	stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1564,21 +1816,44 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
-
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Start stream or begin transaction for the first change in the
+			 * current stream.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+					rb->stream_start(rb, txn, change->lsn);
+				else
+					rb->begin(rb, txn);
+				stream_started = true;
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1655,7 +1930,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1695,7 +1971,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +2029,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +2041,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,7 +2072,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1858,14 +2133,34 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes. If we have sent the start/begin
+		 * message, call the stream_stop callback for a streaming transaction,
+		 * or the commit callback otherwise.
+		 */
+		if (stream_started)
+		{
+			if (streaming)
+				rb->stream_stop(rb, txn, prev_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+			stream_started = false;
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2179,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
-
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+			ReorderBufferTruncateTXN(rb, txn);
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,17 +2214,118 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If the error code is ERRCODE_TRANSACTION_ROLLBACK, that means we
+		 * have detected a concurrent abort of the (sub)transaction we are
+		 * streaming.  So just do the cleanup and return gracefully.
+		 * Otherwise, re-throw the error.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * We can get this error only in streaming mode, because only in
+			 * streaming mode do we send an in-progress transaction.
+			 */
+			Assert(streaming);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/*
+			 * In the TRY block we only stop the stream after we have sent
+			 * all the changes.  So if we have detected a concurrent abort,
+			 * the stream cannot have been stopped yet.
+			 */
+			Assert(stream_started);
 
-		PG_RE_THROW();
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+
+			/* Handle the concurrent abort. */
+			ReorderBufferHandleConcurrentAbort(rb, txn, snapshot_now,
+											   command_id, prev_lsn,
+											   specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions have to be processed by ReorderBufferCommitChild() first,
+ * even if they were previously assigned to the toplevel transaction with
+ * ReorderBufferAssignChild().
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.  We iterate over the top and
+ * subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1946,6 +2350,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2426,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2568,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2586,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2598,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2648,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2733,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * TOCHECK: Mark toplevel transaction as having catalog changes too
+	 * if one of its children has.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2398,6 +2843,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming is supported, so their size
+ * is always 0). But here we can simply iterate over the limited number of
+ * toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
@@ -2418,15 +2895,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3254,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which attempts to
+ * stream it again before the commit)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3864,6 +4468,16 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases we assume the CID is from the future command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 65814af9f5..b3e2b3f64b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -224,6 +243,11 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -254,6 +278,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0

v25-0010-Add-TAP-test-for-streaming-vs.-DDL.patch:

From 964a1fb0350b1da0919b68c1a0c4822d87dd4154 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v25 10/12] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

#333Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#327)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 22, 2020 at 6:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Review comments:
------------------------------
1.
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
TransactionId xid,
}

case REORDER_BUFFER_CHANGE_MESSAGE:
- rb->message(rb, txn, change->lsn, true,
- change->data.msg.prefix,
- change->data.msg.message_size,
- change->data.msg.message);
+ if (streaming)
+ rb->stream_message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);
+ else
+ rb->message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);

Don't we need to set any_data_sent flag while streaming messages as we
do for other types of changes?

I think any_data_sent was added to avoid sending an abort to the
subscriber if we haven't sent any data, but this is not complete, as
the output plugin can also decide not to send anything. So I think
this should not be done as part of this patch and can be done
separately. I think there is already a thread for handling the
same[1].

Hmm, but prior to this patch we never used to send (empty) aborts,
whereas now that will be possible. It is probably okay to deal with
that in the other patch you mentioned, but I felt at least
any_data_sent would work for some cases. OTOH, it appears to be a
half-baked solution, so we should probably refrain from adding it.
BTW, how does the pgoutput plugin deal with it? I see that
apply_handle_stream_abort will unconditionally try to unlink the file,
and it will probably fail. Have you tested this scenario after your
latest changes?

Yeah, I see; I think this is a problem, but it exists without my
latest change as well: if pgoutput ignores some changes because they
are not published, we will see a similar error. Shall we handle the
ENOENT error case from unlink?
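Something along these lines, maybe (just a sketch; the surrounding
path handling is assumed):

/* Tolerate a missing file: it may never have been created. */
if (unlink(path) < 0 && errno != ENOENT)
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not remove file \"%s\": %m", path)));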

Isn't this problem only for the subxact file, as we anyway create the
changes file as part of the start stream message, which should have
come after the abort? If so, can't we detect whether the subxact file
exists, probably by using nsubxacts or something like that? Can you
please try to reproduce this scenario once, to ensure that we are not
missing anything?

I have tested this. As of now, by default we create both the changes
and subxact files, irrespective of whether we get any subtransactions
or not. Maybe this could be optimized so that we create the subxact
file only if we have any subxacts, and not otherwise? What's your
opinion on the same?

8.
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * TOCHECK: Mark toplevel transaction as having catalog changes too
+ * if one of its children has.
+ */
+ if (txn->toptxn != NULL)
+ txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}

Why are we marking top transaction here?

We need to mark top transaction to decide whether to build tuplecid
hash or not. In non-streaming mode, we are only sending during the
commit time, and during commit time we know whether the top
transaction has any catalog changes or not based on the invalidation
message so we are marking the top transaction there in DecodeCommit.
Since here we are not waiting till commit so we need to mark the top
transaction as soon as we mark any of its child transactions.

But how does it help? We use this flag (via
ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
anyway done in DecodeCommit and that too after setting this flag for
the top transaction if required. So, how will it help in setting it
while processing for subxid. Also, even if we have to do it won't it
add the xid needlessly in builder->committed.xip array?

In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
to build the tuplecid hash or not based on whether it has catalog
changes or not.

Okay, but you haven't answered the second part of the question: "won't
it add the xid of top transaction needlessly in builder->committed.xip
array, see function SnapBuildCommitTxn?" IIUC, this can happen
without patch as well because DecodeCommit also sets the flags just
based on invalidation messages irrespective of whether the messages
are generated by top transaction or not, is that right?

Yes, with or without the patch it always adds the topxid. I think the
purpose of doing this, with or without the patch, is not the snapshot;
instead we are marking the top transaction itself to record that some
of its subtxns have catalog changes, so that while building the
tuplecid hash we can know
whether to build the hash or not. But, having said that I feel in
ReorderBufferBuildTupleCidHash why do we need these two checks
if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;

I mean it should be enough to just have the following check, because
if we have added something to the tuplecids then catalog changes must
be there, since that is when we set the catalog-changes flag to true.

if (dlist_is_empty(&txn->tuplecids))
return;

I think in the base code there are multiple things going on:
1. If we get a new CID we always set the catalog-change flag in that
transaction, but add the tuplecids to the top transaction. So
basically, the top transaction is so far not marked as having catalog
changes, but it has tuplecids.
2. Now, in DecodeCommit the top xid will be marked as having catalog
changes based on the invalidation messages.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#334Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#329)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 25, 2020 at 8:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

4.
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
{
..

For this and other places in a patch like in function
stream_open_file(), instead of using TopMemoryContext, can we consider
using a new memory context LogicalStreamingContext or something like
that. We can create LogicalStreamingContext under TopMemoryContext. I
don't see any need of using TopMemoryContext here.

But when will we delete/reset the LogicalStreamingContext?

Why can't we reset it at each stream stop message?

Because we are planning to keep this memory as long as the worker is
alive, it is supposed to be in the top memory context.

Which part of allocation do we want to keep till the worker is alive?

static TransactionId *xids = NULL; we need to keep this for the whole
worker lifespan.

Why do we need the memory related to subxacts till the worker is alive?
As the code stands now, after reading the subxact info
(subxact_info_read), we need to ensure that it is freed after use, due
to which we have to remember it and perform pfree at various places.

I think we should explore the possibility of switching to this new
context in the start stream message and resetting it in the stop stream
message. That might help in avoiding
MemoryContextSwitchTo(TopMemoryContext) at various places.

Ok, I understand. I think subxacts can be allocated in the new
LogicalStreamingContext, which we can reset at the stream stop. How
about xids? Shall we create another context that will stay for the
worker's lifespan?

If we create any other context
with the same life span as TopMemoryContext then what is the point?

It is helpful for debugging. It is recommended that we don't use the
top memory context unless it is really required. Read about it in
src/backend/utils/mmgr/README.

I see.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#335Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#333)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, May 28, 2020 at 12:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Isn't this problem only for the subxact file, as we anyway create the
changes file as part of the start stream message, which should have
come after the abort? If so, can't we detect whether the subxact file
exists, probably by using nsubxacts or something like that? Can you
please try to reproduce this scenario once, to ensure that we are not
missing anything?

I have tested this. As of now, by default we create both the changes
and subxact files, irrespective of whether we get any subtransactions
or not. Maybe this could be optimized so that we create the subxact
file only if we have any subxacts, and not otherwise? What's your
opinion on the same?

Yeah, that makes sense.

8.
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * TOCHECK: Mark toplevel transaction as having catalog changes too
+ * if one of its children has.
+ */
+ if (txn->toptxn != NULL)
+ txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}

Why are we marking top transaction here?

We need to mark top transaction to decide whether to build tuplecid
hash or not. In non-streaming mode, we are only sending during the
commit time, and during commit time we know whether the top
transaction has any catalog changes or not based on the invalidation
message so we are marking the top transaction there in DecodeCommit.
Since here we are not waiting till commit so we need to mark the top
transaction as soon as we mark any of its child transactions.

But how does it help? We use this flag (via
ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
anyway done in DecodeCommit and that too after setting this flag for
the top transaction if required. So, how will it help in setting it
while processing for subxid. Also, even if we have to do it won't it
add the xid needlessly in builder->committed.xip array?

In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
to build the tuplecid hash or not based on whether it has catalog
changes or not.

Okay, but you haven't answered the second part of the question: "won't
it add the xid of top transaction needlessly in builder->committed.xip
array, see function SnapBuildCommitTxn?" IIUC, this can happen
without patch as well because DecodeCommit also sets the flags just
based on invalidation messages irrespective of whether the messages
are generated by top transaction or not, is that right?

Yes, with or without the patch it always adds the topxid. I think the
purpose of doing this, with or without the patch, is not the snapshot;
instead we are marking the top transaction itself to record that some
of its subtxns have catalog changes, so that while building the
tuplecid hash we can know
whether to build the hash or not. But, having said that I feel in
ReorderBufferBuildTupleCidHash why do we need these two checks
if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;

I mean it should be enough to just have the following check, because
if we have added something to the tuplecids then catalog changes must
be there, since that is when we set the catalog-changes flag to true.

if (dlist_is_empty(&txn->tuplecids))
return;

I think in the base code there are multiple things going on:
1. If we get a new CID we always set the catalog-change flag in that
transaction, but add the tuplecids to the top transaction. So
basically, the top transaction is so far not marked as having catalog
changes, but it has tuplecids.
2. Now, in DecodeCommit the top xid will be marked as having catalog
changes based on the invalidation messages.

I don't think it is advisable to remove that check from the base code
unless we have a strong reason for doing so. I think here you can write
better comments about why you are marking the flag for the top
transaction, and remove TOCHECK from the comment.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#336Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#334)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, May 28, 2020 at 12:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Why do we need the memory related to subxacts till the worker is alive?
As the code stands now, after reading the subxact info
(subxact_info_read), we need to ensure that it is freed after use, due
to which we have to remember it and perform pfree at various places.

I think we should explore the possibility of switching to this new
context in the start stream message and resetting it in the stop stream
message. That might help in avoiding
MemoryContextSwitchTo(TopMemoryContext) at various places.

Ok, I understand. I think subxacts can be allocated in the new
LogicalStreamingContext, which we can reset at the stream stop. How
about xids?

How about storing xids in ApplyContext? We do store similar lifespan
things in that context, for ex. see store_flush_position.
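To make it concrete, roughly something like this (an illustrative
sketch only; the variable names and sizes are made up, not patch code):

/* Created once under TopMemoryContext, reused across streams. */
static MemoryContext LogicalStreamingContext = NULL;

/* On the first stream start message: */
if (LogicalStreamingContext == NULL)
	LogicalStreamingContext = AllocSetContextCreate(TopMemoryContext,
													"LogicalStreamingContext",
													ALLOCSET_DEFAULT_SIZES);

/* Per-stream data, such as the subxacts array, lives here ... */
subxacts = MemoryContextAlloc(LogicalStreamingContext,
							  nsubxacts_max * sizeof(*subxacts));

/* ... while worker-lifetime data, such as xids, goes in ApplyContext. */
xids = MemoryContextAlloc(ApplyContext, nxids_max * sizeof(TransactionId));

/* On the stream stop message: */
MemoryContextReset(LogicalStreamingContext);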

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#337Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#336)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, May 28, 2020 at 3:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 28, 2020 at 12:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Why do we need the memory related to subxacts till the worker is alive?
As the code stands now, after reading the subxact info
(subxact_info_read), we need to ensure that it is freed after use, due
to which we have to remember it and perform pfree at various places.

I think we should explore the possibility of switching to this new
context in the start stream message and resetting it in the stop stream
message. That might help in avoiding
MemoryContextSwitchTo(TopMemoryContext) at various places.

Ok, I understand. I think subxacts can be allocated in the new
LogicalStreamingContext, which we can reset at the stream stop. How
about xids?

How about storing xids in ApplyContext? We do store similar lifespan
things in that context, for ex. see store_flush_position.

That sounds good to me, I will make this change in the next patch
set, along with other changes.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#338Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#335)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, May 28, 2020 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 28, 2020 at 12:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Isn't this problem only for the subxact file, as we anyway create the
changes file as part of the start stream message, which should have
come after the abort? If so, can't we detect whether the subxact file
exists, probably by using nsubxacts or something like that? Can you
please try to reproduce this scenario once, to ensure that we are not
missing anything?

I have tested this. As of now, by default we create both the changes
and subxact files, irrespective of whether we get any subtransactions
or not. Maybe this could be optimized so that we create the subxact
file only if we have any subxacts, and not otherwise? What's your
opinion on the same?

Yeah, that makes sense.

8.
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * TOCHECK: Mark toplevel transaction as having catalog changes too
+ * if one of its children has.
+ */
+ if (txn->toptxn != NULL)
+ txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}

Why are we marking top transaction here?

We need to mark top transaction to decide whether to build tuplecid
hash or not. In non-streaming mode, we are only sending during the
commit time, and during commit time we know whether the top
transaction has any catalog changes or not based on the invalidation
message so we are marking the top transaction there in DecodeCommit.
Since here we are not waiting till commit so we need to mark the top
transaction as soon as we mark any of its child transactions.

But how does it help? We use this flag (via
ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
anyway done in DecodeCommit and that too after setting this flag for
the top transaction if required. So, how will it help in setting it
while processing for subxid. Also, even if we have to do it won't it
add the xid needlessly in builder->committed.xip array?

In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
to build the tuplecid hash or not based on whether it has catalog
changes or not.

Okay, but you haven't answered the second part of the question: "won't
it add the xid of top transaction needlessly in builder->committed.xip
array, see function SnapBuildCommitTxn?" IIUC, this can happen
without patch as well because DecodeCommit also sets the flags just
based on invalidation messages irrespective of whether the messages
are generated by top transaction or not, is that right?

Yes, with or without the patch it always adds the topxid. I think the
purpose of doing this, with or without the patch, is not the snapshot;
instead we are marking the top transaction itself to record that some
of its subtxns have catalog changes, so that while building the
tuplecid hash we can know
whether to build the hash or not. But, having said that I feel in
ReorderBufferBuildTupleCidHash why do we need these two checks
if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;

I mean it should be enough to just have the following check, because
if we have added something to the tuplecids then catalog changes must
be there, since that is when we set the catalog-changes flag to true.

if (dlist_is_empty(&txn->tuplecids))
return;

I think in the base code there are multiple things going on:
1. If we get a new CID we always set the catalog-change flag in that
transaction, but add the tuplecids to the top transaction. So
basically, the top transaction is so far not marked as having catalog
changes, but it has tuplecids.
2. Now, in DecodeCommit the top xid will be marked as having catalog
changes based on the invalidation messages.

I don't think it is advisable to remove that check from the base code
unless we have a strong reason for doing so. I think here you can write
better comments about why you are marking the flag for the top
transaction, and remove TOCHECK from the comment.

Ok, I will do that.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#339Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#332)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 26, 2020 at 7:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Okay, sending again.

While reviewing/testing I have found a couple of problems in 0005 and
0006 which I have fixed in the attached version.

I haven't reviewed the new fixes yet but I have some comments on
0008-Add-support-for-streaming-to-built-in-replicatio.patch.
1.
I think the temporary files (and/or handles) used for storing the
information about changes and subxacts are getting leaked in the patch.
In some places care is taken to close the file, but in cases like
apply_handle_stream_commit, if any error occurs in apply_dispatch(),
the file might not get closed. Another place is
apply_handle_stream_abort(), where if there is an error in ftruncate
the file won't be closed. The bigger problem is with the changes file,
which is opened in apply_handle_stream_start and closed in
apply_handle_stream_stop; if there is any error in between, we won't
close it.

OTOH, the worker will exit on an error, so it might not matter; but
then why are we closing the file before the error in a few other
places? I think that on error these temporary files should be removed,
instead of relying on them to get removed the next time we receive
changes for the same transaction. That is what we do in other cases
where we use temporary files, like sorts or hash joins.

Also, what if the changes file size overflows the OS file size limit?
If we agree that the above are problems, then do you think we should
explore using the BufFile interface (see storage/file/buffile.c) to
avoid all such problems?
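To illustrate the idea, the BufFile usage pattern would be roughly as
follows (just a sketch with made-up variable names and most error
handling elided; apply_buffered_changes is a hypothetical helper):

/* Temporary file, registered for cleanup at end of transaction. */
BufFile    *changes_file = BufFileCreateTemp(false);

/* Append serialized changes as they arrive. */
BufFileWrite(changes_file, data, len);

/* Later, rewind and read everything back for applying. */
if (BufFileSeek(changes_file, 0, 0L, SEEK_SET) != 0)
	elog(ERROR, "could not rewind temporary change file");

while ((nread = BufFileRead(changes_file, buf, sizeof(buf))) > 0)
	apply_buffered_changes(buf, nread);

BufFileClose(changes_file);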

2.
apply_handle_stream_abort()
{
..
+ /* discard the subxacts added later */
+ nsubxacts = subidx;
+
+ /* write the updated subxact list */
+ subxact_info_write(MyLogicalRepWorker->subid, xid);
..
}

Here, even if nsubxacts becomes zero, subxact_info_write will still
create a new file and write the checksum. I think subxact_info_write
should have a check for nsubxacts > 0 before writing to the file.
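Something as simple as this near the top of subxact_info_write might do
(a sketch; whether a stale file from a previous run also needs to be
removed here, as discussed above, is a separate question):

/* Nothing to record: don't create an empty file with just a checksum. */
if (nsubxacts == 0)
	return;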

3.
apply_handle_stream_commit(StringInfo s)
{
..
+ /*
+ * send feedback to upstream
+ *
+ * XXX Probably should send a valid LSN. But which one?
+ */
+ send_feedback(InvalidXLogRecPtr, false, false);
..
}

Why do we need to send the feedback at this stage, after applying each
message? In the non-streamed case, we never call send_feedback after
each message. So, following that, I don't see the need to send it here,
but if you see any specific reason then do let me know. And if we have
to send feedback, then we need to decide on the appropriate values as
well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#340Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#332)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

While reviewing/testing I have found a couple of problems in 0005 and
0006 which I have fixed in the attached version.

..

In 0006: If we are streaming the serialized changes and there are
still a few incomplete changes, then currently we are not deleting the
spilled file; but the spill file contains all the changes of the
transaction, because there is no way to partially truncate it. So in
the next stream, it will try to resend those. I have fixed this by
sending the spilled transaction as soon as its changes are complete,
so ideally we can always delete the spilled file. It is also a better
solution because this transaction was already spilled once, and that
happened because we could not stream it; so we had better stream it at
the first opportunity. That will reduce the replay lag, which is our
whole purpose here.

I have reviewed these changes (in the patch
v25-0006-Bugfix-handling-of-incomplete-toast-spec-insert-) and below
are my comments.

1.
+ /*
+ * If the transaction is serialized and the changes are complete in
+ * the top level transaction then immediately stream the transaction.
+ * The reason for not waiting for memory limit to get full is that in
+ * the streaming mode, if the transaction serialized that means we have
+ * already reached the memory limit but that time we could not stream
+ * this due to incomplete tuple so now stream it as soon as the tuple
+ * is complete.
+ */
+ if (rbtxn_is_serialized(txn))
+ ReorderBufferStreamTXN(rb, toptxn);

I think it is important to explain here why we must stream a
previously serialized transaction, as otherwise we won't later be able
to know how to truncate the file.

2.
+ * If complete_truncate is set we completely truncate the transaction,
+ * otherwise we truncate upto last_complete_lsn if the transaction has
+ * incomplete changes.  Basically, complete_truncate is passed true only if
+ * concurrent abort is detected while processing the TXN.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+ bool partial_truncate)
 {

The description talks about a complete_truncate flag whereas the API
uses a partial_truncate flag. I think the description needs to be
changed.

3.
+ /* We have truncated upto last complete lsn so stop. */
+ if (partial_truncate && rbtxn_has_incomplete_tuple(toptxn) &&
+ (change->lsn > toptxn->last_complete_lsn))
+ {
+ /*
+ * If this is a top transaction then we can reset the
+ * last_complete_lsn and complete_size, because by now we would
+ * have stream all the changes upto last_complete_lsn.
+ */
+ if (txn->toptxn == NULL)
+ {
+ toptxn->last_complete_lsn = InvalidXLogRecPtr;
+ toptxn->complete_size = 0;
+ }
+ break;
+ }

I think here we can add an Assert to ensure that we don't partially
truncate when the transaction is serialized, and add comments for the
same.
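For example (a sketch, using the flags from the patch):

/* A serialized transaction is always streamed in full. */
Assert(!(partial_truncate && rbtxn_is_serialized(txn)));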

4.
+ /*
+ * Subtract the processed changes from the nentries/nentries_mem Refer
+ * detailed comment atop this variable in ReorderBufferTXN structure.
+ * We do this only ff we are truncating the partial changes otherwise
+ * reset these values directly to 0.
+ */
+ if (partial_truncate)
+ {
+ txn->nentries -= txn->nprocessed;
+ txn->nentries_mem -= txn->nprocessed;
+ }
+ else
+ {
+ txn->nentries = 0;
+ txn->nentries_mem = 0;
+ }

I think we can write this comment as "Adjust nentries/nentries_mem
based on the changes processed. See comments where nprocessed is
declared."

5.
+ /*
+ * In streaming mode, sometime we can't stream all the changes due to the
+ * incomplete changes.  So we can not directly reset the values of
+ * nentries/nentries_mem to 0 after one stream is sent like we do in
+ * non-streaming mode.  So while sending one stream we keep count of the
+ * changes processed in thi stream and only those many changes we decrement
+ * from the nentries/nentries_mem.
+ */
+ uint64 nprocessed;

How about something like: "Number of changes processed. This is used
to keep track of changes that remained to be streamed. As of now,
this can happen either due to toast tuples or speculative insertions
where we need to wait for multiple changes before we can send them."

6.
+ /* Size of the commplete changes. */
+ Size complete_size;

Typo. /commplete/complete

7.
+ /*
+ * Increment the nprocessed count.  See the detailed comment
+ * for usage of this in ReorderBufferTXN structure.
+ */
+ change->txn->nprocessed++;

Ideally, this has to be incremented after processing the change. So,
we can combine it with existing check in the patch as below:

if (streaming)
{
change->txn->nprocessed++;

if (rbtxn_has_incomplete_tuple(txn) &&
prev_lsn == txn->last_complete_lsn)
{
/* Only in streaming mode we should get here. */
Assert(streaming);
partial_truncate = true;
break;
}
}

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#341Amit Kapila
amit.kapila16@gmail.com
In reply to: Mahendra Singh Thalor (#331)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, May 27, 2020 at 5:19 PM Mahendra Singh Thalor <mahi6run@gmail.com>
wrote:

On Tue, 26 May 2020 at 16:46, Amit Kapila <amit.kapila16@gmail.com> wrote:

Hi all,
On top of the v16 patch set [1], I did some testing of DDLs and DMLs to
test WAL size and performance. Below is the testing summary:

*Test parameters:*
wal_level = 'logical'
max_connections = '150'
wal_receiver_timeout = '600s'
max_wal_size = '2GB'
min_wal_size = '2GB'
autovacuum= 'off'
checkpoint_timeout= '1d'

*Test results:*

(LSN diff is in bytes; "% LSN change" is the percentage WAL increase
with the patch relative to without the patch.)

CREATE index operations:

SN.  operation         without patch            with patch             % LSN change
                       LSN diff    time (s)     LSN diff    time (s)
1    1 DDL             17728       0.89116      18016       0.804868   1.624548
2    2 DDL             19872       0.860348     20416       0.839065   2.73752
3    3 DDL             22016       0.894891     22816       0.828028   3.63372093
4    4 DDL             24160       0.901686     25240       0.887143   4.4701986
5    5 DDL             26328       0.901686     27640       0.914078   4.9832877
6    6 DDL             28472       0.936385     30040       0.958226   5.5071649
7    8 DDL             32760       1.0022203    34864       0.966777   6.422466
8    11 DDL            50296       1.0022203    53144       0.966777   5.662478
9    15 DDL            58896       1.267253     62768       1.27234    5.662478
10   1 DDL & 3 DML     18240       0.812551     18536       0.819089   1.6228
11   3 DDL & 5 DML     23656       0.926616     24480       0.915517   3.4832606
12   10 DDL & 5 DML    52760       1.101005     55376       1.105241   4.958301744
13   10 DML            1008        0.791091     1072        0.807875   6.349206

Add col int(date) operations:

SN.  operation         without patch            with patch             % LSN change
                       LSN diff    time (s)     LSN diff    time (s)
1    1 DDL             976         0.764393     1088        0.763602   11.475409
2    2 DDL             1632        0.763199     1856        0.733147   13.7254902
3    3 DDL             2288        0.776871     2624        0.737177   14.685314
4    4 DDL             2944        0.768445     3392        0.768382   15.217391
5    5 DDL             3600        0.751879     4160        0.74709    15.555555
6    6 DDL             4256        0.745179     4928        0.725321   15.78947368
7    8 DDL             5568        0.757468     6464        0.769072   16.091954
8    11 DDL            7536        0.748332     8792        0.750553   16.666666
9    15 DDL            10184       0.776875     11864       0.746844   16.496465
10   1 DDL & 3 DML     1192        0.771993     1312        0.785117   10.067114
11   3 DDL & 5 DML     2656        0.758029     3016        0.797206   13.55421687
12   10 DDL & 5 DML    7288        0.763065     8456        0.779257   16.02634468
13   10 DML            1008        0.81105      1072        0.771113   6.349206

Add col text operations:

SN.  operation         without patch            with patch             % LSN change
                       LSN diff    time (s)     LSN diff    time (s)
1    1 DDL             33904       0.80044      34856       0.787108   2.80792
2    2 DDL             34560       0.806086     35624       0.829281   3.078703
3    3 DDL             35216       0.803493     36392       0.800194   3.339391186
4    4 DDL             35872       0.77489      37160       0.82777    3.590544
5    5 DDL             36528       0.817928     37928       0.820621   3.832676
6    6 DDL             37184       0.797043     38696       0.814535   4.066265
7    8 DDL             38496       0.83207      40232       0.903604   4.509559
8    11 DDL            40464       0.822266     42560       0.797133   5.179913
9    15 DDL            43112       0.821916     45632       0.812567   5.84524
10   1 DDL & 3 DML     34120       0.849467     35080       0.855456   2.8113599
11   3 DDL & 5 DML     35584       0.829377     36784       0.839176   3.372302
12   10 DDL & 5 DML    40216       0.837843     42224       0.835206   4.993037
13   10 DML            1008        0.78817      1072        0.759789   6.349206

To see all operations, please see [2] test_results:
https://docs.google.com/spreadsheets/d/1g11MrSd_I39505OnGoLFVslz3ykbZ1nmfR_gUiE_O9k/edit?usp=sharing

Why are you seeing any additional WAL in case 13 (10 DML), where there is
no DDL? I think it is because you have used savepoints in that case, which
add some additional WAL. You seem to have 9 savepoints in that test, which
should ideally generate 36 bytes of additional WAL (4 bytes per
transaction id for each subtransaction). Also, in the other cases where
you took data for DDL and DML, you used savepoints in those tests as well.
For savepoints, I suggest we do separate tests as you have done in case
13, but with 3, 5, 7, and 10 savepoints, where each transaction updates a
row of 200 bytes or so, along the lines of the sketch below.
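
A minimal psql sketch of such a savepoint test (the table and payload
sizes here are illustrative assumptions, not taken from the original
runs), measuring WAL volume by diffing pg_current_wal_lsn() around the
transaction:

CREATE TABLE sp_test (id int PRIMARY KEY, payload text);
INSERT INTO sp_test SELECT g, repeat('x', 200) FROM generate_series(1, 10) g;

SELECT pg_current_wal_lsn() AS start_lsn \gset
BEGIN;
UPDATE sp_test SET payload = repeat('a', 200) WHERE id = 1;
SAVEPOINT s1;
UPDATE sp_test SET payload = repeat('b', 200) WHERE id = 2;
SAVEPOINT s2;
UPDATE sp_test SET payload = repeat('c', 200) WHERE id = 3;
SAVEPOINT s3;
UPDATE sp_test SET payload = repeat('d', 200) WHERE id = 4;
COMMIT;
-- WAL bytes generated by the transaction above
SELECT pg_current_wal_lsn() - :'start_lsn'::pg_lsn AS wal_bytes;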

I think you can take data for somewhat more realistic cases of DDL and DML
combinations, like 3 DDLs with 10 DMLs and 3 DDLs with 15 DMLs; in
general, I think we will see many more DMLs per DDL. It is good to see
the worst-case WAL and performance overhead as you have done. One shape
for such a mixed transaction is sketched below.
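
A hedged sketch of a 3-DDL / 10-DML transaction (table, column, and row
names are illustrative assumptions, not from the original tests):

BEGIN;
ALTER TABLE tab ADD COLUMN c1 int;                               -- DDL 1
ALTER TABLE tab ADD COLUMN c2 int;                               -- DDL 2
ALTER TABLE tab ADD COLUMN c3 int;                               -- DDL 3
INSERT INTO tab (id) SELECT g FROM generate_series(101, 110) g;  -- DML 1
UPDATE tab SET c1 = 1 WHERE id <= 10;                            -- DML 2
UPDATE tab SET c2 = 2 WHERE id <= 10;                            -- DML 3
UPDATE tab SET c3 = 3 WHERE id <= 10;                            -- DML 4
DELETE FROM tab WHERE id > 105;                                  -- DML 5
-- ... repeat similar statements up to 10 (or 15) DMLs
COMMIT;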

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#342Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#340)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 29, 2020 at 2:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

While reviewing/testing I have found a couple of problems in 0005 and
0006 which I have fixed in the attached version.

..

In 0006: If we are streaming the serialized changes and there are
still a few incomplete changes, then currently we are not deleting the
spilled file; but the spilled file contains all the changes of the
transaction, because there is no way to partially truncate it. So in
the next stream, it will try to resend those. I have fixed this by
sending the spilled transaction as soon as its changes are complete, so
ideally we can always delete the spilled file. It is also a better
solution because this transaction was already spilled once, and that
happened because we could not stream it; so we had better stream it at
the first opportunity, which will reduce the replay lag, and that is our
whole purpose here.

I have reviewed these changes (in the patch
v25-0006-Bugfix-handling-of-incomplete-toast-spec-insert-) and below
are my comments.

1.
+ /*
+ * If the transaction is serialized and the the changes are complete in
+ * the top level transaction then immediately stream the transaction.
+ * The reason for not waiting for memory limit to get full is that in
+ * the streaming mode, if the transaction serialized that means we have
+ * already reached the memory limit but that time we could not stream
+ * this due to incomplete tuple so now stream it as soon as the tuple
+ * is complete.
+ */
+ if (rbtxn_is_serialized(txn))
+ ReorderBufferStreamTXN(rb, toptxn);

I think here it is important to explain why it is a must to stream a
previously serialized transaction, as otherwise we won't later be able
to know how to truncate the file.

Done

2.
+ * If complete_truncate is set we completely truncate the transaction,
+ * otherwise we truncate upto last_complete_lsn if the transaction has
+ * incomplete changes.  Basically, complete_truncate is passed true only if
+ * concurrent abort is detected while processing the TXN.
*/
static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+ bool partial_truncate)
{

The description talks about the complete_truncate flag whereas the API
is using the partial_truncate flag. I think the description needs to be
changed.

Fixed

3.
+ /* We have truncated upto last complete lsn so stop. */
+ if (partial_truncate && rbtxn_has_incomplete_tuple(toptxn) &&
+ (change->lsn > toptxn->last_complete_lsn))
+ {
+ /*
+ * If this is a top transaction then we can reset the
+ * last_complete_lsn and complete_size, because by now we would
+ * have stream all the changes upto last_complete_lsn.
+ */
+ if (txn->toptxn == NULL)
+ {
+ toptxn->last_complete_lsn = InvalidXLogRecPtr;
+ toptxn->complete_size = 0;
+ }
+ break;
+ }

I think here we can add an Assert to ensure that we don't partially
truncate when the transaction is serialized, and add a comment to that
effect.

Done

4.
+ /*
+ * Subtract the processed changes from the nentries/nentries_mem Refer
+ * detailed comment atop this variable in ReorderBufferTXN structure.
+ * We do this only ff we are truncating the partial changes otherwise
+ * reset these values directly to 0.
+ */
+ if (partial_truncate)
+ {
+ txn->nentries -= txn->nprocessed;
+ txn->nentries_mem -= txn->nprocessed;
+ }
+ else
+ {
+ txn->nentries = 0;
+ txn->nentries_mem = 0;
+ }

I think we can write this comment as "Adjust nentries/nentries_mem
based on the changes processed. See comments where nprocessed is
declared."

5.
+ /*
+ * In streaming mode, sometime we can't stream all the changes due to the
+ * incomplete changes.  So we can not directly reset the values of
+ * nentries/nentries_mem to 0 after one stream is sent like we do in
+ * non-streaming mode.  So while sending one stream we keep count of the
+ * changes processed in thi stream and only those many changes we decrement
+ * from the nentries/nentries_mem.
+ */
+ uint64 nprocessed;

How about something like: "Number of changes processed. This is used
to keep track of changes that remain to be streamed. As of now,
this can happen either due to toast tuples or speculative insertions,
where we need to wait for multiple changes before we can send them."

Done

6.
+ /* Size of the commplete changes. */
+ Size complete_size;

Typo. /commplete/complete

7.
+ /*
+ * Increment the nprocessed count.  See the detailed comment
+ * for usage of this in ReorderBufferTXN structure.
+ */
+ change->txn->nprocessed++;

Ideally, this has to be incremented after processing the change. So,
we can combine it with the existing check in the patch, as below:

if (streaming)
{
change->txn->nprocessed++;

if (rbtxn_has_incomplete_tuple(txn) &&
prev_lsn == txn->last_complete_lsn)
{
/* Only in streaming mode should we get here. */
Assert(streaming);
partial_truncate = true;
break;
}
}

Done

Apart from this, there was one more issue in this patch
+ if (partial_truncate && rbtxn_has_incomplete_tuple(toptxn) &&
+ (change->lsn > toptxn->last_complete_lsn))
+ {
+ /*
+ * If this is a top transaction then we can reset the
+ * last_complete_lsn and complete_size, because by now we would
+ * have stream all the changes upto last_complete_lsn.
+ */
+ if (txn->toptxn == NULL)
+ {
+ toptxn->last_complete_lsn = InvalidXLogRecPtr;
+ toptxn->complete_size = 0;
+ }
+ break;

We should reset toptxn->last_complete_lsn and toptxn->complete_size
outside this {(change->lsn > toptxn->last_complete_lsn)} check,
because we might be in a subxact when we meet this condition; in that
case we never reach here for the toptxn, so it would never get reset.
I have fixed this.

Apart from this, there is one more fix in 0005: CheckXidAlive was
never reset, so I have fixed that as well.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v26.tar (application/x-tar)
v26/v26-0011-Provide-new-api-to-get-the-streaming-changes.patch

From 4023481bc5a8faa5ce58aee0d4e5165ed15e9717 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v26 11/12] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 doc/src/sgml/test-decoding.sgml               | 22 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..eed6e9d134 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 9f509fbc21..5fe6f28ba2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1243,6 +1243,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e848..70c28ffa91 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes then disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7869f721da..875e0bef28 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10115,6 +10115,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0

v26/v26-0005-Implement-streaming-mode-in-ReorderBuffer.patch

From 39a75251078bac2db4740c36c2a58c5792aca7ed Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 May 2020 19:56:35 +0530
Subject: [PATCH v26 05/12] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes
we have in memory and invoke new stream API methods. This happens
in ReorderBufferStreamTXN() using about the same logic as in
ReorderBufferCommit().  However, sometimes, if we have an incomplete
toast or speculative insert, we spill to disk because we cannot
generate the complete tuple and stream it.  Then, as soon as we get
the complete tuple, we stream the transaction including the serialized
changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 768 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  31 +
 3 files changed, 761 insertions(+), 76 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b889edf461..af94c6d074 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +382,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -772,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -865,6 +911,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -1024,6 +1073,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1090,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1315,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1340,6 +1404,80 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1491,57 +1629,171 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has catalog updates then we might decode tuples using the
+ * wrong catalog version.  So to detect a concurrent abort we set
+ * CheckXidAlive to the xid of the current (sub)transaction to which this
+ * change belongs.  During catalog scans we can then check the status of
+ * that xid and, if it is aborted, report a specific error so that we can
+ * stop streaming the current transaction and discard the already streamed
+ * changes on such an error.  We might already have streamed some changes
+ * for the aborted (sub)transaction, but that is fine: when we decode the
+ * abort we will stream an abort message to truncate them on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as a CheckXidAlive then
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if it's not committed yet. We don't check whether
+	 * the xid is aborted; that will happen during catalog access.  Also,
+	 * reset the bsysscan flag.
+	 */
+	if (!TransactionIdDidCommit(xid))
+	{
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse them when sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.
+ */
+static void
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   Snapshot snapshot_now,
+								   CommandId command_id,
+								   XLogRecPtr last_lsn,
+								   ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+	ReorderBufferToastReset(rb, txn);
+	if (specinsert != NULL)
+		ReorderBufferReturnChange(rb, specinsert);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool	stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1564,21 +1816,44 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
-
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Start stream or begin transaction for the first change in the
+			 * current stream.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+					rb->stream_start(rb, txn, change->lsn);
+				else
+					rb->begin(rb, txn);
+				stream_started = true;
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1655,7 +1930,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1695,7 +1971,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +2029,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +2041,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,7 +2072,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1858,14 +2133,34 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes.  If we have sent the start/begin, call
+		 * the stream_stop callback for a streaming transaction and the
+		 * commit callback otherwise.
+		 */
+		if (stream_started)
+		{
+			if (streaming)
+				rb->stream_stop(rb, txn, prev_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+			stream_started = false;
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if transaction is streaming
+		 * otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2179,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,17 +2219,122 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/* Reset the CheckXidAlive */
+		if (streaming)
+			CheckXidAlive = InvalidTransactionId;
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If the error code is ERRCODE_TRANSACTION_ROLLBACK, that means we
+		 * have detected a concurrent abort of the (sub)transaction we are
+		 * streaming.  So just do the cleanup and return gracefully.
+		 * Otherwise, re-throw the error.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * Only in streaming mode can we get this error, because only in
+			 * streaming mode do we send in-progress transactions.
+			 */
+			Assert(streaming);
 
-		PG_RE_THROW();
+			/*
+			 * In the TRY block we only stop the stream after we have sent
+			 * all the changes.  So if we have detected the concurrent abort,
+			 * the stream should not have been stopped yet.
+			 */
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+
+			/* Handle the concurrent abort. */
+			ReorderBufferHandleConcurrentAbort(rb, txn, snapshot_now,
+											   command_id, prev_lsn,
+											   specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.  We iterate over the top and
+ * subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Access the main routine to decode the changes and send to output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1946,6 +2359,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2435,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2577,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2595,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2607,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2657,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2742,16 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark the toplevel transaction as having catalog changes too if one of
+	 * its children has, so that ReorderBufferBuildTupleCidHash can conveniently
+	 * check just the toplevel transaction and decide whether we need to build
+	 * the hash table or not.  In non-streaming mode we mark the toplevel
+	 * transaction in DecodeCommit as we only stream on commit.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2398,6 +2855,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because with streaming we don't update
+ * the memory accounting for subtransactions, so it's always 0). But we can
+ * simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the transaction to evict and spill the changes to disk.
@@ -2418,15 +2907,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/*
 	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by serializing it to disk.
+	 * memory by streaming, if supported. Otherwise spill to disk.
 	 */
-	txn = ReorderBufferLargestTXN(rb);
+	if (ReorderBufferCanStream(rb))
+	{
+		/*
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming the already decoded part.
+		 */
+		txn = ReorderBufferLargestTopTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn && !txn->toptxn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
 
-	ReorderBufferSerializeTXN(rb, txn);
+		ReorderBufferStreamTXN(rb, txn);
+	}
+	else
+	{
+		/*
+		 * Pick the largest transaction (or subtransaction) and evict it from
+		 * memory by serializing it to disk.
+		 */
+		txn = ReorderBufferLargestTXN(rb);
+
+		/* we know there has to be one, because the size is not zero */
+		Assert(txn);
+		Assert(txn->size > 0);
+		Assert(rb->size >= txn->size);
+
+		ReorderBufferSerializeTXN(rb, txn);
+	}
 
 	/*
 	 * After eviction, the transaction should have no entries in memory, and
 	 * should use 0 bytes for changes.
+	 *
+	 * XXX Checking the size is fine for both cases - spill to disk and
+	 * streaming. But for streaming we should really check nentries_mem for
+	 * all subtransactions too.
 	 */
 	Assert(txn->size == 0);
 	Assert(txn->nentries_mem == 0);
@@ -2746,6 +3266,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which attempts to
+ * stream it again before the commit)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We can not use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Access the main routine to decode the changes and send to output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3864,6 +4480,16 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases we assume the CID is from the future command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 65814af9f5..b3e2b3f64b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -224,6 +243,11 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -254,6 +278,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0

v26/v26-0009-Enable-streaming-for-all-subscription-TAP-tests.patch

From 03b885eea36bcd6395f6bfb9560e89b2ee61f004 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v26 09/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318fc7c..6f7bedc130 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f871a..4ba80869b9 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0
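
For reference, the setup these tests exercise boils down to the
following sketch (node names and the connection string are
illustrative; the small logical_decoding_work_mem merely pushes even
modest transactions over the limit so they get streamed):

    -- on the publisher
    ALTER SYSTEM SET logical_decoding_work_mem = '64kB';
    SELECT pg_reload_conf();
    CREATE PUBLICATION tap_pub FOR ALL TABLES;

    -- on the subscriber
    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION tap_pub
        WITH (streaming = on);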

v26/v26-0001-Immediately-WAL-log-assignments.patch

From 975234c54873bf67f34488809b9d42a8eaf02af3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 20 Mar 2020 15:03:01 +0530
Subject: [PATCH v26 01/12] Immediately WAL-log assignments

The logical decoding infrastructure needs to know which top-level
transaction a subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching of subxid assignments (PGPROC_MAX_CACHED_SUBXIDS), preventing
features that require incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we cannot
remove the existing XLOG_XACT_ASSIGNMENT records, as those are still
required to avoid overflow in the hot standby snapshot.
---
 src/backend/access/transam/xact.c        | 46 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 22 ++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 39 ++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 99 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd30b62d36..3af8e81af1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to toplevel XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,46 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ *	IsSubTransactionAssignmentPending
+ *
+ *	This returns true if we are inside a valid subtransaction, for which
+ *	the assignment was not yet written to any WAL record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and the assignment must not have been WAL-logged yet */
+	return !CurrentTransactionState->assigned;
+
+}
+
+/*
+ *	MarkSubTransactionAssigned
+ *
+ *	Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09e..53be2b3059 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId	xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogInsertRecord) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5995798b58..560ec27fa0 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1195,6 +1195,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1233,6 +1234,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..122c581d0f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -93,12 +93,28 @@ static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
-	XLogRecordBuffer buf;
+	XLogRecordBuffer	buf;
+	TransactionId		txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the toplevel xid is valid, we need to assign the subxact to the
+	 * toplevel xact. We need to do this for all records, hence we do it before
+	 * the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -217,12 +233,12 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
 	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+	 * However, it's critical to process records with subxid assignment even
 	 * when the snapshot is being built: it is possible to get later records
 	 * that require subxids to be properly assigned.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+		!TransactionIdIsValid(XLogRecGetTopXid(r)))
 		return;
 
 	switch (info)
@@ -264,22 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
-
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04babc2..8645b3816c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d..26426cc779 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of toplevel xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index c21b0ba972..83170a663c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId	toplevel_xid;	/* XID of toplevel transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -308,6 +310,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0
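
To sketch the effect of this patch (table name hypothetical): with
wal_level = logical, the first WAL record a subtransaction writes after
acquiring an XID now also carries the toplevel XID, so the decoder can
associate the two immediately rather than waiting for the
XLOG_XACT_ASSIGNMENT record written at commit:

    BEGIN;
    INSERT INTO t VALUES (1);  -- toplevel xact acquires an XID
    SAVEPOINT s1;
    INSERT INTO t VALUES (2);  -- subxact acquires an XID; its next WAL
                               -- record includes the toplevel XID too
    COMMIT;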

v26/v26-0010-Add-TAP-test-for-streaming-vs.-DDL.patch

From 9a91d770612ca3a7e8e73b69ddd315bfa80f9abe Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v26 10/12] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v26/v26-0007-Track-statistics-for-streaming.patch

From 0f4380a2ff1543baf3df437935d77a375bb9c060 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Tue, 19 May 2020 19:08:16 +0530
Subject: [PATCH v26 07/12] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 33 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 12 +++++++
 src/backend/replication/walsender.c           | 32 +++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 49d4bb13b9..0fc896ca7e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2453,6 +2453,39 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Amount of decoded transaction data spilled to disk.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_txns</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of in-progress transactions streamed to the subscriber after
+       the memory used by logical decoding exceeds
+       <literal>logical_decoding_work_mem</literal>. Streaming only works
+       with toplevel transactions (subtransactions cannot be streamed
+       independently), so the counter is not incremented for subtransactions.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times in-progress transactions were streamed to the
+       subscriber. Transactions may get streamed repeatedly, and this
+       counter is incremented on every such invocation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_bytes</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Amount of decoded in-progress transaction data streamed to the subscriber.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 56420bbc9d..9f509fbc21 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5fcb125664..1baa4f91cc 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -348,6 +348,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3547,6 +3551,14 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferFreeSnap(rb, txn->snapshot_now);
 	}
 
+	/* Update the stream statistics. */
+	rb->streamCount += 1;
+	rb->streamBytes += (rbtxn_has_incomplete_tuple(txn)) ?
+						txn->complete_size : txn->total_size;
+
+	/* Don't count a transaction that was already streamed before. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	/*
 	 * Access the main routine to decode the changes and send to output plugin.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 86847cbb54..adb7d7962e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1353,7 +1353,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1374,7 +1374,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or were
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2423,6 +2424,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3258,7 +3262,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3316,6 +3320,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3341,6 +3348,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3443,6 +3453,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* statistics for streaming of over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3691,11 +3706,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 	MyWalSnd->spillCount = rb->spillCount;
 	MyWalSnd->spillBytes = rb->spillBytes;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockRelease(&MyWalSnd->mutex);
 }
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 61f2c2f5b4..7869f721da 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 2d86209f61..399f3e49f2 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -549,15 +549,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b813e32215..cf22f8a038 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0
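
With these counters in place, streaming activity can be monitored next
to the existing spill statistics; a simple query against the extended
view (the column names are exactly those added above) might look like:

    SELECT application_name,
           spill_txns, spill_count, spill_bytes,
           stream_txns, stream_count, stream_bytes
      FROM pg_stat_replication;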

v26/v26-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch

From 72d4c79f2b04b04efbebd1831951d4cbbf7a5b2e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v26 04/12] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from the system table scan APIs to the backend decoding the
uncommitted transaction. On receipt of this sqlerrcode, the decoding
logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 1b56daa4bb..5f7394f3c1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -432,9 +432,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94eb37d48d..2d77107c4f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * the tableam API level, but heap_getnext is called from many places, so
+	 * we need to ensure it here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted. We can't directly use
+ * TransactionIdDidAbort, as after a crash such a transaction might not
+ * have been marked as aborted.  See detailed comments at snapmgr.c where
+ * the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..2f52b407c6 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..9f1ecd123f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for concurrent
+ * aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8c34935c34..9d890d3c4b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0

v26/v26-0012-Add-streaming-option-in-pg_dump.patch

From 9ad32b40c8da566cbe44c0b36af95f919f30b0bd Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v26 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index dfe43968b8..8ca4a05822 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb3a9..af64270c55 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char	   *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0

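For illustration (object names made up): a subscription created with
WITH (streaming = on) has pg_subscription.substream = 't', so the command
reconstructed by dumpSubscription() would read roughly

CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=src' PUBLICATION pub1
    WITH (connect = false, slot_name = 'sub1', streaming = on);

Since the option is appended only when substream is not 'f', dumps of
non-streaming subscriptions are unchanged.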
v26/v26-0003-Extend-the-output-plugin-API-with-stream-methods.patch
From 38b053bbc4294ab2b68f38545ad0283c1b39b218 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v26 03/12] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 213 +++++++++++++
 src/backend/replication/logical/logical.c | 365 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 811 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bad3bfe620..1b56daa4bb 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -388,6 +388,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -400,6 +407,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -678,6 +694,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -746,4 +868,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting.  At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may cross the memory limit while the
+    current tuple is still incomplete, e.g. having decoded the toast table
+    insert but not yet the main table insert.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index dc69e5ce5f..0cff1ac393 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the change/commit/abort/start/stop
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. We however enable streaming when at least one
+	 * of the methods is defined, so that we can easily identify missing
+	 * methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,321 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287896..65814af9f5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
 struct ReorderBuffer
 {
 	/*
@@ -392,6 +440,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0

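Seen from the plugin side, opting into streaming is a matter of registering
the new callbacks. A minimal sketch modeled on the test_decoding changes
above (the my_* stubs are hypothetical; PG_MODULE_MAGIC and the regular
begin/change/commit callbacks are omitted for brevity):

#include "postgres.h"

#include "replication/output_plugin.h"

static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* open a block of streamed changes for this transaction */
}

static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* close the current block of streamed changes */
}

static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)
{
	/* emit a single change of the in-progress transaction */
}

static void
my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 XLogRecPtr commit_lsn)
{
	/* apply/confirm the previously streamed transaction */
}

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	/* discard the previously streamed (sub)transaction */
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* regular callbacks (begin_cb, change_cb, commit_cb, ...) go here */

	/*
	 * Defining any stream_* callback marks the plugin as streaming-capable;
	 * the five below are then required, while stream_message_cb and
	 * stream_truncate_cb stay optional.
	 */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_change_cb = my_stream_change;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_abort_cb = my_stream_abort;
}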
v26/v26-0006-Bugfix-handling-of-incomplete-toast-spec-insert-.patch
From 1480ecadffaecec491c33b2ac8b3ad1c42a06ac3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Fri, 29 May 2020 14:52:44 +0530
Subject: [PATCH v26 06/12] Bugfix handling of incomplete toast/spec insert
 tuple

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 388 +++++++++++++-----
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  47 ++-
 5 files changed, 347 insertions(+), 109 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2d77107c4f..3927448f46 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
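+			/*
+			 * Mark inserts into toast relations: logical decoding uses this
+			 * to detect incomplete tuples when streaming an in-progress
+			 * transaction.
+			 */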
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45ef6..c841687c66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index af94c6d074..5fcb125664 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
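+/*
+ * Helper macros to classify change actions, used below when tracking
+ * incomplete (toast / speculative insert) tuples for streaming.
+ */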
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -237,7 +252,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool partial_truncate);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -646,14 +662,91 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 	return txn;
 }
 
+/*
+ * Handle incomplete tuples during streaming.  If streaming is enabled then
+ * we might need to stream an in-progress transaction, but sometimes we get
+ * incomplete changes which we cannot stream until the change is complete,
+ * e.g. a toast table insert without the main table insert.  So this function
+ * remembers the LSN of the last complete change, and the size of all changes
+ * up to that LSN, so that if we have to stream we stream only up to the last
+ * complete LSN.
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert, Size total_size)
+{
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * If this is the first incomplete change, remember the size of the
+	 * complete changes (i.e. the size before this change was added).
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		(toast_insert || IsSpecInsert(change->action)))
+		toptxn->complete_size = total_size;
+
+	/*
+	 * If this is a toast insert then set the corresponding bit.  Both inserts
+	 * and updates on the main table may insert into the toast table first,
+	 * and as explained in the function header we cannot stream toast changes
+	 * on their own.  So whenever we see a toast insert we set the flag, and
+	 * clear it again on the next insert or update on the main table.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial tuple and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If there is no incomplete change left after this change, record this
+	 * LSN as the last complete LSN.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)))
+	{
+		toptxn->last_complete_lsn = change->lsn;
+
+		/*
+		 * If the transaction is serialized and the changes in the top-level
+		 * transaction are complete, stream it immediately.  We do not wait
+		 * for the memory limit to fill up again because, in streaming mode,
+		 * a serialized transaction means we already hit the memory limit but
+		 * could not stream it then due to an incomplete tuple; so stream it
+		 * as soon as the tuple is complete.  Also, if we did not stream the
+		 * serialized changes and later got more incomplete changes in this
+		 * transaction, we would have no way to partly truncate the
+		 * serialized changes.
+		 */
+		if (rbtxn_is_serialized(txn))
+			ReorderBufferStreamTXN(rb, toptxn);
+	}
+}
+
 /*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
+	Size	total_size = 0;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
@@ -665,9 +758,28 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * Get the total size of the top transaction before accounting for the
+	 * current change, so that if this change is incomplete we know the size
+	 * prior to it.  That is used to update the size of the complete changes
+	 * in the top transaction for streaming.
+	 */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			total_size = txn->toptxn->total_size;
+		else
+			total_size = txn->total_size;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* Handle the incomplete tuple, if streaming is enabled */
+	if (ReorderBufferCanStream(rb))
+		ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert,
+										   total_size);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -697,7 +809,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1407,11 +1519,45 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 /*
  * Discard changes from a transaction (and subtransactions), after streaming
  * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ * If partial_truncate is false we truncate the transaction completely,
+ * otherwise we truncate only up to last_complete_lsn.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 bool partial_truncate)
 {
 	dlist_mutable_iter iter;
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * A serialized transaction should never be partly truncated, because we
+	 * stream it as soon as its changes are complete.
+	 */
+	Assert(!(rbtxn_is_serialized(txn) && partial_truncate));
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1428,7 +1574,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, partial_truncate);
 	}
 
 	/* cleanup changes in the toplevel txn */
@@ -1438,30 +1584,19 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
+		/* We have truncated up to the last complete LSN, so stop. */
+		if (partial_truncate && (change->lsn > toptxn->last_complete_lsn))
+		{
+			/* The transaction must have incomplete changes. */
+			Assert(rbtxn_has_incomplete_tuple(toptxn));
+			break;
+		}
+
 		/* remove the change from it's containing list */
 		dlist_delete(&change->node);
-
 		ReorderBufferReturnChange(rb, change);
 	}
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The toplevel transaction, identified by (toptxn==NULL), is marked
-	 * as streamed always, even if it does not contain any changes (that
-	 * is, when all the changes are in subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts
-	 * for XIDs the downstream is not aware of. And of course, it always
-	 * knows about the toplevel xact (we send the XID in all messages),
-	 * but we never stream XIDs of empty subxacts.
-	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
 	 * any memory. We could also keep the hash table and update it with
@@ -1473,9 +1608,39 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
-	/* also reset the number of entries in the transaction */
-	txn->nentries_mem = 0;
-	txn->nentries = 0;
+	/*
+	 * Adjust nentries/nentries_mem based on the changes processed.  See
+	 * comments where nprocessed is declared.
+	 */
+	if (partial_truncate)
+	{
+		txn->nentries -= txn->nprocessed;
+		txn->nentries_mem -= txn->nprocessed;
+	}
+	else
+	{
+		txn->nentries = 0;
+		txn->nentries_mem = 0;
+	}
+	txn->nprocessed = 0;
+
+	/*
+	 * If this is a top-level transaction then we can reset last_complete_lsn
+	 * and complete_size, because by now we will have streamed all the
+	 * changes up to last_complete_lsn.
+	 */
+	if (partial_truncate && (txn->toptxn == NULL))
+	{
+		toptxn->last_complete_lsn = InvalidXLogRecPtr;
+		toptxn->complete_size = 0;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
 }
 
 /*
@@ -1762,7 +1927,7 @@ ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
 								   ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed. */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Stop the stream. */
 	rb->stream_stop(rb, txn, last_lsn);
@@ -1794,6 +1959,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 	ReorderBufferChange *volatile specinsert = NULL;
 	volatile bool	stream_started = false;
+	volatile bool	partial_truncate = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1816,6 +1983,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
+		ReorderBufferTXN *curtxn;
 
 		if (using_subtxn)
 			BeginInternalSubTransaction(streaming? "stream" : "replay");
@@ -1852,7 +2020,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			/* Set the xid for concurrent abort check. */
 			if (streaming)
-				SetupCheckXidLive(change->txn->xid);
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
 
 			switch (change->action)
 			{
@@ -2116,6 +2287,27 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
 			}
+
+			if (streaming)
+			{
+				/*
+				 * Increment the nprocessed count.  See the detailed comment
+				 * for usage of this in ReorderBufferTXN structure.
+				 */
+				curtxn->nprocessed++;
+
+				/*
+				 * If the transaction contains an incomplete tuple and this is
+				 * the last complete change, stop further processing of the
+				 * transaction and set the partial_truncate flag.
+				 */
+				if (rbtxn_has_incomplete_tuple(txn) &&
+					prev_lsn == txn->last_complete_lsn)
+				{
+					partial_truncate = true;
+					break;
+				}
+			}
 		}
 
 		/*
@@ -2135,7 +2327,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * Done with current changes, call stream_stop callback for streaming
-		 * transaction, commit callback otherwise.  If we have sent
+		 * transaction, commit callback otherwise, but only if we have sent
 		 * start/begin.
 		 */
 		if (stream_started)
@@ -2187,7 +2379,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, partial_truncate);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2524,7 +2716,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2573,7 +2765,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2596,6 +2788,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2610,8 +2803,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2619,12 +2817,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2685,7 +2891,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2872,18 +3078,28 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+	Size		largest_size = 0;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
+		Size	size = 0;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/*
+		 * If this transaction has some incomplete changes then only consider
+		 * the size up to the last complete LSN.
+		 */
+		if (rbtxn_has_incomplete_tuple(txn))
+			size = txn->complete_size;
+		else
+			size = txn->total_size;
+
+		/* If the current transaction is larger, remember it. */
+		if ((largest == NULL || size > largest_size) && size > 0)
+		{
 			largest = txn;
+			largest_size = size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2901,66 +3117,46 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 {
 	ReorderBufferTXN *txn;
 
-	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
-		return;
-
-	/*
-	 * Pick the largest transaction (or subtransaction) and evict it from
-	 * memory by streaming, if supported. Otherwise spill to disk.
-	 */
-	if (ReorderBufferCanStream(rb))
-	{
-		/*
-		 * Pick the largest toplevel transaction and evict it from memory by
-		 * streaming the already decoded part.
-		 */
-		txn = ReorderBufferLargestTopTXN(rb);
-
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn && !txn->toptxn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
-
-		ReorderBufferStreamTXN(rb, txn);
-	}
-	else
+	/* Loop until we are below the memory limit. */
+	while (rb->size >= logical_decoding_work_mem * 1024L)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* it must be a toplevel transaction with a nonzero total size */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		/* we know there has to be one, because the size is not zero */
-		Assert(txn);
-		Assert(txn->size > 0);
-		Assert(rb->size >= txn->size);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferSerializeTXN(rb, txn);
-	}
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-	/*
-	 * After eviction, the transaction should have no entries in memory, and
-	 * should use 0 bytes for changes.
-	 *
-	 * XXX Checking the size is fine for both cases - spill to disk and
-	 * streaming. But for streaming we should really check nentries_mem for
-	 * all subtransactions too.
-	 */
-	Assert(txn->size == 0);
-	Assert(txn->nentries_mem == 0);
+			ReorderBufferSerializeTXN(rb, txn);
+
+			/*
+			 * After eviction, the transaction should have no entries in
+			 * memory, and should use 0 bytes for changes.
+			 */
+			Assert(txn->size == 0);
+			Assert(txn->nentries_mem == 0);
+		}
+	}
 
-	/*
-	 * And furthermore, evicting the transaction should get us below the
-	 * memory limit again - it is not possible that we're still exceeding the
-	 * memory limit after evicting the transaction.
-	 *
-	 * This follows from the simple fact that the selected transaction is at
-	 * least as large as the most recent change (which caused us to go over
-	 * the memory limit). So by evicting it we're definitely back below the
-	 * memory limit.
-	 */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
 
@@ -3356,10 +3552,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
-
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
 }
 
 /*
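To make the accounting changes above easier to follow, here is a condensed,
standalone sketch of the invariant they establish. The structs are simplified
stand-ins rather than the real ReorderBuffer types, and the scenario is made
up; the update rule mirrors ReorderBufferChangeMemoryUpdate, and the asserts
spell out which counter drives which eviction path:

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Txn
{
	struct Txn *toptxn;			/* NULL for a toplevel transaction */
	size_t		size;			/* this (sub)xact's changes in memory */
	size_t		total_size;		/* toplevel only: includes all subxacts */
} Txn;

typedef struct
{
	size_t		size;			/* memory used by the whole reorder buffer */
} Buf;

/* Account a change in the (sub)transaction, the buffer, and the toplevel. */
static void
update_memory(Buf *rb, Txn *txn, size_t sz, bool addition)
{
	Txn		   *top = (txn->toptxn != NULL) ? txn->toptxn : txn;

	if (addition)
	{
		txn->size += sz;
		rb->size += sz;
		top->total_size += sz;
	}
	else
	{
		txn->size -= sz;
		rb->size -= sz;
		top->total_size -= sz;
	}
}

int
main(void)
{
	Buf			rb = {0};
	Txn			top = {NULL, 0, 0};
	Txn			sub = {&top, 0, 0};
	int			i;

	/* three 100-byte changes queued for the subtransaction */
	for (i = 0; i < 3; i++)
		update_memory(&rb, &sub, 100, true);

	assert(sub.size == 300);		/* drives ReorderBufferLargestTXN (spill) */
	assert(top.total_size == 300);	/* drives ReorderBufferLargestTopTXN (stream) */
	assert(rb.size == 300);			/* compared to logical_decoding_work_mem */
	return 0;
}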
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b3e2b3f64b..2d86209f61 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * This transaction's changes include a toast insert for which the main
+ * table insert has not arrived yet.
+ */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes include a speculative insert for which the
+ * speculative confirm has not arrived yet.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +221,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -350,6 +368,23 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of the top-level transaction including its sub-transactions. */
+	Size		total_size;
+
+	/* Size of the changes that are complete, i.e. up to last_complete_lsn. */
+	Size		complete_size;
+
+	/* LSN of the last complete change. */
+	XLogRecPtr	last_complete_lsn;
+
+	/*
+	 * Number of changes processed.  This is used to keep track of changes
+	 * that remain to be streamed.  As of now, this can happen either due to
+	 * toast tuples or speculative insertions, where we need to wait for
+	 * multiple changes before we can send them.
+	 */
+	uint64		nprocessed;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -537,7 +572,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool incomplete_data);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0
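As an aside on the incomplete-tuple tracking added to reorderbuffer.h above,
the following minimal standalone illustration may help. The flag values and
macro logic are copied from the header diff; the FakeTxn type and the
flag-clearing step are illustrative only (in the patch the flags are managed
by the reorderbuffer while queuing changes):

#include <assert.h>

#define RBTXN_HAS_TOAST_INSERT	0x0010
#define RBTXN_HAS_SPEC_INSERT	0x0020

typedef struct
{
	unsigned int txn_flags;
} FakeTxn;

#define rbtxn_has_toast_insert(txn) \
	(((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0)
#define rbtxn_has_spec_insert(txn) \
	(((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0)
#define rbtxn_has_incomplete_tuple(txn) \
	(rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn))

int
main(void)
{
	FakeTxn		txn = {0};

	/*
	 * A toast-table insert was decoded, but the main-table insert that
	 * completes the tuple has not arrived yet: the transaction is
	 * "incomplete", so only complete_size (the size up to the last
	 * complete LSN) may be considered when picking what to stream.
	 */
	txn.txn_flags |= RBTXN_HAS_TOAST_INSERT;
	assert(rbtxn_has_incomplete_tuple(&txn));

	/* Once the main-table insert arrives, the whole total_size becomes
	 * streamable again. */
	txn.txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
	assert(!rbtxn_has_incomplete_tuple(&txn));
	return 0;
}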

v26/v26-0008-Add-support-for-streaming-to-built-in-replicatio.patch

From ee8f1306aa752db052e78dce25761444c1ee44bf Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 14 May 2020 21:27:46 +0530
Subject: [PATCH v26 08/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying it on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   12 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1012 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 +++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 20 files changed, 2020 insertions(+), 40 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 1b8beadbaa..95b7c24ef9 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -164,8 +164,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..3349cc4bfc 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist>
      </para>
     </listitem>
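For illustration, a subscription opts into this with the new option, e.g.
CREATE SUBSCRIPTION mysub CONNECTION '...' PUBLICATION mypub
WITH (streaming = on), where the subscription and publication names are
hypothetical and the connection string is elided. Omitting the option keeps
the default behavior of decoding each transaction fully on the publisher
before sending it.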
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026187..9065a1be1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d7f99d9944..e843d1e658 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4139,6 +4139,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..83d0642cf3 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
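Taken together, a streamed transaction travels as one or more stream blocks
followed by a stream commit (or abort). The sketch below shows how an output
plugin could map its stream callbacks onto the writers above; it is
illustrative only, not a function from the patch, and assumes the relation,
tuple, and LSN values come from the usual logical decoding context (headers
as in proto.c):

static void
stream_one_batch_sketch(StringInfo out, ReorderBufferTXN *txn,
						Relation rel, HeapTuple tuple,
						XLogRecPtr commit_lsn, bool first_segment)
{
	TransactionId xid = txn->xid;

	logicalrep_write_stream_start(out, xid, first_segment);	/* 'S' */
	logicalrep_write_insert(out, xid, rel, tuple);			/* 'I', xid-prefixed */
	logicalrep_write_stream_stop(out);						/* 'E' */

	/* at commit time, after the last stream block: */
	logicalrep_write_stream_commit(out, txn, commit_lsn);	/* 'c' */

	/*
	 * Had the whole transaction aborted instead, the upstream would send
	 * logicalrep_write_stream_abort(out, xid, xid), i.e. 'A' with
	 * xid == subxid signalling a toplevel abort.
	 */
}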
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..d2d9469999 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * must deal with aborts of both the toplevel transaction and of
+ * subtransactions. This is achieved by tracking offsets for
+ * subtransactions, which are then used to truncate the file with
+ * serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.
+ * This is necessary so that different workers processing a remote transaction
+ * with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -100,6 +124,7 @@ typedef struct SlotErrCallbackArg
 } SlotErrCallbackArg;
 
 static MemoryContext ApplyMessageContext = NULL;
+static MemoryContext LogicalStreamingContext = NULL;
 MemoryContext ApplyContext = NULL;
 
 WalReceiverConn *wrconn = NULL;
@@ -110,12 +135,58 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId	xid;		/* XID of the subxact */
+	off_t			offset;		/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +258,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +659,326 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the existing subxact info.
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+	{
+		MemoryContext oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+
+		/* Read the subxacts info in per-stream context. */
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+		MemoryContextSwitchTo(oldctx);
+	}
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+	{
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		int			fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here so just cleanup the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
+		if (fd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+		}
+
+		/* OK, truncate the file at the right offset. */
+		if (ftruncate(fd, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+		CloseTransientFile(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the handle apply_dispatch methods are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +992,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1010,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1049,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1167,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1312,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1685,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1826,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1938,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1970,17 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode
+	 * is enabled.  This context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													 ALLOCSET_DEFAULT_SIZES);
+
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2429,529 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * The caller decides in which memory context this is allocated.
+	 * Ideally, during stream start it is allocated in the
+	 * LogicalStreamingContext, which is reset on stream stop, while
+	 * during stream abort the memory is needed only short-term, so it
+	 * is allocated in ApplyMessageContext.
+	 */
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd >= 0);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as non-error.
+ *
+ * missing_ok - don't report an error for a missing file if the flag is true.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		/* Need to allocate this in permanent context */
+		oldcxt = MemoryContextSwitchTo(ApplyContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3117,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
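To summarize the spool-file format used above: stream_write_change emits, for
each change, an int length, a one-byte action code, and the message payload
with the leading subxact XID already stripped; apply_handle_stream_commit
reads the records back and feeds them to apply_dispatch. A minimal standalone
reader for that layout might look like this (error handling is condensed and
the dispatch step is elided; a sketch, not code from the patch):

#include <stdio.h>
#include <stdlib.h>

static void
replay_spool_file_sketch(FILE *fp)
{
	int			len;

	/* record layout: [int len][1-byte action][len - 1 payload bytes] */
	while (fread(&len, sizeof(len), 1, fp) == 1)
	{
		char	   *buf = malloc(len);

		if (buf == NULL || fread(buf, 1, len, fp) != (size_t) len)
		{
			free(buf);
			break;				/* out of memory or truncated record */
		}

		/*
		 * buf[0] is the action ('I', 'U', 'D', 'R', ...); the remaining
		 * bytes are the protocol message minus the subxact XID. The real
		 * worker copies them into a StringInfo and calls apply_dispatch.
		 */
		free(buf);
	}
}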
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3118..1509f9b826 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream side is, however, updated only at
+ * commit time, and with streamed transactions the commit order may differ
+ * from the order the transactions are sent in.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this,
+ * we maintain a list of xids (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +373,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember XID of the (sub)transaction for the change. We don't care if
+	 * it's a top-level transaction or not (we have already sent that XID
+	 * at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may only be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +423,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +465,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +486,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +518,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +562,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +582,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +607,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +639,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +719,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
+ * Start a block of streamed changes for the given toplevel transaction.
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * Stop the currently streamed block of changes.
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +840,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions, so a
+ * simple linear search of the list is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid in the rel sync entry for which we have already sent the schema
+ * of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -753,11 +1002,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6fed3cfd23..e1344ab4cc 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -157,6 +157,12 @@ create_logical_replication_slot(char *name, char *plugin,
 											   .segment_close = wal_segment_close),
 									NULL, NULL, NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index adb7d7962e..9731b86d1f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1020,6 +1020,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..899d7e2013 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -980,7 +980,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ac1acbb27e..95132062c6 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
is($result, qq(1000|0), 'check rolled-back subtransactions were not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test behavior with streaming transaction exceeding logical_decoding_work_mem
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
is($result, qq(1000|500), 'check rollback to savepoint was reflected on subscriber');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

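Taken together, the new callbacks mean a large transaction no longer reaches
the subscriber as a single block at commit time, but as a series of
interleaved chunks. Roughly, the flow for one streamed transaction looks
like this (an illustrative pseudo-trace; the actual message codes are
defined by the protocol routines in logicalproto.c):

  stream_start  (xid, first_segment = true)
    relation/type messages          -- schema, tracked per streamed toplevel xid
    insert/update/delete/truncate   -- each tagged with its (sub)transaction xid
  stream_stop
  ... decoding continues until the memory limit is hit again ...
  stream_start  (xid, first_segment = false)
    more changes
  stream_stop
  stream_commit (xid)               -- or stream_abort (xid, subxid)

On the apply side, the worker writes these chunks into per-transaction files
(see stream_write_change above) and replays them only once stream_commit
arrives, which is also why maybe_send_schema has to track schema_sent per
streamed toplevel transaction rather than globally.
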
v26/v26-0002-Issue-individual-invalidations-with-wal_level-lo.patch

From 39038dc312020cc1f01dc25281e802366a0cf756 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 14 Apr 2020 11:11:37 +0530
Subject: [PATCH v26 02/12] Issue individual invalidations with
 wal_level=logical

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager.
See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulated all the invalidations in memory
and wrote them out only once, at commit time, which may reduce the
performance impact by amortizing the overhead and deduplicating the
invalidations.
---
 src/backend/access/rmgrdesc/xactdesc.c        |  40 +++++++
 src/backend/access/transam/xact.c             |   7 ++
 src/backend/replication/logical/decode.c      |  16 +++
 .../replication/logical/reorderbuffer.c       | 104 +++++++++++++++---
 src/backend/utils/cache/inval.c               |  49 +++++++++
 src/include/access/xact.h                     |  13 ++-
 src/include/replication/reorderbuffer.h       |  11 ++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce75565f..7ab0d11ea9 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3af8e81af1..e576b10055 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 122c581d0f..69c1f45ef6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -281,6 +281,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 		case XLOG_XACT_ASSIGNMENT:
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9509..b889edf461 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2202,6 +2216,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Queue the invalidation messages as a change of the given transaction,
+ * so that they get executed at the right point during replay.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2589,6 +2630,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3002,6 +3066,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..cba5b6c64b 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end
+ *	to support decoding of in-progress transactions.  So far it was enough to
+ *	log invalidations only at commit time, because we only decoded a
+ *	transaction once it had committed.  We only need to log the catalog cache
+ *	and relcache invalidations; there cannot be any active MVCC scan in
+ *	logical decoding, so we don't need to log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b3816c..b822c5e4b2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..af35287896 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.23.0

#343Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#335)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, May 28, 2020 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 28, 2020 at 12:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Isn't this problem only for the subxact file, as we anyway create the
changes file as part of the start stream message, which should have
come after the abort? If so, can't we detect whether the subxact file
exists, probably by using nsubxacts or something like that? Can you
please try to reproduce this scenario once to ensure that we are not
missing anything?

I have tested this; as of now, by default we create both changes and
subxact files irrespective of whether we get any subtransactions or
not. Maybe this could be optimized so that we create that file only
if we have any subxact, and otherwise not? What's your opinion on the
same?

Yeah, that makes sense.

8.
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * TOCHECK: Mark toplevel transaction as having catalog changes too
+ * if one of its children has.
+ */
+ if (txn->toptxn != NULL)
+ txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}

Why are we marking top transaction here?

We need to mark the top transaction to decide whether to build the
tuplecid hash or not. In non-streaming mode, we only send changes at
commit time, and at commit time we know whether the top transaction
has any catalog changes based on the invalidation messages, so we
mark the top transaction there in DecodeCommit. Since here we are not
waiting till commit, we need to mark the top transaction as soon as
we mark any of its child transactions.

But how does it help? We use this flag (via
ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn, which is
anyway done in DecodeCommit, and that too after setting this flag for
the top transaction if required. So, how will it help to set it while
processing the subxid? Also, even if we have to do it, won't it add
the xid needlessly to the builder->committed.xip array?

In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
to build the tuplecid hash or not based on whether it has catalog
changes or not.

Okay, but you haven't answered the second part of the question: "won't
it add the xid of top transaction needlessly in builder->committed.xip
array, see function SnapBuildCommitTxn?" IIUC, this can happen
without patch as well because DecodeCommit also sets the flags just
based on invalidation messages irrespective of whether the messages
are generated by top transaction or not, is that right?

Yes, with or without the patch it always adds the topxid. I think
the purpose of doing this, with or without the patch, is not for the
snapshot; instead we are marking the top transaction itself to record
that some of its subtxns have catalog changes, so that while building
the tuplecid hash we know whether to build the hash or not. But,
having said that, I feel in ReorderBufferBuildTupleCidHash why do we
need both of these checks:
if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;

I mean it should be enough to just have the following check, because
if we have added something to the tuplecids then catalog changes must
be there, because at that time we set catalog changes to true:

if (dlist_is_empty(&txn->tuplecids))
return;

I think in the base code there are multiple things going on:
1. If we get a new CID, we always set the catalog-changes flag on
that transaction but add the tuplecids to the top transaction. So,
basically, the top transaction is so far not marked with catalog
changes even though it has tuplecids.
2. Now, in DecodeCommit, the top xid will be marked as having catalog
changes based on the invalidation messages.

I don't think it is advisable to remove that check from base code
unless we have a strong reason for doing so. I think here you can
write better comments about why you are marking the flag for top
transaction and remove TOCHECK from the comment.

Done.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#344Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#330)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 26, 2020 at 4:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

2. There is a bug fix in handling the stream abort in 0008 (earlier it
was 0006).

The code changes look fine but it is not clear what was the exact
issue. Can you explain?

Basically, in the case of an empty subtransaction, we were reading
the subxacts info, but when we could not find the subxid in the
subxacts info we were not releasing the memory. So the next
subxact_info_read expects subxacts to have been freed, but we did not
free it in that !found case.

Okay, on looking at it again, the same code exists in
subxact_info_write as well. It is better to have a function for it.
Can we have a structure like SubXactContext for all the variables used
for subxact? As mentioned earlier I find the allocation/deallocation
of subxacts a bit ad-hoc, so there will always be a chance that we can
forget to free it. Having it allocated in memory context which we can
reset later might reduce that risk. One idea could be that we have a
special memory context for start and stop messages which can be used
to allocate the subxacts there. In case of commit/abort, we can allow
subxacts information to be allocated in ApplyMessageContext which is
reset at the end of each protocol message.

Changed as per this.
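
For illustration, here is a minimal sketch of the kind of structure
being discussed; the field names, and the SubXactInfo element type
holding an xid plus its offset in the changes file, are assumptions of
this sketch rather than the committed layout:

/* Hypothetical grouping of the worker's per-transaction subxact state. */
typedef struct SubXactInfo
{
	TransactionId xid;			/* sub-transaction id */
	off_t		offset;			/* start of its changes in the changes file */
} SubXactInfo;

typedef struct SubXactContext
{
	SubXactInfo *subxacts;		/* subxacts of the xact being applied */
	uint32		nsubxacts;		/* number of valid entries */
	uint32		nsubxacts_max;	/* allocated length of the array */
} SubXactContext;

Keeping these together makes it harder to forget to free one of them,
and a single pointer can be passed around or reset wholesale.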

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#345Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#329)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 25, 2020 at 8:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

4.
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
{
..

For this and other places in a patch like in function
stream_open_file(), instead of using TopMemoryContext, can we consider
using a new memory context LogicalStreamingContext or something like
that. We can create LogicalStreamingContext under TopMemoryContext. I
don't see any need of using TopMemoryContext here.

But when will we delete/reset the LogicalStreamingContext?

Why can't we reset it at each stream stop message?

Done this

because
we are planning to keep this memory as long as the worker is alive,
so it is supposed to be in the top memory context.

Which part of the allocation do we want to keep till the worker is
alive? Why do we need memory related to subxacts till the worker is
alive? As we have it now, after reading the subxact info
(subxact_info_read), we need to ensure that it is freed after its
usage, due to which we need to remember it and perform pfree at
various places.

I think we should explore the possibility of switching to this new
context in the start stream message and resetting it in the stop
stream message. That might help in avoiding
MemoryContextSwitchTo(TopMemoryContext) at various places.

If we create any other context
with the same life span as TopMemoryContext then what is the point?

It is helpful for debugging. It is recommended that we don't use the
top memory context unless it is really required. Read about it in
src/backend/utils/mmgr/README.

xids is now allocated in ApplyContext
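
In passing, here is a minimal sketch of the stream-context arrangement
discussed above; the function names and the choice of ApplyContext as
the parent are assumptions of this sketch, not the patch itself:

static MemoryContext LogicalStreamingContext = NULL;

static void
stream_start_internal(void)
{
	/*
	 * Created on first use, parented to ApplyContext rather than
	 * TopMemoryContext, so its lifetime is easier to reason about.
	 */
	if (LogicalStreamingContext == NULL)
		LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
														"LogicalStreamingContext",
														ALLOCSET_DEFAULT_SIZES);

	/* per-stream allocations (e.g. the subxacts array) happen here */
	MemoryContextSwitchTo(LogicalStreamingContext);
}

static void
stream_stop_internal(void)
{
	/* frees everything allocated since the matching stream start */
	MemoryContextSwitchTo(ApplyContext);
	MemoryContextReset(LogicalStreamingContext);
}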

8.
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)

Do we really need to have the checksum for temporary files? I have
checked a few other similar cases like SharedFileSet stuff for
parallel hash join but didn't find them using checksums. Can you also
once see other usages of temporary files and then let us decide if we
see any reason to have checksums for this?

Yeah, I can also see that a checksum is not used in other places.

So, unless someone speaks up before you are ready for the next version
of the patch, can we remove it?

Done

Another point is we don't seem to be doing this for the 'changes'
file, see stream_write_change. So, I am not sure there is any sense
in writing a checksum for the subxact file.

I can see this comment atop the function:

* XXX The subxact file includes CRC32C of the contents. Maybe we should
* include something like that here too, but doing so will not be as
* straightforward, because we write the file in chunks.

You can remove this comment as well. I don't know how advantageous it
is to checksum temporary files. We can anyway add it later if there
is a reason for doing so.

Done

12.
maybe_send_schema()
{
..
+ if (in_streaming)
+ {
+ /*
+ * TOCHECK: We have to send schema after each catalog change and it may
+ * occur when streaming already started, so we have to track new catalog
+ * changes somehow.
+ */
+ schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
..
..
}

I think it is good to once verify/test what this comment says but as
per code we should be sending the schema after each catalog change as
we invalidate the streamed_txns list in rel_sync_cache_relation_cb
which must be called during relcache invalidation. Do we see any
problem with that mechanism?

I have tested this; I think we are already sending the schema after
each catalog change.

Then remove "TOCHECK" in the above comment.

Done
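
For reference, a hedged sketch of the mechanism being verified here;
RelationSyncEntry and the helper live in pgoutput, but the trimmed
fields shown are assumptions of this sketch:

/* Trimmed, hypothetical view of the pgoutput relation cache entry. */
typedef struct RelationSyncEntry
{
	Oid			relid;			/* relation oid */
	bool		schema_sent;	/* for non-streamed transactions */
	List	   *streamed_txns;	/* xids of streamed xacts for which the
								 * schema was already sent */
} RelationSyncEntry;

static bool
get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
	return list_member_int(entry->streamed_txns, (int) xid);
}

Because rel_sync_cache_relation_cb clears streamed_txns on relcache
invalidation, the schema is resent to any in-progress stream after a
catalog change, which matches the behavior observed above.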

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#346Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#339)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, May 28, 2020 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, May 26, 2020 at 7:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Okay, sending again.

While reviewing/testing I have found a couple of problems in 0005 and
0006 which I have fixed in the attached version.

I haven't reviewed the new fixes yet but I have some comments on
0008-Add-support-for-streaming-to-built-in-replicatio.patch.
1.
I think the temporary files (and/or handles) used for storing the
information of changes and subxacts are getting leaked in the patch.
In some places care is taken to close the file, but in cases like
apply_handle_stream_commit, if any error occurs in apply_dispatch(),
the file might not get closed. Another place is
apply_handle_stream_abort(), where if there is an error in ftruncate
the file won't be closed. Now, the bigger problem is with the
changes-related file, which is opened in apply_handle_stream_start
and closed in apply_handle_stream_stop; if there is any error in
between, we won't close it.

OTOH, I think the worker will exit on an error, so it might not
matter; but then why are we closing the file before the error in a
few other places? I think on error these temporary files should be
removed, instead of relying on them getting removed the next time we
receive changes for the same transaction, which I feel is what we do
in other cases where we use temporary files, like for sorts or
hashjoins.

Also, what if the changes file size overflows "OS file size limit"?
If we agree that the above are problems then do you think we should
explore using BufFile interface (see storage/file/buffile.c) to avoid
all such problems?

I also think that the file size is a problem. I think we can use
BufFile with some modifications. We cannot use BufFileCreateTemp,
for a few reasons:
1) files get deleted on close, but we have to open/close on every
stream start/stop.
2) even if we try to avoid closing, we would need to remember the
BufFile pointers (each of which keeps a BLCKSZ-sized buffer), because
there is no option to pass the file name.

I think for our use case BufFileCreateShared is more suitable. I think
we need to do some modifications so that we can use these APIs without
a SharedFileSet. Otherwise, we would unnecessarily need to create a
SharedFileSet for each transaction and also maintain it in an xid
array or xid hash until transaction commit/abort. So I suggest the
following modifications to the shared fileset so that we can
conveniently use it:
1. ChooseTablespace(const SharedFileSet *fileset, const char *name)
if fileset is NULL then select DEFAULTTABLESPACE_OID
2. SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
If fileset is NULL then in the directory path we can use MyProcPid or
something instead of fileset->creator_pid.
3. Pass some parameter to BufFileOpenShared, so that it can open the
file in RW mode instead of read-only mode.
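
To make the proposal concrete, here is a rough sketch of the
signatures being suggested; this is hypothetical, not committed API,
and the NULL-fileset behaviour is exactly the relaxation described
above:

/* If fileset is NULL, fall back to the default tablespace. */
static Oid
ChooseTablespace(const SharedFileSet *fileset, const char *name);

/*
 * If fileset is NULL, build the directory path from MyProcPid instead
 * of fileset->creator_pid.
 */
static void
SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);

/*
 * New 'mode' argument (e.g. O_RDONLY or O_RDWR) so that callers can
 * open an existing file for writing, not just for reading.
 */
extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
								  int mode);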

2.
apply_handle_stream_abort()
{
..
+ /* discard the subxacts added later */
+ nsubxacts = subidx;
+
+ /* write the updated subxact list */
+ subxact_info_write(MyLogicalRepWorker->subid, xid);
..
}

Here, if subxacts becomes zero, then also subxact_info_write will
create a new file and write checksum.

How will it create a new file? In fact, it will write nsubxacts as 0
in the existing file, and I think we need to do that so that on the
next open we will know that nsubxacts is 0.

I think subxact_info_write

should have a check for nsubxacts > 0 before writing to the file.

But even if nsubxacts becomes 0, we want to write the file so that we
can overwrite the previous info.

3.
apply_handle_stream_commit(StringInfo s)
{
..
+ /*
+ * send feedback to upstream
+ *
+ * XXX Probably should send a valid LSN. But which one?
+ */
+ send_feedback(InvalidXLogRecPtr, false, false);
..
}

Why do we need to send the feedback at this stage, after applying
each message? In the non-streamed case, we never call send_feedback
after each message. So, following that, I don't see the need to send
it here, but if you see any specific reason then do let me know. And
if we have to send feedback, then we need to decide on the
appropriate values as well.

Let me put more thought into this and then I will get back to you.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#347Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#346)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, May 28, 2020 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Also, what if the changes file size overflows "OS file size limit"?
If we agree that the above are problems then do you think we should
explore using BufFile interface (see storage/file/buffile.c) to avoid
all such problems?

I also think that the file size is a problem. I think we can use
BufFile with some modifications. We cannot use BufFileCreateTemp,
for a few reasons:
1) files get deleted on close, but we have to open/close on every
stream start/stop.
2) even if we try to avoid closing, we would need to remember the
BufFile pointers (each of which keeps a BLCKSZ-sized buffer), because
there is no option to pass the file name.

I think for our use case BufFileCreateShared is more suitable. I think
we need to do some modifications so that we can use these APIs without
a SharedFileSet. Otherwise, we would unnecessarily need to create a
SharedFileSet for each transaction and also maintain it in an xid
array or xid hash until transaction commit/abort. So I suggest the
following modifications to the shared fileset so that we can
conveniently use it:
1. ChooseTablespace(const SharedFileSet *fileset, const char *name)
if fileset is NULL then select DEFAULTTABLESPACE_OID
2. SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
If fileset is NULL then in the directory path we can use MyProcPid or
something instead of fileset->creator_pid.

Hmm, I find these modifications a bit ad-hoc. So, I am not sure this
is better than having the patch maintain the sharedfileset
information.

3. Pass some parameter to BufFileOpenShared, so that it can open the
file in RW mode instead of read-only mode.

This seems okay.

2.
apply_handle_stream_abort()
{
..
+ /* discard the subxacts added later */
+ nsubxacts = subidx;
+
+ /* write the updated subxact list */
+ subxact_info_write(MyLogicalRepWorker->subid, xid);
..
}

Here, if subxacts becomes zero, then also subxact_info_write will
create a new file and write checksum.

How will it create a new file? In fact, it will write nsubxacts as 0
in the existing file, and I think we need to do that so that on the
next open we will know that nsubxacts is 0.

I think subxact_info_write

should have a check for nsubxacts > 0 before writing to the file.

But even if nsubxacts becomes 0, we want to write the file so that we
can overwrite the previous info.

Can't we just remove the file for such a case?

apply_handle_stream_abort()
{
..
+ /* XXX optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
+
+ /*
+ * If it's an empty sub-transaction then we will not find the subxid
+ * here so just free the memory and return.
+ */
+ if (!found)
+ {
+ /* Free the subxacts memory */
+ if (subxacts)
+ pfree(subxacts);
+
+ subxacts = NULL;
+ subxact_last = InvalidTransactionId;
+ nsubxacts = 0;
+ nsubxacts_max = 0;
+
+ return;
+ }
..
}

I have one question regarding the above code. Isn't it possible that
a particular subtransaction id doesn't have any change but others do
we have? For ex. cases like below:

postgres=# begin;
BEGIN
postgres=*# insert into t1 values(1);
INSERT 0 1
postgres=*# savepoint s1;
SAVEPOINT
postgres=*# savepoint s2;
SAVEPOINT
postgres=*# insert into t1 values(2);
INSERT 0 1
postgres=*# insert into t1 values(3);
INSERT 0 1
postgres=*# Rollback to savepoint s1;
ROLLBACK
postgres=*# commit;

Here, we have performed a rollback to savepoint s1, which doesn't
have any change of its own. I think this would have been handled, but
I just wanted to confirm.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#348Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#347)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, May 28, 2020 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Also, what if the changes file size overflows "OS file size limit"?
If we agree that the above are problems then do you think we should
explore using BufFile interface (see storage/file/buffile.c) to avoid
all such problems?

I also think that the file size is a problem. I think we can use
BufFile with some modifications. We cannot use BufFileCreateTemp,
for a few reasons:
1) files get deleted on close, but we have to open/close on every
stream start/stop.
2) even if we try to avoid closing, we would need to remember the
BufFile pointers (each of which keeps a BLCKSZ-sized buffer), because
there is no option to pass the file name.

I think for our use case BufFileCreateShared is more suitable. I think
we need to do some modifications so that we can use these APIs without
a SharedFileSet. Otherwise, we would unnecessarily need to create a
SharedFileSet for each transaction and also maintain it in an xid
array or xid hash until transaction commit/abort. So I suggest the
following modifications to the shared fileset so that we can
conveniently use it:
1. ChooseTablespace(const SharedFileSet *fileset, const char *name)
if fileset is NULL then select DEFAULTTABLESPACE_OID
2. SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
If fileset is NULL then in the directory path we can use MyProcPid or
something instead of fileset->creator_pid.

Hmm, I find these modifications a bit ad-hoc. So, I am not sure this
is better than having the patch maintain the sharedfileset
information.

I think we might do something better here, maybe by supplying a
function pointer or so, but maintaining a sharedfileset, which
contains a different tablespace/mutex that we don't need at all for
our purpose, also doesn't sound very appealing. Let me see if I
cannot come up with some clean way of avoiding the need for the
shared fileset; then maybe we can go with the shared fileset idea.

3. Pass some parameter to BufFileOpenShared, so that it can open the
file in RW mode instead of read-only mode.

This seems okay.

2.
apply_handle_stream_abort()
{
..
+ /* discard the subxacts added later */
+ nsubxacts = subidx;
+
+ /* write the updated subxact list */
+ subxact_info_write(MyLogicalRepWorker->subid, xid);
..
}

Here, if subxacts becomes zero, then also subxact_info_write will
create a new file and write checksum.

How will it create a new file? In fact, it will write nsubxacts as 0
in the existing file, and I think we need to do that so that on the
next open we will know that nsubxacts is 0.

I think subxact_info_write

should have a check for nsubxacts > 0 before writing to the file.

But even if nsubxacts becomes 0, we want to write the file so that we
can overwrite the previous info.

Can't we just remove the file for such a case?

But, as of now, we expect that if it is not a first-time stream start
then the file exists. Actually, currently it's very simple: if it is
not the first segment, we always expect that the file exists;
otherwise it's an error. Now, if it is not the first segment, we will
need to handle multiple cases:

a) subxact_info_read needs to handle the error case, because the file
may not exist, either because there was no subxact in the last stream
or because it was deleted when nsubxacts became 0.
b) subxact_info_write: there will be multiple cases; if nsubxacts was
already 0 then we can avoid writing the file, but if it becomes 0 now
we need to remove the file.

Let me think more on that.

apply_handle_stream_abort()
{
..
+ /* XXX optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
+
+ /*
+ * If it's an empty sub-transaction then we will not find the subxid
+ * here so just free the memory and return.
+ */
+ if (!found)
+ {
+ /* Free the subxacts memory */
+ if (subxacts)
+ pfree(subxacts);
+
+ subxacts = NULL;
+ subxact_last = InvalidTransactionId;
+ nsubxacts = 0;
+ nsubxacts_max = 0;
+
+ return;
+ }
..
}

I have one question regarding the above code. Isn't it possible that
a particular subtransaction id doesn't have any change but others do
we have? For ex. cases like below:

postgres=# begin;
BEGIN
postgres=*# insert into t1 values(1);
INSERT 0 1
postgres=*# savepoint s1;
SAVEPOINT
postgres=*# savepoint s2;
SAVEPOINT
postgres=*# insert into t1 values(2);
INSERT 0 1
postgres=*# insert into t1 values(3);
INSERT 0 1
postgres=*# Rollback to savepoint s1;
ROLLBACK
postgres=*# commit;

Here, we have performed a rollback to savepoint s1, which doesn't
have any change of its own. I think this would have been handled, but
I just wanted to confirm.

But internally, that will send an abort for s2 first, and for that we
will find the xid and truncate; later it will send an abort for s1,
which we will not find, so we will do nothing. Anyway, I will test it
and let you know.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#349Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#348)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think for our use case BufFileCreateShared is more suitable. I think
we need to do some modifications so that we can use these APIs without
a SharedFileSet. Otherwise, we would unnecessarily need to create a
SharedFileSet for each transaction and also maintain it in an xid
array or xid hash until transaction commit/abort. So I suggest the
following modifications to the shared fileset so that we can
conveniently use it:
1. ChooseTablespace(const SharedFileSet *fileset, const char *name)
if fileset is NULL then select DEFAULTTABLESPACE_OID
2. SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
If fileset is NULL then in the directory path we can use MyProcPid or
something instead of fileset->creator_pid.

Hmm, I find these modifications a bit ad-hoc. So, I am not sure this
is better than having the patch maintain the sharedfileset
information.

I think we might do something better here, maybe by supplying a
function pointer or so, but maintaining a sharedfileset, which
contains a different tablespace/mutex that we don't need at all for
our purpose, also doesn't sound very appealing.

I think we can say something similar for Relation (rel cache entry as
well) maintained in LogicalRepRelMapEntry. I think we only need a
pointer to that information.

Let me see if I cannot come up with some clean way of avoiding the
need for the shared fileset; then maybe we can go with the shared
fileset idea.

Fair enough.
..

But even if nsubxacts becomes 0, we want to write the file so that we
can overwrite the previous info.

Can't we just remove the file for such a case?

But, as of now, we expect that if it is not a first-time stream start
then the file exists.

Isn't it primarily because we do subxact_info_write in stop stream
which will create such a file irrespective of whether we have any
subxacts? If so, isn't that an unnecessary write?

Actually, currently it's very simple: if it is not
the first segment, we always expect that the file exists;
otherwise it's an error.

I think we can check: if the file doesn't exist, then we can
initialize nsubxacts as 0.

Now, if it is not the first segment, we will
need to handle multiple cases:

a) subxact_info_read needs to handle the error case, because the file
may not exist, either because there was no subxact in the last stream
or because it was deleted when nsubxacts became 0.
b) subxact_info_write: there will be multiple cases; if nsubxacts was
already 0 then we can avoid writing the file, but if it becomes 0 now
we need to remove the file.

Let me think more on that.

I feel we should be able to deal with these cases, but if you find
any difficulty then let us discuss. I understand there is some ease
in always having a subxacts file, but OTOH it sounds quite awkward
that we need so many file operations just to detect whether the
transaction has any subtransactions.
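
To make the alternative concrete, here is a hedged sketch of what the
remove-on-empty behaviour could look like; subxact_filename and the
nsubxacts variable follow the patch's naming, but the details here are
assumptions of this sketch, not the actual patch:

static void
subxact_info_write(Oid subid, TransactionId xid)
{
	char		path[MAXPGPATH];

	subxact_filename(path, subid, xid); /* assumed path-building helper */

	if (nsubxacts == 0)
	{
		/*
		 * Nothing to remember: remove any file left over from a previous
		 * stream instead of writing an empty one.
		 */
		if (unlink(path) < 0 && errno != ENOENT)
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not remove file \"%s\": %m", path)));
		return;
	}

	/* otherwise, write nsubxacts and the subxacts array as before */
}

Correspondingly, subxact_info_read would treat a missing file (ENOENT)
as nsubxacts = 0 rather than raising an error.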

Here, we have performed a rollback to savepoint s1, which doesn't
have any change of its own. I think this would have been handled, but
I just wanted to confirm.

But internally, that will send an abort for s2 first, and for that we
will find the xid and truncate; later it will send an abort for s1,
which we will not find, so we will do nothing. Anyway, I will test it
and let you know.

It would be good if we can test and confirm this behavior once. If it
is not very inconvenient then we can even try to include a test for
the same in the patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#350Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#342)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 29, 2020 at 8:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

The fixes in the latest patchset are correct. A few minor comments:
v26-0005-Implement-streaming-mode-in-ReorderBuffer
+ /*
+ * Mark toplevel transaction as having catalog changes too if one of its
+ * children has so that the ReorderBufferBuildTupleCidHash can conveniently
+ * check just toplevel transaction and decide whethe we need to build the
+ * hash table or not.  In non-streaming mode we mark the toplevel
+ * transaction in DecodeCommit as we only stream on commit.

Typo, /whethe/whether
missing comma, /In non-streaming mode we/In non-streaming mode, we

v26-0008-Add-support-for-streaming-to-built-in-replicatio
+ /*
+ * This memory context used for per stream data when streaming mode is
+ * enabled.  This context is reeset on each stream stop.
+ */

Can we slightly modify the above comment as "This is used in the
streaming mode for the changes between the start and stop stream
messages. We reset this context on the stream stop message."?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#351Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#349)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think for our use case BufFileCreateShared is more suitable. I think
we need to do some modifications so that we can use these APIs without
a SharedFileSet. Otherwise, we would unnecessarily need to create a
SharedFileSet for each transaction and also maintain it in an xid
array or xid hash until transaction commit/abort. So I suggest the
following modifications to the shared fileset so that we can
conveniently use it:
1. ChooseTablespace(const SharedFileSet *fileset, const char *name)
if fileset is NULL then select DEFAULTTABLESPACE_OID
2. SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
If fileset is NULL then in the directory path we can use MyProcPid or
something instead of fileset->creator_pid.

Hmm, I find these modifications a bit ad-hoc. So, I am not sure this
is better than having the patch maintain the sharedfileset
information.

I think we might do something better here, maybe by supplying a
function pointer or so, but maintaining a sharedfileset, which
contains a different tablespace/mutex that we don't need at all for
our purpose, also doesn't sound very appealing.

I think we can say something similar for Relation (rel cache entry as
well) maintained in LogicalRepRelMapEntry. I think we only need a
pointer to that information.

Yeah, I see.

Let me see if I cannot come up with some clean way of avoiding the
need for the shared fileset; then maybe we can go with the shared
fileset idea.

Fair enough.

While evaluating it further, I feel there are a few more problems to
solve if we are using BufFile. The first thing is that in the subxact
file we maintain the information of each xid and its offset in the
changes file. So now we will also have to store the 'fileno', but we
can find that using BufFileTell. Yet another problem is that
currently we don't have a truncate option in BufFile, but we need one
if the sub-transaction gets aborted. I think we can implement an
extra interface for BufFile, and it should not be very hard as we
already know the fileno and the offset (see the sketch below). I will
evaluate this part further and let you know about the same.
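
Here is a rough sketch of the interface being described; BufFileTell
is the existing API, while BufFileTruncateShared and the SubXactInfo
fields are assumptions at this point, not committed code:

/* Proposed addition, not an existing BufFile API. */
extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);

/* When a subxact starts, remember where its changes begin: */
BufFileTell(changes_file, &subxact->fileno, &subxact->offset);

/*
 * On rollback of that subxact, discard everything written after that
 * point; files beyond 'fileno' would be removed and the file count
 * and current offset adjusted accordingly:
 */
BufFileTruncateShared(changes_file, subxact->fileno, subxact->offset);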

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#352Mahendra Singh Thalor
mahi6run@gmail.com
In reply to: Amit Kapila (#341)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, 29 May 2020 at 15:52, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 27, 2020 at 5:19 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:

On Tue, 26 May 2020 at 16:46, Amit Kapila <amit.kapila16@gmail.com> wrote:

Hi all,
On top of the v16 patch set [1], I did some testing of DDLs and DMLs
to measure WAL size and performance. Below is the testing summary:

Test parameters:
wal_level = 'logical'
max_connections = '150'
wal_receiver_timeout = '600s'
max_wal_size = '2GB'
min_wal_size = '2GB'
autovacuum = 'off'
checkpoint_timeout = '1d'

Test results:

(LSN diff in bytes, time in seconds; % LSN change = with patch vs. without)

                             CREATE index           Add col int(date)      Add col text
SN. Operation                LSN diff  time         LSN diff  time         LSN diff  time

1.  1 DDL      without patch 17728     0.89116      976       0.764393     33904     0.80044
               with patch    18016     0.804868     1088      0.763602     34856     0.787108
               % LSN change  1.624548               11.475409              2.80792

2.  2 DDL      without patch 19872     0.860348     1632      0.763199     34560     0.806086
               with patch    20416     0.839065     1856      0.733147     35624     0.829281
               % LSN change  2.73752                13.7254902             3.078703

3.  3 DDL      without patch 22016     0.894891     2288      0.776871     35216     0.803493
               with patch    22816     0.828028     2624      0.737177     36392     0.800194
               % LSN change  3.63372093             14.685314              3.339391186

4.  4 DDL      without patch 24160     0.901686     2944      0.768445     35872     0.77489
               with patch    25240     0.887143     3392      0.768382     37160     0.82777
               % LSN change  4.4701986              15.217391              3.590544

5.  5 DDL      without patch 26328     0.901686     3600      0.751879     36528     0.817928
               with patch    27640     0.914078     4160      0.74709      37928     0.820621
               % LSN change  4.9832877              15.555555              3.832676

6.  6 DDL      without patch 28472     0.936385     4256      0.745179     37184     0.797043
               with patch    30040     0.958226     4928      0.725321     38696     0.814535
               % LSN change  5.5071649              15.78947368            4.066265

7.  8 DDL      without patch 32760     1.0022203    5568      0.757468     38496     0.83207
               with patch    34864     0.966777     6464      0.769072     40232     0.903604
               % LSN change  6.422466               16.091954              4.509559

8.  11 DDL     without patch 50296     1.0022203    7536      0.748332     40464     0.822266
               with patch    53144     0.966777     8792      0.750553     42560     0.797133
               % LSN change  5.662478               16.666666              5.179913

9.  15 DDL     without patch 58896     1.267253     10184     0.776875     43112     0.821916
               with patch    62768     1.27234      11864     0.746844     45632     0.812567
               % LSN change  5.662478               16.496465              5.84524

10. 1 DDL & 3 DML
               without patch 18240     0.812551     1192      0.771993     34120     0.849467
               with patch    18536     0.819089     1312      0.785117     35080     0.855456
               % LSN change  1.6228                 10.067114              2.8113599

11. 3 DDL & 5 DML
               without patch 23656     0.926616     2656      0.758029     35584     0.829377
               with patch    24480     0.915517     3016      0.797206     36784     0.839176
               % LSN change  3.4832606              13.55421687            3.372302

12. 10 DDL & 5 DML
               without patch 52760     1.101005     7288      0.763065     40216     0.837843
               with patch    55376     1.105241     8456      0.779257     42224     0.835206
               % LSN change  4.958301744            16.02634468            4.993037

13. 10 DML     without patch 1008      0.791091     1008      0.81105      1008      0.78817
               with patch    1072      0.807875     1072      0.771113     1072      0.759789
               % LSN change  6.349206               6.349206               6.349206

To see all operations, please see [2] test_results.

Why are you seeing any additional WAL in case 13 (10 DML), where there
is no DDL? I think it is because you have used savepoints in that
case, which will add some additional WAL. You seem to have 9
savepoints in that test, which should ideally generate 36 bytes of
additional WAL (4 bytes per transaction id for each subtransaction).
Also, in the other cases where you took data for DDL and DML, you
have also used savepoints in those tests. I suggest for savepoints,
let's do separate tests as you have done in case 13, but with 3, 5,
7, and 10 savepoints, and probably each transaction can update a row
of 200 bytes or so.

Thanks, Amit, for reviewing the results.

Yes, you are correct. I used savepoints in the DML tests, so they
were showing additional WAL.

As suggested above, I did testing for DMLs, DDLs, and savepoints.
Below are the test results:

*Test results:*

(LSN diff in bytes, time in seconds; % LSN change = with patch vs. without)

                             CREATE index           Add col int(date)      Add col text
SN. Operation                LSN diff  time         LSN diff  time         LSN diff  time

1.  1 DDL      without patch 17728     0.89116      976       0.764393     33904     0.80044
               with patch    18016     0.804868     1088      0.763602     34856     0.787108
               % LSN change  1.624548               11.475409              2.80792

2.  2 DDL      without patch 19872     0.860348     1632      0.763199     34560     0.806086
               with patch    20416     0.839065     1856      0.733147     35624     0.829281
               % LSN change  2.73752                13.7254902             3.078703

3.  3 DDL      without patch 22016     0.894891     2288      0.776871     35216     0.803493
               with patch    22816     0.828028     2624      0.737177     36392     0.800194
               % LSN change  3.63372093             14.685314              3.339391186

4.  4 DDL      without patch 24160     0.901686     2944      0.768445     35872     0.77489
               with patch    25240     0.887143     3392      0.768382     37160     0.82777
               % LSN change  4.4701986              15.217391              3.590544

5.  5 DDL      without patch 26328     0.901686     3600      0.751879     36528     0.817928
               with patch    27640     0.914078     4160      0.74709      37928     0.820621
               % LSN change  4.9832877              15.555555              3.832676

6.  6 DDL      without patch 28472     0.936385     4256      0.745179     37184     0.797043
               with patch    30040     0.958226     4928      0.725321     38696     0.814535
               % LSN change  5.5071649              15.78947368            4.066265

7.  8 DDL      without patch 32760     1.0022203    5568      0.757468     38496     0.83207
               with patch    34864     0.966777     6464      0.769072     40232     0.903604
               % LSN change  6.422466               16.091954              4.509559

8.  11 DDL     without patch 50296     1.0022203    7536      0.748332     40464     0.822266
               with patch    53144     0.966777     8792      0.750553     42560     0.797133
               % LSN change  5.662478               16.666666              5.179913

9.  15 DDL     without patch 58896     1.267253     10184     0.776875     43112     0.821916
               with patch    62768     1.27234      11864     0.746844     45632     0.812567
               % LSN change  5.662478               16.496465              5.84524

10. 1 DDL & 3 DML
               without patch 18224     0.865753     1176      0.78074      34104     0.857664
               with patch    18512     0.854788     1288      0.767758     35056     0.877604
               % LSN change  1.58033362             9.523809               2.7914614

11. 3 DDL & 5 DML
               without patch 23632     0.954274     2632      0.785501     35560     0.87744
               with patch    24432     0.927245     2968      0.857528     36736     0.867555
               % LSN change  3.385203               12.765957              3.3070866

12. 3 DDL & 10 DML
               without patch 25088     0.941534     3040      0.812123     35968     0.877769
               with patch    25920     0.898643     3376      0.804943     37144     0.879752
               % LSN change  3.316326               11.052631              3.269579

13. 3 DDL & 15 DML
               without patch 26400     0.949599     3392      0.818491     36320     0.859353
               with patch    27232     0.892505     3728      0.789752     37320     0.812386
               % LSN change  3.151515               9.90566037             3.2378854

14. 5 DDL & 15 DML
               without patch 31904     0.994223     4704      0.838091     37632     0.867281
               with patch    33272     0.968122     5264      0.816922     39032     0.876364
               % LSN change  4.287863               11.904761              3.720238095

15. 1 DML      without patch 328       0.817988     (% LSN change 0)
               with patch    328       0.794927

16. 3 DML      without patch 464       0.791229     (% LSN change 0)
               with patch    464       0.806211

17. 5 DML      without patch 608       0.794258     (% LSN change 0)
               with patch    608       0.802001

18. 10 DML     without patch 968       0.831733     (% LSN change 0)
               with patch    968       0.852777

*Results for savepoints:*
(LSN diff in bytes, time in seconds)

SN. Operation       Variant        LSN diff  time       % LSN change

1.  1 savepoint     without patch  408       0.805615   1.960784
                    with patch     416       0.823121
    Script:
      begin;
      insert into perftest values (1);
      savepoint s1;
      update perftest set c1 = 5 where c1 = 1;
      commit;

2.  2 savepoints    without patch  488       0.827147   3.278688
                    with patch     504       0.819165
    Script:
      begin;
      insert into perftest values (1);
      savepoint s1;
      update perftest set c1 = 5 where c1 = 1;
      savepoint s2;
      update perftest set c1 = 6 where c1 = 5;
      commit;

3.  3 savepoints    without patch  560       0.806441   4.28571428
                    with patch     584       0.821316
    Script:
      begin;
      insert into perftest values (1);
      savepoint s1;
      update perftest set c1 = 2 where c1 = 1;
      savepoint s2;
      update perftest set c1 = 3 where c1 = 2;
      savepoint s3;
      update perftest set c1 = 4 where c1 = 3;
      commit;

4.  5 savepoints    without patch  712       0.823774   5.617977528
                    with patch     752       0.800037

5.  7 savepoints    without patch  864       0.829136   6.48148148
                    with patch     920       0.793751

6.  10 savepoints   without patch  1096      0.77946    7.29927007
                    with patch     1176      0.78711

To see all the operations (DDLs and DMLs), please see test_results:
https://docs.google.com/spreadsheets/d/1g11MrSd_I39505OnGoLFVslz3ykbZ1nmfR_gUiE_O9k/edit?usp=sharing

*Testing summary:*
Basically, we are writing a per-command invalidation message, and to
test that I tried different combinations of DDL and DML operations. I
have not observed any performance degradation with the patch. For
"create index" DDLs, the % change in WAL is 1-7% for 1-15 DDLs. For
"add col int/date" DDLs, it is 11-17% for 1-15 DDLs, and for "add col
text" DDLs, it is 2-6% for 1-15 DDLs. For mixed (DDL & DML), it is
2-10%.

As to why we are seeing 11-13% extra WAL there: the amount of extra
WAL is not very high in absolute terms, but the amount of WAL
generated by "add column int/date" is only ~1000 bytes, so an
additional ~100 bytes comes to around 10%; for "add column text" it
is ~35000 bytes (largely due to toast), so the percentage is smaller.

There is no change in WAL size for *DML operations*. For savepoints,
we see at most an 8-byte WAL increment per savepoint (for a
sub-transaction we add 5 bytes to store the xid, but due to padding
it becomes 8 bytes; sometimes, if the WAL is already aligned, we get
a 0-byte increment).

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

#353Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#342)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, May 29, 2020 at 8:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Apart from this, one more fix in 0005: basically, CheckLiveXid was
never reset, so I have fixed that as well.

I have made a number of modifications in the 0001 patch and attached
is the result.  I have changed/added comments, done some cosmetic
cleanup, and ran pgindent.  The most notable change is to remove the
below code change:
DecodeXactOp()
{
..
- * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+ * However, it's critical to process records with subxid assignment even
  * when the snapshot is being built: it is possible to get later records
  * that require subxids to be properly assigned.
  */
  if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
- info != XLOG_XACT_ASSIGNMENT)
+ !TransactionIdIsValid(XLogRecGetTopXid(r)))
..
}

I have not only removed the change done by the patch but the check
related to XLOG_XACT_ASSIGNMENT as well. That check was added by
commit bac2fae05c to ensure that we process XLOG_XACT_ASSIGNMENT even
if the snapshot state is not SNAPBUILD_FULL_SNAPSHOT. Now, with this
patch, that is not required because we are making the subtransaction
to top-level transaction association much earlier than this. I have
verified that it doesn't reopen the bug by running the test provided
in the original report [1].

Let me know what you think of the changes. If you find them okay,
then feel free to include them in the next patch-set.

[1]: /messages/by-id/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v27-0001-Immediately-WAL-log-subtransaction-and-top-level.patchapplication/octet-stream; name=v27-0001-Immediately-WAL-log-subtransaction-and-top-level.patchDownload
From f8239516407569e1e4b4c96507975f02dd9400ce Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v27] Immediately WAL-log subtransaction and top-level XID
 association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead).  However, we can not
remove the existing XLOG_XACT_ASSIGNMENT WAL as that is required
for avoiding overflow in the hot standby snapshot.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 44 ++++++++++++++--------------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd30b62..04fd5ca 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL log the top-level XID for
+ * operation in a subtransaction.  We require that for logical decoding, see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f..c526bb1 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5995798..560ec27 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1195,6 +1195,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1233,6 +1234,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..0c0c371 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe..05cc2b6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index d930fe9..24a4c44 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -308,6 +310,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

#354Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#351)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jun 4, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think for our use case BufFileCreateShared is more suitable. I think
we need to do some modifications so that we can use these APIs without
a SharedFileSet. Otherwise, we would unnecessarily need to create a
SharedFileSet for each transaction and also maintain it in an xid
array or xid hash until transaction commit/abort. So I suggest the
following modifications to the shared fileset so that we can
conveniently use it:
1. ChooseTablespace(const SharedFileSet *fileset, const char *name)
if fileset is NULL then select DEFAULTTABLESPACE_OID
2. SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
If fileset is NULL then in the directory path we can use MyProcPid or
something instead of fileset->creator_pid.

Hmm, I find these modifications a bit ad-hoc. So, I am not sure this
is better than having the patch maintain the sharedfileset
information.

I think we might do something better here, maybe by supplying a function
pointer or so, but maintaining a sharedfileset which contains a
tablespace/mutex which we don't need at all for our purpose also
doesn't sound very appealing.

I think we can say something similar for Relation (rel cache entry as
well) maintained in LogicalRepRelMapEntry. I think we only need a
pointer to that information.

Yeah, I see.

Let me see if I can come up with
some clean way of avoiding the need for a shared fileset; if not, maybe
we can go with the shared fileset idea.

Fair enough.

While evaluating it further I feel there are a few more problems to
solve if we are using BufFile. The first thing is that in the subxact
file we maintain the information of the xid and its offset in the
changes file. So now we will also have to store the 'fileno', but we
can find that using BufFileTell. Yet another problem is that currently
we don't have a truncate option in BufFile, but we need it if a
sub-transaction gets aborted. I think we can implement an extra
interface for BufFile, and it should not be very hard as we already
know the fileno and the offset. I will evaluate this part further and
let you know.
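
For reference, BufFileTell already exposes exactly the (fileno, offset)
pair that would need to be recorded in the subxact file, so capturing a
sub-transaction's start position could look roughly like the sketch
below (changes_file and subxact_info are hypothetical names, just for
illustration):

int		fileno;
off_t	offset;

/* remember where this sub-transaction's changes begin in the changes file */
BufFileTell(changes_file, &fileno, &offset);
subxact_info.fileno = fileno;
subxact_info.offset = offset;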

I have further evaluated this and also tested the concept with a POC
patch. Soon I will complete and share it; here is a sketch of the
idea.

As discussed, we will use SharedBufFile for the changes files and
subxact files. There will be a separate LogicalStreamingResourceOwner,
which will be used to manage the VFDs of the shared buf files. We can
create a per-stream resource owner, i.e. on stream start we will create
the resource owner and all the shared BufFiles will be opened under that
resource owner, which will be deleted on stream stop. We need to
remember the SharedFileSet so that for a subsequent stream of the same
transaction we can open the same file again; for this we will use a
hash table with the xid as the key, and in it we will keep the
stream_fileset and subxact_fileset pointers as the payload.

+typedef struct StreamXidHash
+{
+       TransactionId   xid;
+       SharedFileSet  *stream_fileset;
+       SharedFileSet  *subxact_fileset;
+} StreamXidHash;
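
To illustrate the lookup side, here is a rough sketch of how such a hash
table could be created and consulted with the standard dynahash API (the
xidhash variable and the use of ApplyContext are assumptions for
illustration, not taken from the patch):

static HTAB *xidhash = NULL;

HASHCTL		hash_ctl;
StreamXidHash *ent;
bool		found;

/* build the xid -> fileset map once, in a long-lived memory context */
memset(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(StreamXidHash);
hash_ctl.hcxt = ApplyContext;	/* assumed long-lived context */
xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
					  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

/* on stream start, look up (or create) the entry for this xid */
ent = (StreamXidHash *) hash_search(xidhash, &xid, HASH_ENTER, &found);
if (!found)
{
	/* filesets are created lazily, on the first stream of this xid */
	ent->stream_fileset = NULL;
	ent->subxact_fileset = NULL;
}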

We have to make some extensions to the buffile module; some of them were
already discussed up-thread, but I am still listing them all down here:
- A new interface BufFileTruncateShared(BufFile *file, int fileno,
off_t offset), for truncating the subtransaction changes; if changes
are spread across multiple files, those files will be deleted and we
will adjust the file count and current offset accordingly in BufFile.
- In BufFileOpenShared, we will have to implement a mode so that we
can open in write mode as well; currently only read-only mode is
supported.
- In SharedFileSetInit, if dsm_segment is NULL then we will not
register the file deletion on on_dsm_detach.
- As usual, we will clean up the files on stream abort/commit, or on
worker exit.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#355Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#353)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jun 5, 2020 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 29, 2020 at 8:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Apart from this, there is one more fix in 0005: basically, CheckLiveXid
was never reset, so I have fixed that as well.

I have made a number of modifications in the 0001 patch and attached
is the result.  I have changed/added comments, done some cosmetic
cleanup, and ran pgindent.  The most notable change is to remove the
below code change:
DecodeXactOp()
{
..
- * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+ * However, it's critical to process records with subxid assignment even
* when the snapshot is being built: it is possible to get later records
* that require subxids to be properly assigned.
*/
if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
- info != XLOG_XACT_ASSIGNMENT)
+ !TransactionIdIsValid(XLogRecGetTopXid(r)))
..
}

I have not only removed the change done by the patch but the check
related to XLOG_XACT_ASSIGNMENT as well. That check has been added by
commit bac2fae05c to ensure that we process XLOG_XACT_ASSIGNMENT even
if snapshot state is not SNAPBUILD_FULL_SNAPSHOT. Now, with this
patch that is not required because we make the subtransaction to
top-level transaction association much earlier than this. I have verified
that it doesn't reopen the bug by running the test provided in the
original report [1].

Let me know what you think of the changes. If you find them okay,
then feel free to include them in the next patch-set.

[1] - /messages/by-id/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com

Thanks for the patch, I will review it and include it in my next version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#356Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#355)
3 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Jun 7, 2020 at 5:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Jun 5, 2020 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me know what you think of the changes. If you find them okay,
then feel free to include them in the next patch-set.

[1] - /messages/by-id/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com

Thanks for the patch, I will review it and include it in my next version.

Okay, I have done a review of
0002-Issue-individual-invalidations-with-wal_level-lo.patch and below
are my comments:

1. I don't think it is a good idea for logical decoding to process both
the new XLOG_XACT_INVALIDATIONS and the existing WAL records for
invalidations, like XLOG_INVALIDATIONS and what we do in DecodeCommit
(see the code in the check "if (parsed->nmsgs > 0)").  I think if that
is required for some particular reason then we should write detailed
comments about it.  I have tried some experiments to see if those are
really required:
a. After applying patch 0002, I tried commenting out the
processing of invalidations via DecodeCommit and found some regression
tests failing, but the reason for the failure was that we were not
setting RBTXN_HAS_CATALOG_CHANGES for the toptxn when a subtxn has
catalog changes; when I did that, all regression tests started
passing.  See the attached diff patch
(v27-0003-Incremental-patch-for-0002-to-test-removal-of-du) atop the
0002 patch.
b. The processing of invalidations for XLOG_INVALIDATIONS is added by
commit c6ff84b06a for xid-less transactions.  See
https://postgr.es/m/CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
to know why that has been added.  Now, after this patch we will
process the same invalidations via XLOG_XACT_INVALIDATIONS and
XLOG_INVALIDATIONS which doesn't seem warranted.  Also, the below
assertion will fail for xid-less transactions (try a 'create index
concurrently' statement):
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ Assert(TransactionIdIsValid(xid));

I feel we don't need the processing of XLOG_INVALIDATIONS in logical
decoding after this patch, but to prove that we first need to write a
test case which needs XLOG_INVALIDATIONS on HEAD, as commit
c6ff84b06a doesn't add one. I think we need two code paths for
XLOG_XACT_INVALIDATIONS: if it is for a xid-less transaction, then
execute the actions immediately as we do when processing
XLOG_INVALIDATIONS; otherwise, do what we are doing currently in the
patch. If the above point (b) is correct, I am not sure if it is a
good idea to use RM_XACT_ID as the resource manager for this WAL in
LogLogicalInvalidations; what do you think?

I think one of the usages we still need is in ReorderBufferForget,
because it can be called when we skip processing the txn. See the
comments in DecodeCommit where we call this function. If I am
correct, we probably need to collect all invalidations in
ReorderBufferTXN, as we are collecting tuplecids, and use them here. We
can do that during the processing of XLOG_XACT_INVALIDATIONS.
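
For illustration, a minimal sketch of what ReorderBufferForget could
then do with invalidations collected in ReorderBufferTXN (the placement
is a hypothetical assumption; ReorderBufferImmediateInvalidation is the
existing helper declared in reorderbuffer.h):

/* in ReorderBufferForget(), for a transaction we skip processing */
if (txn->ninvalidations > 0)
	ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
									   txn->invalidations);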

I had also thought a bit about removing the logging of invalidations at
commit time altogether, but it seems hot-standby processing is somewhat
tightly coupled with the existing WAL logging. See xact_redo_commit (the
comment atop the call to ProcessCommittedInvalidationMessages). It says
we need to maintain the order when we process invalidations. If we
can later find a way to avoid that, we can probably remove it, but for
now maybe we can live with it.

2.
+ /* not expected, but print something anyway */
+ else if (msg->id == SHAREDINVALSMGR_ID)
+ appendStringInfoString(buf, " smgr");
+ /* not expected, but print something anyway */
+ else if (msg->id == SHAREDINVALRELMAP_ID)

I think the above comment is not valid after we started logging at CCI.

3.
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ Assert(TransactionIdIsValid(xid));
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Here, it should check !ctx->forward as we do in DecodeCommit, do we
have any reason for not doing so. We can test once by changing this.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v27-0001-Immediately-WAL-log-subtransaction-and-top-level.patchapplication/octet-stream; name=v27-0001-Immediately-WAL-log-subtransaction-and-top-level.patchDownload
From ea84ce8275d349c81f514e47d7f0a9ffd5a54cae Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v27 1/3] Immediately WAL-log subtransaction and top-level XID
 association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead) only when wal_level=logical.
We can not remove the existing XLOG_XACT_ASSIGNMENT WAL as that is
required for avoiding overflow in the hot standby snapshot.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 44 ++++++++++++++--------------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd30b62..04fd5ca 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL log the top-level XID for
+ * operation in a subtransaction.  We require that for logical decoding, see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f..c526bb1 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5995798..560ec27 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1195,6 +1195,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1233,6 +1234,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..0c0c371 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7ee04ba..8645b38 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe..05cc2b6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index d930fe9..24a4c44 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -308,6 +310,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

v27-0002-Issue-individual-invalidations-with-wal_level-lo.patchapplication/octet-stream; name=v27-0002-Issue-individual-invalidations-with-wal_level-lo.patchDownload
From 71e8c5ac3d17c5757d7b0f13cd5fcbf6aa7693a6 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v27 2/3] Issue individual invalidations with
 wal_level=logical.

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations was accumulating all the invalidations in
memory, and then only wrote them once at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          |  40 +++++++++
 src/backend/access/transam/xact.c               |   7 ++
 src/backend/replication/logical/decode.c        |  16 ++++
 src/backend/replication/logical/reorderbuffer.c | 104 +++++++++++++++++++++---
 src/backend/utils/cache/inval.c                 |  49 +++++++++++
 src/include/access/xact.h                       |  13 ++-
 src/include/replication/reorderbuffer.h         |  11 +++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..7ab0d11 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 04fd5ca..72efa3c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..a1d8745 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -282,6 +282,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 * See LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+
+				ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4594cf9..b889edf 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2204,6 +2218,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Setup the invalidation of the toplevel transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
  *
  * This needs to be done before ReorderBufferCommit is called!
  */
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2591,6 +2632,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -2736,6 +2795,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3004,6 +3068,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	oldsnap;
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..cba5b6c 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transaction.  As of now it was
+ *	enough to log invalidation only at commit because we are only decoding the
+ *	transaction at the commit time.   We only need to log the catalog cache and
+ *	relcache invalidation.  There can not be any active MVCC scan in logical
+ *	decoding so we don't need to log the snapshot invalidation.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations()
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs  = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8645b38..b822c5e 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..af35287 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * message */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
1.8.3.1

v27-0003-Incremental-patch-for-0002-to-test-removal-of-du.patchapplication/octet-stream; name=v27-0003-Incremental-patch-for-0002-to-test-removal-of-du.patchDownload
From 5418ba6709e2a5f25f47dcc6c06ed4646a1b42af Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Mon, 8 Jun 2020 11:36:28 +0530
Subject: [PATCH v27 3/3] Incremental patch for 0002 to test removal of
 duplicate invalidation processing.

---
 src/backend/replication/logical/decode.c        |  4 ++--
 src/backend/replication/logical/reorderbuffer.c | 22 +++++++++++++++++-----
 src/include/replication/reorderbuffer.h         |  3 +++
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index a1d8745..7f1385c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -596,10 +596,10 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 */
 	if (parsed->nmsgs > 0)
 	{
-		if (!ctx->fast_forward)
+		/*if (!ctx->fast_forward)
 			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
 										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);*/
 	}
 
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b889edf..42067f1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -864,6 +864,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->txn_flags |= RBTXN_IS_SUBXACT;
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
+        
+        /* set the reference to top-level transaction */
+        subtxn->toptxn = txn;
 
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
@@ -1878,8 +1881,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(txn->ninvalidations,
-										  txn->invalidations);
+		/*ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);*/
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1905,8 +1908,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(txn->ninvalidations,
-										  txn->invalidations);
+		/*ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);*/
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2237,7 +2240,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 						   sizeof(SharedInvalidationMessage) * nmsgs);
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
-
+        
 	ReorderBufferQueueChange(rb, xid, lsn, change);
 
 	MemoryContextSwitchTo(oldcontext);
@@ -2295,6 +2298,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+        
+        /*
+	 * Mark top-level transaction as having catalog changes too if one of its
+	 * children has so that the ReorderBufferBuildTupleCidHash can conveniently
+	 * check just top-level transaction and decide whether to build the hash
+	 * table or not.
+	 */
+        if (txn->toptxn != NULL)
+            txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287..e582ceb 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -228,6 +228,9 @@ typedef struct ReorderBufferTXN
 	 * LSN pointing to the end of the commit record + 1.
 	 */
 	XLogRecPtr	end_lsn;
+        
+        /* Toplevel transaction for this subxact (NULL for top-level). */
+        struct ReorderBufferTXN *toptxn;
 
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
-- 
1.8.3.1

#357Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#356)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think one of the usages we still need is in ReorderBufferForget,
because it can be called when we skip processing the txn. See the
comments in DecodeCommit where we call this function. If I am
correct, we probably need to collect all invalidations in
ReorderBufferTXN, as we are collecting tuplecids, and use them here. We
can do that during the processing of XLOG_XACT_INVALIDATIONS.

One more point related to this is that after this patch series, we
need to consider executing all invalidations during transaction abort,
because it is possible that, due to memory overflow, we have processed
some of the messages, which may also contain a few XACT_INVALIDATION
messages; so, to avoid cache pollution, we need to execute all of them
on abort. We also do a similar thing in Rollback/Rollback To
Savepoint; see AtEOXact_Inval and AtEOSubXact_Inval.
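
For illustration, a minimal sketch of what that could look like in the
reorderbuffer abort path, reusing the invalidations collected in
ReorderBufferTXN by the 0002 patch (the exact placement is a
hypothetical assumption):

/*
 * On abort of a partially processed (streamed) transaction, execute all
 * invalidations collected so far to avoid cache pollution, analogous to
 * what AtEOXact_Inval does for a regular transaction abort.
 */
if (txn->ninvalidations > 0)
	ReorderBufferExecuteInvalidations(txn->ninvalidations,
									  txn->invalidations);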

Few other comments on
0002-Issue-individual-invalidations-with-wal_level-lo.patch
---------------------------------------------------------------------------------------------------------------
1.
+ if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+ {
+ ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+ MakeSharedInvalidMessagesArray);
+ invalMessages = SharedInvalidMessagesArray;
+ nmsgs  = numSharedInvalidMessagesArray;
+ SharedInvalidMessagesArray = NULL;
+ numSharedInvalidMessagesArray = 0;

a. Immediately after ProcessInvalidationMessagesMulti, isn't it better
to have an Assertion like Assert(!(numSharedInvalidMessagesArray > 0
&& SharedInvalidMessagesArray == NULL));?
b. Why is the check "if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)"
required? If you see xactGetCommittedInvalidationMessages, where we do
something similar, we only check for a valid value of transInvalInfo,
and here we check the same in the caller of LogLogicalInvalidations;
isn't that sufficient? If that is sufficient, we can either have the
same check here or have an Assert for it.

2.
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
if (transInvalInfo == NULL)
return;

+ if (XLogLogicalInfoActive())
+ LogLogicalInvalidations();
+
ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
LocalExecuteInvalidationMessage);
Generally, we WAL-log the action after performing it, but here you are
writing WAL first. Is there any specific reason? If so, can we write
a comment about it?

3.
+ * When wal_level=logical, write invalidations into WAL at each command end to
+ * support the decoding of the in-progress transaction.  As of now it was
+ * enough to log invalidation only at commit because we are only decoding the
+ * transaction at the commit time.   We only need to log the catalog cache and
+ * relcache invalidation.  There can not be any active MVCC scan in logical
+ * decoding so we don't need to log the snapshot invalidation.

I think this comment doesn't hold good after we have changed the patch
to LOG invalidations at the time of CCI.

4.
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations()

Add the function name atop this function's comment to match the
style of the other nearby functions. How about modifying it to
something like: "Emit WAL for invalidations. This is currently only
used for logging invalidations at the command end."

5.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */

I don't think we need to do anything about relcacheInitFileInval.
This is used to remove the stale files (RELCACHE_INIT_FILENAME) that
have obsolete information about relcache. The walsender process that
is doing decoding doesn't require us to do anything about this. Also,
if you see before this patch, we don't do anything about relcache
files during decoding of invalidation messages. In short, I think we
can remove this comment unless you see some use of it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#358Amit Kapila
amit.kapila16@gmail.com
In reply to: Mahendra Singh Thalor (#352)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jun 4, 2020 at 5:06 PM Mahendra Singh Thalor <mahi6run@gmail.com>
wrote:

On Fri, 29 May 2020 at 15:52, Amit Kapila <amit.kapila16@gmail.com> wrote:

To see all the operations (DDL's and DML's), please see test_results:
<https://docs.google.com/spreadsheets/d/1g11MrSd_I39505OnGoLFVslz3ykbZ1nmfR_gUiE_O9k/edit?usp=sharing>

Testing summary:
Basically, we are writing a per-command invalidation message, and to
test that I have tested different combinations of DDL and DML
operations. I have not observed any performance degradation with the
patch. For "create index" DDL's, the % change in WAL is 1-7% for 1-15
DDL's. For "add col int/date" DDL's, it is 11-17% for 1-15 DDL's, and
for "add col text" DDL's, it is 2-6% for 1-15 DDL's. For mixed (DDL &
DML), it is 2-10%.

Why are we seeing 11-13% extra WAL? Basically, the amount of extra WAL
is not very high, but the WAL generated with "add column int/date" is
just ~1000 bytes, so an additional ~100 bytes comes to around 10%,
whereas for "add column text" it is ~35000 bytes, so the percentage is
smaller. For text, those ~35000 bytes are due to TOAST.
There is no change in WAL size for DML operations. For savepoints, we
see at most an 8-byte WAL increment per savepoint (basically, for a
sub-transaction we add 5 bytes to store the xid, but due to padding it
is 8 bytes, and sometimes, if the WAL is already aligned, we get a
0-byte increment).

So, if I read it correctly, there is no performance penalty with either
of the patches, but there is some additional WAL, which in most cases is
2-5% but in the worst cases, for some specific DDL's, is up to 15%. I
think as this WAL overhead occurs only when wal_level is logical, we
might have to live with it, as the other alternative is to blow up all
caches on any DDL in WALSenders, and that will have both CPU and network
overhead as explained previously [1]. I feel if the WAL overhead pinches
any workload, we might want to do it under some new GUC (which will
disable streaming of transactions), but I don't think we need to go
there.

What do you think?

[1]: /messages/by-id/CAA4eK1JaKW1mj4L6DPnk-V4vXJ6hM=Kcf6+-X+93Jk56UN+kGw@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#359Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#358)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 9, 2020 at 3:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 4, 2020 at 5:06 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:

On Fri, 29 May 2020 at 15:52, Amit Kapila <amit.kapila16@gmail.com> wrote:

To see all the operations (DDL's and DML's), please see test_results

Testing summary:
Basically, we are writing a per-command invalidation message, and to test that I have tested different combinations of DDL and DML operations. I have not observed any performance degradation with the patch. For "create index" DDL's, the % change in WAL is 1-7% for 1-15 DDL's. For "add col int/date" DDL's, it is 11-17% for 1-15 DDL's, and for "add col text" DDL's, it is 2-6% for 1-15 DDL's. For mixed (DDL & DML), it is 2-10%.

Why are we seeing 11-13% extra WAL? Basically, the amount of extra WAL is not very high, but the WAL generated with "add column int/date" is just ~1000 bytes, so an additional ~100 bytes comes to around 10%, whereas for "add column text" it is ~35000 bytes, so the percentage is smaller. For text, those ~35000 bytes are due to TOAST.
There is no change in WAL size for DML operations. For savepoints, we see at most an 8-byte WAL increment per savepoint (basically, for a sub-transaction we add 5 bytes to store the xid, but due to padding it is 8 bytes, and sometimes, if the WAL is already aligned, we get a 0-byte increment).

So, if I read it correctly, there is no performance penalty with either of the patches, but there is some additional WAL, which in most cases is 2-5% but in the worst cases, for some specific DDL's, is up to 15%. I think as this WAL overhead occurs only when wal_level is logical, we might have to live with it, as the other alternative is to blow up all caches on any DDL in WALSenders, and that will have both CPU and network overhead as explained previously [1]. I feel if the WAL overhead pinches any workload, we might want to do it under some new GUC (which will disable streaming of transactions), but I don't think we need to go there.

What do you think?

I feel the same, because the WAL overhead is only with wal_level=logical
and especially with DDL; ideally, there should not be a large amount
of DDL in the system compared to other operations. So I think we can
live with the current approach.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#360Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#354)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Jun 7, 2020 at 5:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jun 4, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think for our use case BufFileCreateShared is more suitable. I think
we need to do some modifications so that we can use these APIs without
a SharedFileSet. Otherwise, we would unnecessarily need to create a
SharedFileSet for each transaction and also need to maintain it in an
xid array or xid hash until transaction commit/abort. So I suggest the
following modifications to the shared fileset so that we can
conveniently use it:
1. ChooseTablespace(const SharedFileSet *fileset, const char *name)
if fileset is NULL then select DEFAULTTABLESPACE_OID
2. SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
If fileset is NULL then in the directory path we can use MyProcPid or
something instead of fileset->creator_pid.

Hmm, I find these modifications a bit ad-hoc. So, I am not sure if it is
better than having the patch maintain the sharedfileset information.

I think we might do something better here, maybe by supplying a function
pointer or so, but maintaining a sharedfileset which contains a
tablespace/mutex which we don't need at all for our purpose also
doesn't sound very appealing.

I think we can say something similar for Relation (rel cache entry as
well) maintained in LogicalRepRelMapEntry. I think we only need a
pointer to that information.

Yeah, I see.

Let me see if I can come up with
some clean way of avoiding the need for a shared fileset; if not, maybe
we can go with the shared fileset idea.

Fair enough.

While evaluating it further I feel there are a few more problems to
solve if we are using BufFile. The first thing is that in the subxact
file we maintain the information of the xid and its offset in the
changes file. So now we will also have to store the 'fileno', but we
can find that using BufFileTell. Yet another problem is that currently
we don't have a truncate option in BufFile, but we need it if a
sub-transaction gets aborted. I think we can implement an extra
interface for BufFile, and it should not be very hard as we already
know the fileno and the offset. I will evaluate this part further and
let you know.

I have further evaluated this and also tested the concept with a POC
patch. Soon I will complete and share it; here is a sketch of the
idea.

As discussed, we will use SharedBufFile for the changes files and
subxact files. There will be a separate LogicalStreamingResourceOwner,
which will be used to manage the VFDs of the shared buf files. We can
create a per-stream resource owner, i.e. on stream start we will create
the resource owner and all the shared BufFiles will be opened under that
resource owner, which will be deleted on stream stop. We need to
remember the SharedFileSet so that for a subsequent stream of the same
transaction we can open the same file again; for this we will use a
hash table with the xid as the key, and in it we will keep the
stream_fileset and subxact_fileset pointers as the payload.

+typedef struct StreamXidHash
+{
+       TransactionId   xid;
+       SharedFileSet  *stream_fileset;
+       SharedFileSet  *subxact_fileset;
+} StreamXidHash;

We have to make some extensions to the buffile module; some of them were
already discussed up-thread, but I am still listing them all down here:
- A new interface BufFileTruncateShared(BufFile *file, int fileno,
off_t offset), for truncating the subtransaction changes; if changes
are spread across multiple files, those files will be deleted and we
will adjust the file count and current offset accordingly in BufFile.
- In BufFileOpenShared, we will have to implement a mode so that we
can open in write mode as well; currently only read-only mode is
supported.
- In SharedFileSetInit, if dsm_segment is NULL then we will not
register the file deletion on on_dsm_detach.
- As usual, we will clean up the files on stream abort/commit, or on
worker exit.

Currently, I am done with a working prototype of using the BufFile
infrastructure for the tempfile. Meanwhile, I want to discuss a few
interface changes required for the BufFIle infrastructure.

1. Support read-write mode for "BufFileOpenShared", Basically, in
workers we will be opening the xid's changes and subxact files per
stream, so we need an RW mode even in the open. I have passed a flag
for the same.

2. Files should not be closed at the end of the transaction:
currently, files opened with BufFileCreateShared/BufFileOpenShared are
registered to be closed at EOXACT. Basically, we need to open the
changes file on stream start and keep it open until stream stop, so we
cannot afford to have it closed at EOXACT. I have added a flag for the
same.

3. As discussed above, we need to support truncate for handling the
subtransaction abort, so I have added a new interface for the same.

4. Every time we open the changes file we need to seek to the end, so
I have added support for SEEK_END (a combined usage sketch follows).
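
Putting 1, 2 and 4 together, a hypothetical caller in the worker would
reopen a transaction's changes file for appending roughly like this
('fileset' and 'path' assumed from the per-xid bookkeeping; flag-based
signatures as in the attached WIP patch):

	BufFile    *fd;

	/* Reopen in read-write mode and keep it open past EOXACT. */
	fd = BufFileOpenShared(fileset, path,
						   false,	/* eoxact_close */
						   false);	/* read_only */

	/* Append new changes after whatever earlier streams wrote. */
	if (BufFileSeek(fd, 0, 0, SEEK_END) != 0)
		elog(ERROR, "could not seek to the end of file \"%s\"", path);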

Attached is the WIP patch describing my changes.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

buffile_change.patch (application/octet-stream)
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 35e8f12e62..0399f11607 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -100,7 +100,7 @@ static void extendBufFile(BufFile *file);
 static void BufFileLoadBuffer(BufFile *file);
 static void BufFileDumpBuffer(BufFile *file);
 static int	BufFileFlush(BufFile *file);
-static File MakeNewSharedSegment(BufFile *file, int segment);
+static File MakeNewSharedSegment(BufFile *file, int segment, bool eoxact_close);
 
 /*
  * Create BufFile and perform the common initialization.
@@ -156,7 +156,7 @@ extendBufFile(BufFile *file)
 	if (file->fileset == NULL)
 		pfile = OpenTemporaryFile(file->isInterXact);
 	else
-		pfile = MakeNewSharedSegment(file, file->numFiles);
+		pfile = MakeNewSharedSegment(file, file->numFiles, true);
 
 	Assert(pfile >= 0);
 
@@ -219,7 +219,7 @@ SharedSegmentName(char *name, const char *buffile_name, int segment)
  * Create a new segment file backing a shared BufFile.
  */
 static File
-MakeNewSharedSegment(BufFile *buffile, int segment)
+MakeNewSharedSegment(BufFile *buffile, int segment, bool eoxact_close)
 {
 	char		name[MAXPGPATH];
 	File		file;
@@ -235,7 +235,7 @@ MakeNewSharedSegment(BufFile *buffile, int segment)
 
 	/* Create the new segment. */
 	SharedSegmentName(name, buffile->name, segment);
-	file = SharedFileSetCreate(buffile->fileset, name);
+	file = SharedFileSetCreate(buffile->fileset, name, eoxact_close);
 
 	/* SharedFileSetCreate would've errored out */
 	Assert(file > 0);
@@ -255,7 +255,8 @@ MakeNewSharedSegment(BufFile *buffile, int segment)
  * unrelated SharedFileSet objects.
  */
 BufFile *
-BufFileCreateShared(SharedFileSet *fileset, const char *name)
+BufFileCreateShared(SharedFileSet *fileset, const char *name,
+					bool eoxact_close)
 {
 	BufFile    *file;
 
@@ -263,7 +264,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 	file->files = (File *) palloc(sizeof(File));
-	file->files[0] = MakeNewSharedSegment(file, 0);
+	file->files[0] = MakeNewSharedSegment(file, 0, eoxact_close);
 	file->readOnly = false;
 
 	return file;
@@ -277,7 +278,8 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, bool eoxact_close,
+				  bool read_only)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +303,8 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, eoxact_close,
+										  read_only);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +324,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = read_only;	/* writable only if the caller asked for RW */
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -670,11 +673,14 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+			/*
+			 * Position at the last file, at an offset equal to that
+			 * file's size, i.e. at the end of the BufFile.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -843,3 +849,40 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the shared BufFile up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int newFile = file->numFiles;
+	off_t newOffset;
+	char segment_name[MAXPGPATH];
+	int i;
+
+	/* Loop backwards over the files, down to the fileno we truncate to. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Except for the fileno file, we can directly delete the other
+		 * files.  If the offset is 0 then we can delete the fileno file
+		 * as well, unless it is the first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			SharedFileSetDelete(file->fileset, segment_name, true);
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			FileTruncate(file->files[i], offset, WAIT_EVENT_BUFFILE_READ);
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 7dc6dd2f15..0fa98585f9 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1403,13 +1403,14 @@ ReportTemporaryFileUsage(const char *path, off_t size)
  * before the file was opened.
  */
 static void
-RegisterTemporaryFile(File file)
+RegisterTemporaryFile(File file, bool eoxact_close)
 {
 	ResourceOwnerRememberFile(CurrentResourceOwner, file);
 	VfdCache[file].resowner = CurrentResourceOwner;
 
 	/* Backup mechanism for closing at end of xact. */
-	VfdCache[file].fdstate |= FD_CLOSE_AT_EOXACT;
+	if (eoxact_close)
+		VfdCache[file].fdstate |= FD_CLOSE_AT_EOXACT;
 	have_xact_temporary_files = true;
 }
 
@@ -1616,7 +1617,7 @@ OpenTemporaryFile(bool interXact)
 
 	/* Register it with the current resource owner */
 	if (!interXact)
-		RegisterTemporaryFile(file);
+		RegisterTemporaryFile(file, true);
 
 	return file;
 }
@@ -1707,7 +1708,8 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
  * the prefix isn't needed.
  */
 File
-PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
+PathNameCreateTemporaryFile(const char *path, bool error_on_failure,
+							bool eoxact_close)
 {
 	File		file;
 
@@ -1733,7 +1735,7 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 	VfdCache[file].fdstate |= FD_TEMP_FILE_LIMIT;
 
 	/* Register it for automatic close. */
-	RegisterTemporaryFile(file);
+	RegisterTemporaryFile(file, eoxact_close);
 
 	return file;
 }
@@ -1741,18 +1743,22 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are read-only if the read_only flag is
+ * set, and are automatically closed at the end of the transaction if
+ * eoxact_close is set, but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, bool eoxact_close, bool read_only)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	/* We open the file read-only if instructed by the caller. */
+	if (read_only)
+		file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	else
+		file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
@@ -1764,7 +1770,7 @@ PathNameOpenTemporaryFile(const char *path)
 	if (file > 0)
 	{
 		/* Register it for automatic close. */
-		RegisterTemporaryFile(file);
+		RegisterTemporaryFile(file, eoxact_close);
 	}
 
 	return file;
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index f7206c9175..9b87bcad3d 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -68,7 +68,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
 }
 
 /*
@@ -102,13 +103,14 @@ SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg)
  * Create a new file in the given set.
  */
 File
-SharedFileSetCreate(SharedFileSet *fileset, const char *name)
+SharedFileSetCreate(SharedFileSet *fileset, const char *name,
+					bool eoxact_close)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameCreateTemporaryFile(path, false);
+	file = PathNameCreateTemporaryFile(path, false, eoxact_close);
 
 	/* If we failed, see if we need to create the directory on demand. */
 	if (file <= 0)
@@ -120,7 +122,7 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
 		TempTablespacePath(tempdirpath, tablespace);
 		SharedFileSetPath(filesetpath, fileset, tablespace);
 		PathNameCreateTemporaryDir(tempdirpath, filesetpath);
-		file = PathNameCreateTemporaryFile(path, true);
+		file = PathNameCreateTemporaryFile(path, true, eoxact_close);
 	}
 
 	return file;
@@ -131,13 +133,14 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, bool eoxact_close,
+				  bool read_only)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, eoxact_close, read_only);
 
 	return file;
 }
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 666a7c0e81..212a607e63 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -544,7 +544,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, true, true);
 		filesize = BufFileSize(file);
 
 		/*
@@ -701,7 +701,7 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
 		char		filename[MAXPGPATH];
 
 		pg_itoa(worker, filename);
-		lts->pfile = BufFileCreateShared(fileset, filename);
+		lts->pfile = BufFileCreateShared(fileset, filename, true);
 	}
 	else
 		lts->pfile = BufFileCreateTemp(false);
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index c3ab494a45..6ea52cb5a5 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -315,7 +315,8 @@ sts_puttuple(SharedTuplestoreAccessor *accessor, void *meta_data,
 
 		/* Create one.  Only this backend will write into it. */
 		sts_filename(name, accessor, accessor->participant);
-		accessor->write_file = BufFileCreateShared(accessor->fileset, name);
+		accessor->write_file = BufFileCreateShared(accessor->fileset, name,
+												   true);
 
 		/* Set up the shared state for this backend's file. */
 		participant = &accessor->sts->participants[accessor->participant];
@@ -563,7 +564,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, true, true);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index 60433f35b4..0f628dd2b8 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -46,9 +46,12 @@ extern int	BufFileSeekBlock(BufFile *file, long blknum);
 extern int64 BufFileSize(BufFile *file);
 extern long BufFileAppend(BufFile *target, BufFile *source);
 
-extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name,
+									bool eoxact_close);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  bool eoxact_close, bool read_only);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d7df..8c4e684f51 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -93,8 +93,8 @@ extern int	FileGetRawFlags(File file);
 extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
-extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure, bool eoxact_close);
+extern File PathNameOpenTemporaryFile(const char *path, bool eoxact_close, bool read_only);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf077e5..c661efaaaa 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -36,8 +36,10 @@ typedef struct SharedFileSet
 
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
-extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name,
+								bool eoxact_close);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  bool eoxact_close, bool read_only);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
#361Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#360)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jun 10, 2020 at 2:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I am now done with a working prototype that uses the BufFile
infrastructure for the tempfiles. Meanwhile, I want to discuss a few
interface changes required in the BufFile infrastructure.

1. Support read-write mode for "BufFileOpenShared": basically, in the
workers we will be opening the xid's changes and subxact files per
stream, so we need RW mode even on open. I have passed a flag for the
same.

Generally, file-open APIs have a mode parameter to indicate read-only
or read-write. Using a flag here seems a bit odd to me.
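
Something along the lines of this hypothetical declaration (which is,
in fact, what the v27-0014 patch further down assumes when it calls
BufFileOpenShared(ent->stream_fileset, path, O_RDWR)):

extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
								  int mode);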

2. Files should not be closed at the end of the transaction:
currently, files opened with BufFileCreateShared/BufFileOpenShared are
registered to be closed at EOXACT. Basically, we need to open the
changes file on stream start and keep it open until stream stop, so we
cannot afford to have it closed at EOXACT. I have added a flag for the
same.

But where do we end the transaction before the stream stop, such that
it could lead to closure of this file?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#362Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#361)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 10, 2020 at 2:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I am now done with a working prototype that uses the BufFile
infrastructure for the tempfiles. Meanwhile, I want to discuss a few
interface changes required in the BufFile infrastructure.

1. Support read-write mode for "BufFileOpenShared": basically, in the
workers we will be opening the xid's changes and subxact files per
stream, so we need RW mode even on open. I have passed a flag for the
same.

Generally, file-open APIs have a mode parameter to indicate read-only
or read-write. Using a flag here seems a bit odd to me.

Let me think about it; we can try to pass the mode.

2. Files should not be closed at the end of the transaction:
currently, files opened with BufFileCreateShared/BufFileOpenShared are
registered to be closed at EOXACT. Basically, we need to open the
changes file on stream start and keep it open until stream stop, so we
cannot afford to have it closed at EOXACT. I have added a flag for the
same.

But where do we end the transaction before the stream stop, such that
it could lead to closure of this file?

Currently, I am keeping the transaction open only while
creating/opening the files and committing immediately after that.
Maybe we can keep the transaction until stream stop; then we can avoid
these changes, and we can also avoid creating an extra resource owner.
What are your thoughts on this?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#363Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#362)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jun 10, 2020 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

2. Files should not be closed at the end of the transaction:
currently, files opened with BufFileCreateShared/BufFileOpenShared are
registered to be closed at EOXACT. Basically, we need to open the
changes file on stream start and keep it open until stream stop, so we
cannot afford to have it closed at EOXACT. I have added a flag for the
same.

But where do we end the transaction before the stream stop, such that
it could lead to closure of this file?

Currently, I am keeping the transaction open only while
creating/opening the files and committing immediately after that.
Maybe we can keep the transaction until stream stop; then we can avoid
these changes, and we can also avoid creating an extra resource owner.
What are your thoughts on this?

I would prefer to keep the transaction until the stream stop unless
there are good reasons for not doing so.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#364Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#363)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jun 10, 2020 at 5:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 10, 2020 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

2. Files should not be closed at the end of the transaction:
currently, files opened with BufFileCreateShared/BufFileOpenShared are
registered to be closed at EOXACT. Basically, we need to open the
changes file on stream start and keep it open until stream stop, so we
cannot afford to have it closed at EOXACT. I have added a flag for the
same.

But where do we end the transaction before the stream stop, such that
it could lead to closure of this file?

Currently, I am keeping the transaction open only while
creating/opening the files and committing immediately after that.
Maybe we can keep the transaction until stream stop; then we can avoid
these changes, and we can also avoid creating an extra resource owner.
What are your thoughts on this?

I would prefer to keep the transaction until the stream stop unless
there are good reasons for not doing so.

I am ready with the first patch set, which replaces the temp file usage
in the worker with buffile usage (patches v27-0013 and v27-0014).

Open items:
- As of now, I have kept the buffile changes and the worker changes
that use buffile as separate patches for review. Later I will make the
buffile changes patch the base patch and merge the worker changes into
the 0008 patch.

- Currently, while reading/writing the streaming/subxact files we are
reporting a wait event, for example
'pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);', but
BufFileWrite/BufFileRead internally report the read/write wait event,
so I think we can avoid reporting it (see the sketch after this list).
I still have to work on this part; once we get consensus I can remove
those extra wait events from the patch.

- There are still a few open comments from your other mails that I
have to address; I will work on those in the next version.
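
To illustrate the duplication with a sketch (assuming the worker's 'fd'
and 'nsubxacts', and that BufFileRead/BufFileWrite report their own
buffile wait events internally), the outer report here adds nothing:

	size_t		nbytes;

	/* The explicit report duplicates what BufFileRead does internally. */
	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
	nbytes = BufFileRead(fd, &nsubxacts, sizeof(nsubxacts));
	pgstat_report_wait_end();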

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v27.tar (application/x-tar)
v27/v27-0009-Enable-streaming-for-all-subscription-TAP-tests.patch:

From b10f2d435242fab190c64fcd2f1c4b88b37ad460 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v27 09/14] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318fc7c..6f7bedc130 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f871a..4ba80869b9 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

v27/v27-0014-Worker-tempfile-use-the-shared-buffile-infrastru.patch:

From df5b9ec348b2f9541cfe4017fd57da485b05fb64 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 16:42:07 +0530
Subject: [PATCH v27 14/14] Worker tempfile use the shared buffile
 infrastructure

To be merged with 0008; kept separate to make it easy for the
review.
---
 src/backend/replication/logical/worker.c | 540 +++++++++++------------
 1 file changed, 270 insertions(+), 270 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index d2d9469999..cdc8e4f9ab 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -56,6 +56,7 @@
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -85,6 +86,7 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -123,10 +125,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, create the streaming file, and store the fileset handle, so
+ * that on a subsequent stream for the same xid we can look up the entry in
+ * the hash and get the fileset handle.  The subxact file is only created if
+ * there is any subxact info under this xid.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;				/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;  /* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
-static MemoryContext LogicalStreamingContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -139,12 +157,23 @@ static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 bool	in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
-static int	stream_fd = -1;
+
+/*
+ * Hash table for storing the streaming xid information along with the shared
+ * file sets for the streaming and subxact files.  On every stream start we
+ * need to open the xid's files, and for that we need the shared file set
+ * handle, so storing it in the xid hash makes the lookup faster.
+ */
+static HTAB	*xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
 
 typedef struct SubXactInfo
 {
-	TransactionId xid;						/* XID of the subxact */
-	off_t           offset;					/* offset in the file */
+	TransactionId xid;					/* XID of the subxact */
+	int			fileno;					/* file number in the buffile */
+	off_t		offset;					/* offset in the file */
 } SubXactInfo;
 
 static uint32 nsubxacts = 0;
@@ -171,13 +200,6 @@ static void stream_open_file(Oid subid, TransactionId xid, bool first);
 static void stream_write_change(char action, StringInfo s);
 static void stream_close_file(void);
 
-/*
- * Array of serialized XIDs.
- */
-static int	nxids = 0;
-static int	maxnxids = 0;
-static TransactionId	*xids = NULL;
-
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
@@ -275,7 +297,7 @@ handle_streamed_transaction(const char action, StringInfo s)
 	if (!in_streamed_transaction)
 		return false;
 
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 	Assert(TransactionIdIsValid(stream_xid));
 
 	/*
@@ -666,31 +688,39 @@ static void
 apply_handle_stream_start(StringInfo s)
 {
 	bool		first_segment;
+	HASHCTL		hash_ctl;
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * Start a transaction on stream start; this transaction will be
+	 * committed on stream stop.  We need the transaction for handling
+	 * the buffile, used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
 	/* notify handle methods we're processing a remote transaction */
 	in_streamed_transaction = true;
 
 	/* extract XID of the top-level transaction */
 	stream_xid = logicalrep_read_stream_start(s, &first_segment);
 
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
 	/* open the spool file for this transaction */
 	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
 
-	/*
-	 * if this is not the first segment, open existing file
-	 *
-	 * XXX Note that the cleanup is performed by stream_open_file.
-	 */
+	/* if this is not the first segment, open existing file */
 	if (!first_segment)
-	{
-		MemoryContext oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
-
-		/* Read the subxacts info in per-stream context. */
 		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
-		MemoryContextSwitchTo(oldctx);
-	}
 
 	pgstat_report_activity(STATE_RUNNING, NULL);
 }
@@ -710,6 +740,9 @@ apply_handle_stream_stop(StringInfo s)
 	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
 	stream_close_file();
 
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
 	in_streamed_transaction = false;
 
 	/* Reset per-stream context */
@@ -736,10 +769,7 @@ apply_handle_stream_abort(StringInfo s)
 	 * just delete the files with serialized info.
 	 */
 	if (xid == subxid)
-	{
 		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
-		return;
-	}
 	else
 	{
 		/*
@@ -761,11 +791,13 @@ apply_handle_stream_abort(StringInfo s)
 
 		int64		i;
 		int64		subidx;
-		int			fd;
+		BufFile	   *fd;
 		bool		found = false;
 		char		path[MAXPGPATH];
+		StreamXidHash *ent;
 
 		subidx = -1;
+		ensure_transaction();
 		subxact_info_read(MyLogicalRepWorker->subid, xid);
 
 		/* XXX optimize the search by bsearch on sorted data */
@@ -787,33 +819,32 @@ apply_handle_stream_abort(StringInfo s)
 		{
 			/* Cleanup the subxact info */
 			cleanup_subxact_info();
+			CommitTransactionCommand();
 			return;
 		}
 
 		Assert((subidx >= 0) && (subidx < nsubxacts));
 
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
 		changes_filename(path, MyLogicalRepWorker->subid, xid);
-		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
-		if (fd < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not open file \"%s\": %m",
-							path)));
-		}
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
 
-		/* OK, truncate the file at the right offset. */
-		if (ftruncate(fd, subxacts[subidx].offset))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not truncate file \"%s\": %m", path)));
-		CloseTransientFile(fd);
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
 
 		/* discard the subxacts added later */
 		nsubxacts = subidx;
 
 		/* write the updated subxact list */
 		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
 	}
 }
 
@@ -823,16 +854,16 @@ apply_handle_stream_abort(StringInfo s)
 static void
 apply_handle_stream_commit(StringInfo s)
 {
-	int			fd;
 	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
-
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
+	bool		found;
 	LogicalRepCommitData commit_data;
-
-	MemoryContext oldcxt;
+	StreamXidHash  *ent;
+	MemoryContext	oldcxt;
+	BufFile	*fd;
 
 	Assert(!in_streamed_transaction);
 
@@ -840,25 +871,21 @@ apply_handle_stream_commit(StringInfo s)
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
 
-	/* open the spool file for the committed transaction */
-	changes_filename(path, MyLogicalRepWorker->subid, xid);
-
 	elog(DEBUG1, "replaying changes from file '%s'", path);
 
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-	}
-
 	ensure_transaction();
-
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	buffer = palloc(8192);
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
 	initStringInfo(&s2);
 
 	MemoryContextSwitchTo(oldcxt);
@@ -882,7 +909,7 @@ apply_handle_stream_commit(StringInfo s)
 
 		/* read length of the on-disk record */
 		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		nbytes = read(fd, &len, sizeof(len));
+		nbytes = BufFileRead(fd, &len, sizeof(len));
 		pgstat_report_wait_end();
 
 		/* have we reached end of the file? */
@@ -894,7 +921,7 @@ apply_handle_stream_commit(StringInfo s)
 		{
 			int			save_errno = errno;
 
-			CloseTransientFile(fd);
+			BufFileClose(fd);
 			errno = save_errno;
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -909,11 +936,11 @@ apply_handle_stream_commit(StringInfo s)
 
 		/* and finally read the data into the buffer */
 		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		if (read(fd, buffer, len) != len)
+		if (BufFileRead(fd, buffer, len) != len)
 		{
 			int			save_errno = errno;
 
-			CloseTransientFile(fd);
+			BufFileClose(fd);
 			errno = save_errno;
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -948,11 +975,7 @@ apply_handle_stream_commit(StringInfo s)
 		 */
 		send_feedback(InvalidXLogRecPtr, false, false);
 	}
-
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	BufFileClose(fd);
 
 	/*
 	 * Update origin state so we can restart streaming from correct
@@ -1946,12 +1969,39 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 static void
 worker_onexit(int code, Datum arg)
 {
-	int	i;
+	HASH_SEQ_STATUS status;
+	StreamXidHash   *ent;
+	char path[MAXPGPATH];
+
+	/* nothing to clean */
+	if (xidhash == NULL)
+		return;
+
+	/*
+	 * Scan the complete hash and delete the underlying files for the xids.
+	 * Also free the memory for the shared file sets.
+	 */
+	hash_seq_init(&status, xidhash);
+	while ((ent = (StreamXidHash *) hash_seq_search(&status)) != NULL)
+	{
+		changes_filename(path, MyLogicalRepWorker->subid, ent->xid);
+		BufFileDeleteShared(ent->stream_fileset, path);
+		pfree(ent->stream_fileset);
 
-	elog(LOG, "cleanup files for %d transactions", nxids);
+		/*
+		 * We might not have created the subxact fileset if there is no
+		 * subtransaction.
+		 */
+		if (ent->subxact_fileset)
+		{
+			subxact_filename(path, MyLogicalRepWorker->subid, ent->xid);
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+		}
+	}
 
-	for (i = nxids-1; i >= 0; i--)
-		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+	/* Remove the xid hash */
+	hash_destroy(xidhash);
 }
 
 /*
@@ -2085,7 +2135,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -2441,33 +2491,63 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 static void
 subxact_info_write(Oid subid, TransactionId xid)
 {
-	int			fd;
-	char		path[MAXPGPATH];
-	Size		len;
+	char path[MAXPGPATH];
+	bool found;
+	Size len;
+	StreamXidHash *ent;
+	BufFile *fd;
 
 	Assert(TransactionIdIsValid(xid));
 
 	subxact_filename(path, subid, xid);
 
-	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
-	if (fd < 0)
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there is no subtransaction then there is nothing to do, but if we
+	 * already have a subxact file then delete it.
+	 */
+	if (nsubxacts == 0)
 	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m",
-						path)));
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			ent->subxact_fileset = NULL;
+		}
 		return;
 	}
 
+	/*
+	 * Create the subxact file if it is not already created; otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		ent->subxact_fileset =
+				MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
 	len = sizeof(SubXactInfo) * nsubxacts;
 
 	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
 
-	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
 	{
-		int			save_errno = errno;
+		int save_errno = errno;
 
-		CloseTransientFile(fd);
+		BufFileClose(fd);
 		errno = save_errno;
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -2476,11 +2556,11 @@ subxact_info_write(Oid subid, TransactionId xid)
 		return;
 	}
 
-	if ((len > 0) && (write(fd, subxacts, len) != len))
+	if ((len > 0) && (BufFileWrite(fd, subxacts, len) != len))
 	{
-		int			save_errno = errno;
+		int save_errno = errno;
 
-		CloseTransientFile(fd);
+		BufFileClose(fd);
 		errno = save_errno;
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -2490,15 +2570,7 @@ subxact_info_write(Oid subid, TransactionId xid)
 	}
 
 	pgstat_report_wait_end();
-
-	/*
-	 * We don't need to fsync or anything, as we'll recreate the files after a
-	 * crash from scratch. So just close the file.
-	 */
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	BufFileClose(fd);
 
 	/*
 	 * But we free the memory allocated for subxact info. There might be one
@@ -2519,35 +2591,40 @@ subxact_info_write(Oid subid, TransactionId xid)
 static void
 subxact_info_read(Oid subid, TransactionId xid)
 {
-	int			fd;
 	char		path[MAXPGPATH];
+	bool		found;
 	Size		len;
+	BufFile		*fd;
+	StreamXidHash  *ent;
+	MemoryContext	oldctx;
 
 	Assert(TransactionIdIsValid(xid));
 	Assert(!subxacts);
 	Assert(nsubxacts == 0);
 	Assert(nsubxacts_max == 0);
 
-	subxact_filename(path, subid, xid);
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+									   (void *) &xid,
+									   HASH_FIND,
+									   &found);
 
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
+	/* If subxact_fileset is not valid, that means we don't have any subxact info */
+	if (ent->subxact_fileset == NULL)
 		return;
-	}
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
 
 	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
 
 	/* read number of subxact items */
-	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
 	{
 		int			save_errno = errno;
 
-		CloseTransientFile(fd);
+		BufFileClose(fd);
 		errno = save_errno;
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -2564,21 +2641,22 @@ subxact_info_read(Oid subid, TransactionId xid)
 	nsubxacts_max = 1 << my_log2(nsubxacts);
 
 	/*
-	 * Let the caller decide which memory context it will be allocated.
-	 * Ideally, during stream start it will be allocated in the
-	 * LogicalStreamingContext which will be reset on stream stop, and
-	 * during the stream abort we need this memory only for short term so
-	 * it will be allocated in ApplyMessageContext.
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need this information for the complete stream so that we can add new
+	 * subtransaction info to it.  On stream stop we will flush this
+	 * information to the subxact file and reset the logical streaming context.
 	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
 	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
 
 	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
 
-	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
 	{
 		int			save_errno = errno;
 
-		CloseTransientFile(fd);
+		BufFileClose(fd);
 		errno = save_errno;
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -2586,13 +2664,9 @@ subxact_info_read(Oid subid, TransactionId xid)
 						path)));
 		return;
 	}
-
 	pgstat_report_wait_end();
 
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	BufFileClose(fd);
 }
 
 /*
@@ -2606,7 +2680,7 @@ subxact_info_add(TransactionId xid)
 
 	/* We must have a valid top level stream xid and a stream fd. */
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd >= 0);
+	Assert(stream_fd != NULL);
 
 	/*
 	 * If the XID matches the toplevel transaction, we don't want to add it.
@@ -2658,7 +2732,13 @@ subxact_info_add(TransactionId xid)
 	}
 
 	subxacts[nsubxacts].xid = xid;
-	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	/*
+	 * Get the current offset of the stream file and store it as the offset
+	 * of this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
 
 	nsubxacts++;
 }
@@ -2667,44 +2747,14 @@ subxact_info_add(TransactionId xid)
 static void
 subxact_filename(char *path, Oid subid, TransactionId xid)
 {
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 */
-	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m",
-						tempdirpath)));
-
-	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
-			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
 }
 
 /* format filename for file containing serialized changes */
-static void
+static inline void
 changes_filename(char *path, Oid subid, TransactionId xid)
 {
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 */
-	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m",
-						tempdirpath)));
-
-	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
-			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
 }
 
 /*
@@ -2721,60 +2771,31 @@ changes_filename(char *path, Oid subid, TransactionId xid)
 static void
 stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
 {
-	int			i;
 	char		path[MAXPGPATH];
-	bool		found = false;
+	StreamXidHash	*ent;
 
-	subxact_filename(path, subid, xid);
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
 
-	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
+	/* No entry was created for this xid, so simply return. */
+	if (ent == NULL)
+		return;
 
+	/* Delete the change file and release the stream fileset memory */
 	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
 
-	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
-
-	/*
-	 * Cleanup the XID from the array - find the XID in the array and
-	 * remove it by shifting all the remaining elements. The array is
-	 * bound to be fairly small (maximum number of in-progress xacts,
-	 * so max_connections + max_prepared_transactions) so simply loop
-	 * through the array and find index of the XID. Then move the rest
-	 * of the array by one element to the left.
-	 *
-	 * Notice we also call this from stream_open_file for first segment
-	 * of each transaction, to deal with possible left-overs after a
-	 * crash, so it's entirely possible not to find the XID in the
-	 * array here. In that case we don't remove anything.
-	 *
-	 * XXX Perhaps it'd be better to handle this automatically after a
-	 * restart, instead of doing it over and over for each transaction.
-	 */
-	for (i = 0; i < nxids; i++)
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
 	{
-		if (xids[i] == xid)
-		{
-			found = true;
-			break;
-		}
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
 	}
-
-	if (!found)
-		return;
-
-	/*
-	 * Move the last entry from the array to the place. We don't keep
-	 * the streamed transactions sorted or anything - we only expect
-	 * a few of them in progress (max_connections + max_prepared_xacts)
-	 * so linear search is just fine.
-	 */
-	xids[i] = xids[nxids-1];
-	nxids--;
 }
 
 /*
@@ -2793,61 +2814,29 @@ static void
 stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 {
 	char		path[MAXPGPATH];
-	int			flags;
+	bool		found;
+	MemoryContext	oldcxt;
+	StreamXidHash   *ent;
 
 	Assert(in_streamed_transaction);
 	Assert(OidIsValid(subid));
 	Assert(TransactionIdIsValid(xid));
-	Assert(stream_fd == -1);
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
 
 	/*
-	 * If this is the first segment for this transaction, try removing
-	 * existing files (if there are any, possibly after a crash).
+	 * Create/open the BufFiles under the logical streaming context so that
+	 * they stay open until stream stop.
 	 */
-	if (first_segment)
-	{
-		MemoryContext	oldcxt;
-
-		/* XXX make sure there are no previous files for this transaction */
-		stream_cleanup_files(subid, xid, true);
-
-		/* Need to allocate this in permanent context */
-		oldcxt = MemoryContextSwitchTo(ApplyContext);
-
-		/*
-		 * We need to remember the XIDs we spilled to files, so that we can
-		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
-		 *
-		 * The number of XIDs we may need to track is fairly small, because
-		 * we can only stream toplevel xacts (so limited by max_connections
-		 * and max_prepared_transactions), and we only stream the large ones.
-		 * So we simply keep the XIDs in an unsorted array. If the number of
-		 * xacts gets large for some reason (e.g. very high max_connections),
-		 * a more elaborate approach might be better - e.g. sorted array, to
-		 * speed-up the lookups.
-		 */
-		if (nxids == maxnxids)	/* array of XIDs is full */
-		{
-			if (!xids)
-			{
-				maxnxids = 64;
-				xids = palloc(maxnxids * sizeof(TransactionId));
-			}
-			else
-			{
-				maxnxids = 2 * maxnxids;
-				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
-			}
-		}
-
-		xids[nxids++] = xid;
-
-		MemoryContextSwitchTo(oldcxt);
-	}
-
-	changes_filename(path, subid, xid);
-
-	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
 
 	/*
 	 * If this is the first streamed segment, the file must not exist, so
@@ -2855,17 +2844,32 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 	 * for writing, in append mode.
 	 */
 	if (first_segment)
-		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
-	else
-		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+	{
+		/*
+		 * Shared fileset handle must be allocated in the persistent context.
+		 */
+		SharedFileSet *fileset =
+			MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
 
-	stream_fd = OpenTransientFile(path, flags);
+		PrepareTempTablespaces();
+		SharedFileSetInit(fileset, NULL);
+		stream_fd = BufFileCreateShared(fileset, path);
 
-	if (stream_fd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+	MemoryContextSwitchTo(oldcxt);
 }
 
 /*
@@ -2880,12 +2884,12 @@ stream_close_file(void)
 {
 	Assert(in_streamed_transaction);
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 
-	CloseTransientFile(stream_fd);
+	BufFileClose(stream_fd);
 
 	stream_xid = InvalidTransactionId;
-	stream_fd = -1;
+	stream_fd = NULL;
 }
 
 /*
@@ -2907,21 +2911,19 @@ stream_write_change(char action, StringInfo s)
 
 	Assert(in_streamed_transaction);
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
-
 	/* first write the size */
-	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+	if (BufFileWrite(stream_fd, &len, sizeof(len)) != sizeof(len))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not serialize streamed change to file: %m")));
 
 	/* then the action */
-	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+	if (BufFileWrite(stream_fd, &action, sizeof(action)) != sizeof(action))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not serialize streamed change to file: %m")));
@@ -2929,12 +2931,10 @@ stream_write_change(char action, StringInfo s)
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
-	if (write(stream_fd, &s->data[s->cursor], len) != len)
+	if (BufFileWrite(stream_fd, &s->data[s->cursor], len) != len)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not serialize streamed change to file: %m")));
-
-	pgstat_report_wait_end();
 }
 
 /*
-- 
2.23.0
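
To make the new flow easier to review, here is a minimal sketch of the
stream-file lifecycle after this conversion (an illustration only, not part
of the patch; it assumes the extended BufFileOpenShared() signature added
by the 0013 patch further below, and open_stream_file is a made-up name):

static BufFile *
open_stream_file(SharedFileSet *fileset, const char *path, bool first_segment)
{
	BufFile    *fd;

	if (first_segment)
	{
		/* first stream of this xact: the file must not exist yet */
		fd = BufFileCreateShared(fileset, path);
	}
	else
	{
		/* later streams of the same xact: reopen and append at the end */
		fd = BufFileOpenShared(fileset, path, O_RDWR);
		BufFileSeek(fd, 0, 0, SEEK_END);
	}

	return fd;
}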

v27/v27-0006-Bugfix-handling-of-incomplete-toast-spec-insert.patch
From ed26baddc1eefbd1135c4124d78ba549b758b9b6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 15:25:02 +0530
Subject: [PATCH v27 06/14] Bugfix handling of incomplete toast/spec insert

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 337 ++++++++++++++----
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  47 ++-
 5 files changed, 329 insertions(+), 76 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2d77107c4f..3927448f46 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 69c1f45ef6..c841687c66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -727,7 +727,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -794,7 +796,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -851,7 +854,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -887,7 +891,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -987,7 +991,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1025,7 +1029,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 47dc31298d..36958fe2ee 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -237,7 +252,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool partial_truncate);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -646,14 +662,91 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 	return txn;
 }
 
+/*
+ * Handle an incomplete tuple during streaming.  If streaming is enabled we
+ * might need to stream an in-progress transaction, but sometimes we get
+ * incomplete changes which we cannot stream until the complete change
+ * arrives, e.g. a toast table insert without the main table insert.  So
+ * this function remembers the lsn of the last complete change and the
+ * size of the changes up to that lsn, so that if we need to stream we
+ * stream only up to the last complete lsn.
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert, Size total_size)
+{
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * If this is the first incomplete change then remember the size of the
+	 * complete changes so far.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		(toast_insert || IsSpecInsert(change->action)))
+		toptxn->complete_size = total_size;
+
+	/*
+	 * If this is a toast insert then set the corresponding bit.  Basically,
+	 * both update and insert do the insert into the toast table.  And, as
+	 * explained in the function header, we cannot stream only the toast
+	 * changes.  So whenever we get a toast insert we set the flag, and clear
+	 * it whenever we get the next insert or update on the main table.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial tuple and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If we don't have any incomplete change after this change then set
+	 * this LSN as the last complete lsn.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)))
+	{
+		toptxn->last_complete_lsn = change->lsn;
+
+		/*
+		 * If the transaction is serialized and the changes are complete in
+		 * the top level transaction then immediately stream the transaction.
+		 * The reason for not waiting for the memory limit is that if the
+		 * transaction got serialized in streaming mode, we had already
+		 * reached the memory limit but could not stream it at that time due
+		 * to an incomplete tuple, so stream it now as soon as the tuple is
+		 * complete.  Also, if we don't stream the serialized changes and we
+		 * get more incomplete changes in this transaction, we have no way
+		 * to partly truncate the serialized changes.
+		 */
+		if (rbtxn_is_serialized(txn))
+			ReorderBufferStreamTXN(rb, toptxn);
+	}
+}
+
 /*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
+	Size	total_size = 0;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
@@ -665,9 +758,28 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * Get the total size of the top transaction before updating the size
+	 * for the current change, so that if this is an incomplete tuple we
+	 * know the size prior to this change.  That will be used for updating
+	 * the size of the complete changes in the top transaction for streaming.
+	 */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			total_size = txn->toptxn->total_size;
+		else
+			total_size = txn->total_size;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* Handle the incomplete tuple, if streaming is enabled */
+	if (ReorderBufferCanStream(rb))
+		ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert,
+										   total_size);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -697,7 +809,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1407,11 +1519,45 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 /*
  * Discard changes from a transaction (and subtransactions), after streaming
  * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ * If partial_truncate is false we completely truncate the transaction,
+ * otherwise we truncate only up to last_complete_lsn.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 bool partial_truncate)
 {
 	dlist_mutable_iter iter;
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * The serialized transaction should never be partly truncated, because if
+	 * it is serialized then we stream it as soon as its changes get completed.
+	 */
+	Assert(!(rbtxn_is_serialized(txn) && partial_truncate));
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1428,7 +1574,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, partial_truncate);
 	}
 
 	/* cleanup changes in the toplevel txn */
@@ -1438,30 +1584,19 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
+		/* We have truncated up to the last complete lsn, so stop. */
+		if (partial_truncate && (change->lsn > toptxn->last_complete_lsn))
+		{
+			/* The transaction must have incomplete changes. */
+			Assert(rbtxn_has_incomplete_tuple(toptxn));
+			break;
+		}
+
 		/* remove the change from it's containing list */
 		dlist_delete(&change->node);
-
 		ReorderBufferReturnChange(rb, change);
 	}
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The toplevel transaction, identified by (toptxn==NULL), is marked
-	 * as streamed always, even if it does not contain any changes (that
-	 * is, when all the changes are in subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts
-	 * for XIDs the downstream is not aware of. And of course, it always
-	 * knows about the toplevel xact (we send the XID in all messages),
-	 * but we never stream XIDs of empty subxacts.
-	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
 	 * any memory. We could also keep the hash table and update it with
@@ -1473,9 +1608,39 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
-	/* also reset the number of entries in the transaction */
-	txn->nentries_mem = 0;
-	txn->nentries = 0;
+	/*
+	 * Adjust nentries/nentries_mem based on the changes processed.  See
+	 * comments where nprocessed is declared.
+	 */
+	if (partial_truncate)
+	{
+		txn->nentries -= txn->nprocessed;
+		txn->nentries_mem -= txn->nprocessed;
+	}
+	else
+	{
+		txn->nentries = 0;
+		txn->nentries_mem = 0;
+	}
+	txn->nprocessed = 0;
+
+	/*
+	 * If this is a top transaction then we can reset the
+	 * last_complete_lsn and complete_size, because by now we would
+	 * have streamed all the changes up to last_complete_lsn.
+	 */
+	if (partial_truncate && (txn->toptxn == NULL))
+	{
+		toptxn->last_complete_lsn = InvalidXLogRecPtr;
+		toptxn->complete_size = 0;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
 }
 
 /*
@@ -1762,7 +1927,7 @@ ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
 								   ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed. */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Stop the stream. */
 	rb->stream_stop(rb, txn, last_lsn);
@@ -1794,6 +1959,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 	ReorderBufferChange *volatile specinsert = NULL;
 	volatile bool	stream_started = false;
+	volatile bool	partial_truncate = false;
+
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1816,6 +1983,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
+		ReorderBufferTXN *curtxn;
 
 		if (using_subtxn)
 			BeginInternalSubTransaction(streaming? "stream" : "replay");
@@ -1852,7 +2020,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			/* Set the xid for concurrent abort check. */
 			if (streaming)
-				SetupCheckXidLive(change->txn->xid);
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
 
 			switch (change->action)
 			{
@@ -2116,6 +2287,27 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
 			}
+
+			if (streaming)
+			{
+				/*
+				 * Increment the nprocessed count.  See the detailed comment
+				 * for usage of this in ReorderBufferTXN structure.
+				 */
+				curtxn->nprocessed++;
+
+				/*
+				 * If the transaction contains an incomplete tuple and this
+				 * is the last complete change then stop further processing
+				 * of the transaction and set the partial truncate flag.
+				 */
+				if (rbtxn_has_incomplete_tuple(txn) &&
+					prev_lsn == txn->last_complete_lsn)
+				{
+					partial_truncate = true;
+					break;
+				}
+			}
 		}
 
 		/*
@@ -2135,7 +2327,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * Done with current changes, call stream_stop callback for streaming
-		 * transaction, commit callback otherwise.  If we have sent
+		 * transaction, commit callback otherwise.  Only if we have sent
 		 * start/begin.
 		 */
 		if (stream_started)
@@ -2187,7 +2379,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, partial_truncate);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2524,7 +2716,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2573,7 +2765,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2596,6 +2788,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2610,8 +2803,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2619,12 +2817,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2685,7 +2891,7 @@ ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
 	memcpy(change->data.inval.invalidations, msgs,
 		   sizeof(SharedInvalidationMessage) * nmsgs);
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -2872,18 +3078,28 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+	Size		largest_size = 0;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
+		Size	size = 0;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/*
+		 * If this transaction has some incomplete changes then only consider
+		 * the size up to the last complete lsn.
+		 */
+		if (rbtxn_has_incomplete_tuple(txn))
+			size = txn->complete_size;
+		else
+			size = txn->total_size;
+
+		/* If the current transaction is larger, remember it. */
+		if ((largest == NULL || size > largest_size) && size > 0)
+		{
 			largest = txn;
+			largest_size = size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2921,27 +3137,22 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		 * Pick the largest transaction (or subtransaction) and evict it from
 		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		if (ReorderBufferCanStream(rb))
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
 		{
-			/*
-			* Pick the largest toplevel transaction and evict it from memory by
-			* streaming the already decoded part.
-			*/
-			txn = ReorderBufferLargestTopTXN(rb);
-
 			/* we know there has to be one, because the size is not zero */
 			Assert(txn && !txn->toptxn);
-			Assert(txn->size > 0);
-			Assert(rb->size >= txn->size);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
 		{
 			/*
-			* Pick the largest transaction (or subtransaction) and evict it from
-			* memory by serializing it to disk.
-			*/
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
 			txn = ReorderBufferLargestTXN(rb);
 
 			/* we know there has to be one, because the size is not zero */
@@ -2950,14 +3161,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(rb->size >= txn->size);
 
 			ReorderBufferSerializeTXN(rb, txn);
-		}
 
-		/*
-		 * After eviction, the transaction should have no entries in memory,
-		 * and should use 0 bytes for changes.
-		 */
-		Assert(txn->size == 0);
-		Assert(txn->nentries_mem == 0);
+			/*
+			 * After eviction, the transaction should have no entries in memory, and
+			 * should use 0 bytes for changes.
+			 */
+			Assert(txn->size == 0);
+			Assert(txn->nentries_mem == 0);
+		}
 	}
 
 	/* We must be under the memory limit now. */
@@ -3356,10 +3567,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
-
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
 }
 
 /*
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b3e2b3f64b..2d86209f61 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -172,6 +172,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -191,6 +193,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert, without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -199,10 +221,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -350,6 +368,23 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* Size of the complete changes. */
+	Size		complete_size;
+
+	/* LSN of the last complete change. */
+	XLogRecPtr	last_complete_lsn;
+
+	/*
+	 * Number of changes processed.  This is used to keep track of changes that
+	 * remain to be streamed.  As of now, this can happen either due to toast
+	 * tuples or speculative insertions where we need to wait for multiple
+	 * changes before we can send them.
+	 */
+	uint64		nprocessed;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -537,7 +572,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0
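
As an illustration of how the incomplete-change tracking above is meant to
be consumed (a sketch only, not part of the patch; streamable_size is a
made-up helper), a caller deciding how much of a transaction is safe to
stream looks at the complete size rather than the total size:

static Size
streamable_size(ReorderBufferTXN *toptxn)
{
	/*
	 * Changes past last_complete_lsn may belong to an unfinished toast
	 * or speculative-insert group, so they must not be sent yet.
	 */
	if (rbtxn_has_incomplete_tuple(toptxn))
		return toptxn->complete_size;

	/* No incomplete changes; the whole transaction can be streamed. */
	return toptxn->total_size;
}

This is exactly the rule ReorderBufferLargestTopTXN applies in the patch
when picking the transaction to evict by streaming.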

v27/v27-0001-Immediately-WAL-log-subtransaction-and-top-level.patch
From 2b84cec6903c50af9264fb0a6efda8356b0f6f2a Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v27 01/14] Immediately WAL-log subtransaction and top-level
 XID association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead) only when wal_level=logical.
We cannot remove the existing XLOG_XACT_ASSIGNMENT WAL record as that
is required for avoiding overflow in the hot standby snapshot.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 ++++++++++-
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 44 +++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd30b62d36..04fd5ca870 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL log the top-level XID for
+ * operation in a subtransaction.  We require that for logical decoding, see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09e..c526bb1928 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4f46..a757baccfc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1197,6 +1197,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1235,6 +1236,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..0c0c371739 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign the subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 88025b1cc2..22bb96ca2a 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d..05cc2b696c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b0f2a6ed43..b976882229 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -304,6 +306,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0
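
To illustrate the effect of this patch (a simplified sketch mirroring the
new code in decode.c and xloginsert.c, not part of the diff): with
wal_level=logical, the first WAL record of each subtransaction now carries
the toplevel XID as an extra header chunk, which the decoding side can pick
up directly instead of waiting for XLOG_XACT_ASSIGNMENT:

/*
 * Simplified record layout (header chunks in order):
 *
 *   XLogRecord header
 *   ... block headers ...
 *   XLR_BLOCK_ID_ORIGIN       + RepOriginId    (if any)
 *   XLR_BLOCK_ID_TOPLEVEL_XID + TransactionId  (new in this patch)
 *   XLR_BLOCK_ID_DATA_*       + main data
 */
TransactionId txid = XLogRecGetTopXid(record);

/* InvalidTransactionId means no pending assignment for this record */
if (TransactionIdIsValid(txid))
	ReorderBufferAssignChild(ctx->reorder, txid,
							 XLogRecGetXid(record), buf.origptr);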

v27/v27-0012-Add-streaming-option-in-pg_dump.patch
From 9daf82c29d7d12e813cee550a7786a08ef90667c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v27 12/14] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index dfe43968b8..8ca4a05822 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb3a9..af64270c55 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0

v27/v27-0013-Change-buffile-interface-required-for-streaming-.patch
From f8b755ca0aa9801ab68a0e29a8f1ccad3db1a2c8 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 16:40:25 +0530
Subject: [PATCH v27 13/14] Change buffile interface required for streaming
 transaction

Implement BufFileTruncateShared and SEEK_END support in BufFileSeek.
Also add an option to provide a mode while opening shared BufFiles,
instead of always opening them in read-only mode.
---
 src/backend/storage/file/buffile.c        | 52 ++++++++++++++++++++---
 src/backend/storage/file/fd.c             | 10 ++---
 src/backend/storage/file/sharedfileset.c  |  7 +--
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  3 +-
 8 files changed, 65 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 35e8f12e62..184c6d9c3b 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -277,7 +277,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +301,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +321,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -670,11 +670,14 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+			/*
+			 * Get the file size of the last file to get the last offset
+			 * of that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -843,3 +846,40 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file upto the given fileno and the offsets.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int newFile = file->numFiles;
+	off_t newOffset = file->curOffset;
+	char segment_name[MAXPGPATH];
+	int i;
+
+	/* Loop over all the files up to the fileno which we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Except for the fileno itself, we can directly delete the other
+		 * files.  If the offset is 0 then we can delete the fileno file as
+		 * well, unless it is the first file.
+		 */
+		if ((i != fileno || offset == 0) && fileno != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			SharedFileSetDelete(file->fileset, segment_name, true);
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			FileTruncate(file->files[i], offset, WAIT_EVENT_BUFFILE_READ);
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 7dc6dd2f15..10591fee18 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1741,18 +1741,18 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are opened with the given mode, and are
+ * automatically closed at the end of the transaction but are not deleted on
+ * close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index f7206c9175..4b39d91320 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -68,7 +68,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
 }
 
 /*
@@ -131,13 +132,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 138da0c1b4..6c3114edad 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -544,7 +546,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index c3ab494a45..efba5dca6e 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -563,7 +563,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index 60433f35b4..8b1633415a 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
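
For BufFile callers the net effect is that the open mode is now explicit,
and a read-write opener can roll the file back. A sketch of the intended
flow (the file name, data/len, and fileno/offset are placeholders):

	/* one backend creates and fills the file */
	BufFile *wf = BufFileCreateShared(fileset, "t1");
	BufFileWrite(wf, data, len);
	BufFileExportShared(wf);

	/* a reader opens it read-only, matching the pre-patch behavior */
	BufFile *rf = BufFileOpenShared(fileset, "t1", O_RDONLY);

	/* a read-write opener may also truncate it back to (fileno, offset) */
	BufFile *tf = BufFileOpenShared(fileset, "t1", O_RDWR);
	BufFileTruncateShared(tf, fileno, offset);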
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d7df..e209f047e8 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf077e5..b2f4ba4bd8 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,7 +37,8 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
-- 
2.23.0

v27/v27-0011-Provide-new-api-to-get-the-streaming-changes.patch

From e42ed7aff7f1f7feb70f4f6fb849c0a1376fd1c8 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v27 11/14] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 doc/src/sgml/test-decoding.sgml               | 22 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..eed6e9d134 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the
+  typical output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 9f509fbc21..5fe6f28ba2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1243,6 +1243,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e848..70c28ffa91 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes then disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7869f721da..875e0bef28 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10115,6 +10115,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0

v27/v27-0003-Extend-the-output-plugin-API-with-stream-methods.patch

From 92e418399e28686ee274c458f33dff60c2411704 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v27 03/14] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 213 +++++++++++++
 src/backend/replication/logical/logical.c | 365 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 811 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93cf6b..50cfd6fa47 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +869,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and
+    <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to the spill-to-disk behavior, streaming is triggered when the
+    total amount of changes decoded from the WAL (for all in-progress
+    transactions) exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting.  At that point the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.  However, in some
+    cases we still have to spill to disk even if streaming is enabled,
+    because we may cross the memory limit before a complete tuple has been
+    decoded (e.g. when only the TOAST-table insert has been decoded so far,
+    but not the insert into the main table).
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be3b0..26d461effb 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the change/commit/abort and
+	 * start/stop callbacks. The message and truncate callbacks are optional,
+	 * similar to regular output plugins. We however enable streaming when at
+	 * least one of the methods is defined, so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when missing, but the wrappers
+	 * simply do nothing. We must set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there will crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,321 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index af35287896..65814af9f5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -354,6 +354,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
 struct ReorderBuffer
 {
 	/*
@@ -392,6 +440,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0

v27/v27-0008-Add-support-for-streaming-to-built-in-replicatio.patch

From 6dcecde9d2ebc0e3f4b489248af026d14dd7d6d9 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 15:34:29 +0530
Subject: [PATCH v27 08/14] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, to identify in-progress
transactions and to allow adding additional bits of information (e.g.
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying it on commit.

We however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so there
is nowhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   11 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1012 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 +++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 20 files changed, 2019 insertions(+), 40 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index c24ace14d1..d8de56c928 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 5bbc165f70..c25b7c5962 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026187..9065a1be1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming = false;
+	if (streaming_given)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 309378ae54..6713392d4d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4139,6 +4139,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
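
With this option added, the START_REPLICATION command built here might look
like the following (illustrative slot and publication names):

	START_REPLICATION SLOT "sub1" LOGICAL 0/0
		(proto_version '2', streaming 'on', publication_names '"pub1"')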
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..83d0642cf3 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
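+/*
+ * Write STREAM START to the output stream.
+ *
+ * The wire format produced below is simply:
+ *
+ *   byte 'S' | int32 xid | byte first_segment (1 or 0)
+ */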
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
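+/*
+ * Write STREAM STOP to the output stream. The message carries no payload
+ * beyond the single action byte:
+ *
+ *   byte 'E'
+ */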
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
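+/*
+ * Write STREAM COMMIT to the output stream.
+ *
+ * The wire format produced below is:
+ *
+ *   byte 'c' | int32 xid | byte flags | int64 commit_lsn |
+ *   int64 end_lsn | int64 commit_time
+ */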
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
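+/*
+ * Write STREAM ABORT to the output stream. When aborting the toplevel
+ * transaction, the XID and the subtransaction XID are the same.
+ *
+ * The wire format produced below is:
+ *
+ *   byte 'A' | int32 xid | int32 subxid
+ */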
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..d2d9469999 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions also
+ * requires handling aborts of both the toplevel transaction and individual
+ * subtransactions. This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.
+ * This is necessary so that different workers processing a remote transaction
+ * with the same XID don't interfere.
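+ *
+ * For example (an illustrative sketch; the PID, subscription OID and XID
+ * values naturally vary), a large transaction may be spooled into:
+ *
+ *   base/pgsql_tmp/pgsql_tmp12345-16394-508.changes
+ *   base/pgsql_tmp/pgsql_tmp12345-16394-508.subxacts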
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -100,6 +124,7 @@ typedef struct SlotErrCallbackArg
 } SlotErrCallbackArg;
 
 static MemoryContext ApplyMessageContext = NULL;
+static MemoryContext LogicalStreamingContext = NULL;
 MemoryContext ApplyContext = NULL;
 
 WalReceiverConn *wrconn = NULL;
@@ -110,12 +135,58 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;						/* XID of the subxact */
+	off_t		offset;						/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +258,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +659,326 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * if this is not the first segment, open existing file
+	 *
+	 * XXX Note that the cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+	{
+		MemoryContext oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+
+		/* Read the subxacts info in per-stream context. */
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+		MemoryContextSwitchTo(oldctx);
+	}
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+	{
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		int			fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
+		if (fd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+		}
+
+		/* OK, truncate the file at the right offset. */
+		if (ftruncate(fd, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+		CloseTransientFile(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply handlers (invoked via apply_dispatch) are aware
+	 * we're in a remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +992,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1010,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1049,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1167,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1312,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1685,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1826,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1938,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids - 1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1970,17 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode
+	 * is enabled.  This context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													 ALLOCSET_DEFAULT_SIZES);
+
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2429,529 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
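+ *
+ * The file layout is then simply (a sketch of what the writes below
+ * produce):
+ *
+ *   uint32 nsubxacts | SubXactInfo subxacts[nsubxacts]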
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Let the caller decide in which memory context this is allocated.
+	 * Ideally, during stream start it is allocated in the
+	 * LogicalStreamingContext, which is reset on stream stop, while during
+	 * stream abort the memory is only needed briefly, so it is allocated
+	 * in ApplyMessageContext.
+	 */
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd >= 0);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - don't report an error for a missing file if the flag is true.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Clean up the XID from the array - find the XID in the array and
+	 * remove it by moving the last element into its place. The array is
+	 * bound to be fairly small (the maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions), so simply loop
+	 * through the array to find the index of the XID.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into the freed slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		/* Need to allocate this in permanent context */
+		oldcxt = MemoryContextSwitchTo(ApplyContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length field itself), action code (identifying the message type) and
+ * message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
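+ *
+ * The resulting on-disk record layout is thus (a sketch of the writes
+ * below; the length field does not count itself):
+ *
+ *   int32 len | char action | message body (len - 1 bytes)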
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
+/*
+ * Clean up the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3117,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3118..1509f9b826 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit time,
+ * and with streamed transactions the commit order may be different from the
+ * order in which the transactions are sent.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this,
+ * we maintain a list of XIDs (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient protocol version, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +373,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may be applied only later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +423,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +465,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +486,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +518,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +562,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +582,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +607,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +639,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +719,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +840,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * Check whether the schema for this relation was already sent in the given
+ * streamed transaction. We expect a relatively small number of streamed
+ * transactions, so a simple linear search of the list is fine.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid to the rel sync entry, recording that we have already sent
+ * the schema of the relation in this streamed transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -753,11 +1002,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 1b929a603e..cbc416a274 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -157,6 +157,12 @@ create_logical_replication_slot(char *name, char *plugin,
 											   .segment_close = wal_segment_close),
 									NULL, NULL, NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d0c0674848..ffc3d50081 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..899d7e2013 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -980,7 +980,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ac1acbb27e..95132062c6 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

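A quick way to exercise this end to end, mirroring the TAP tests above (a
sketch only — the connection string is a placeholder, and it assumes the
streaming subscription option introduced earlier in this series):

    -- publisher (postgresql.conf: logical_decoding_work_mem = '64kB')
    CREATE TABLE test_tab (a int PRIMARY KEY, b varchar);
    CREATE PUBLICATION tap_pub FOR TABLE test_tab;

    -- subscriber
    CREATE TABLE test_tab (a int PRIMARY KEY, b text);
    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION tap_pub WITH (streaming = on);

    -- publisher: a transaction large enough to exceed the memory limit,
    -- so its changes are streamed to the subscriber before COMMIT
    BEGIN;
    INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1, 5000) s(i);
    COMMIT;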
v27/v27-0010-Add-TAP-test-for-streaming-vs.-DDL.patch
From 7059fa0607f91afd84ff0452ce437769503e964d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v27 10/14] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

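Incidentally, for anyone verifying which subscriptions were created with
streaming enabled, the new substream catalog column (see the pg_subscription
changes earlier in the series) makes that easy to check from SQL:

    SELECT subname, substream FROM pg_subscription;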
v27/v27-0007-Track-statistics-for-streaming.patch
From 6e61f1ab42ce80ce3260332e9248371117f7720b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 15:26:18 +0530
Subject: [PATCH v27 07/14] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 33 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 12 +++++++
 src/backend/replication/walsender.c           | 32 +++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 89662cc0a3..45208ad8a1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2494,6 +2494,39 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Amount of decoded transaction data spilled to disk.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_txns</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of in-progress transactions streamed to the subscriber after
+       the memory used by logical decoding exceeds
+       <literal>logical_decoding_work_mem</literal>. Streaming only works
+       with toplevel transactions (subtransactions can't be streamed
+       independently), so the counter is not incremented for subtransactions.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times in-progress transactions were streamed to the
+       subscriber. Transactions may get streamed repeatedly, and this
+       counter is incremented on every such invocation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_bytes</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Amount of decoded in-progress transaction data streamed to the subscriber.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 56420bbc9d..9f509fbc21 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 36958fe2ee..d76598e105 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -348,6 +348,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3562,6 +3566,14 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferFreeSnap(rb, txn->snapshot_now);
 	}
 
+	/* Update the stream statistics. */
+	rb->streamCount += 1;
+	rb->streamBytes += (rbtxn_has_incomplete_tuple(txn)) ?
+						txn->complete_size : txn->total_size;
+
+	/* Don't count a transaction that has already been streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	/*
 	 * Access the main routine to decode the changes and send to output plugin.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e2477c47e0..d0c0674848 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1349,7 +1349,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1370,7 +1370,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or were
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2421,6 +2422,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3256,7 +3260,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3314,6 +3318,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3339,6 +3346,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3441,6 +3451,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3683,11 +3698,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 {
 	ReorderBuffer *rb = ctx->reorder;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
+
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockAcquire(&MyWalSnd->mutex);
 	MyWalSnd->spillTxns = rb->spillTxns;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 61f2c2f5b4..7869f721da 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 2d86209f61..399f3e49f2 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -549,15 +549,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b813e32215..cf22f8a038 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0

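For reviewers trying this out: with the patch applied, the new counters can
be inspected next to the existing spill statistics (a sketch; column names
as added above):

    SELECT application_name,
           spill_txns, spill_count, pg_size_pretty(spill_bytes) AS spilled,
           stream_txns, stream_count, pg_size_pretty(stream_bytes) AS streamed
      FROM pg_stat_replication;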
v27/v27-0002-Issue-individual-invalidations-with-wal_level-lo.patch
From 9164264fa41e9fa93a27ebaa71d7d90bcd5885a3 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v27 02/14] Issue individual invalidations with
 wal_level=logical.

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record
type XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulated all the invalidations in
memory and wrote them out only once, at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c        |  40 +++++++
 src/backend/access/transam/xact.c             |   7 ++
 src/backend/replication/logical/decode.c      |  16 +++
 .../replication/logical/reorderbuffer.c       | 104 +++++++++++++++---
 src/backend/utils/cache/inval.c               |  49 +++++++++
 src/include/access/xact.h                     |  13 ++-
 src/include/replication/reorderbuffer.h       |  11 ++
 7 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce75565f..7ab0d11ea9 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		/* not expected, but print something anyway */
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 04fd5ca870..72efa3c1b3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX We ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371739..a1d87450ce 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -282,6 +282,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 * See LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				Assert(TransactionIdIsValid(xid));
+				ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+											 invals->nmsgs, invals->msgs);
+
+
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 642a1c767f..cd406ca4d2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -220,7 +220,7 @@ static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
 static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 									   ReorderBufferIterTXNState *state);
-static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
 
 /*
  * ---------------------------------------
@@ -455,6 +455,11 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 				pfree(change->data.msg.message);
 			change->data.msg.message = NULL;
 			break;
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			if (change->data.inval.invalidations)
+				pfree(change->data.inval.invalidations);
+			change->data.inval.invalidations = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
@@ -1814,17 +1819,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
-
-						/*
-						 * Every time the CommandId is incremented, we could
-						 * see new catalog contents, so execute all
-						 * invalidations.
-						 */
-						ReorderBufferExecuteInvalidations(rb, txn);
 					}
 
 					break;
 
+				case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+					/*
+					 * Execute the invalidation messages locally.
+					 *
+					 * XXX Do we need to care about relcacheInitFileInval and
+					 * the other fields added to ReorderBufferChange, or just
+					 * about the message itself?
+					 */
+					ReorderBufferExecuteInvalidations(
+							change->data.inval.ninvalidations,
+							change->data.inval.invalidations);
+					break;
+
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -1866,7 +1878,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -1892,7 +1905,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		AbortCurrentTransaction();
 
 		/* make sure there's no cache pollution */
-		ReorderBufferExecuteInvalidations(rb, txn);
+		ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
 
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
@@ -2202,6 +2216,33 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	txn->ntuplecids++;
 }
 
+/*
+ * Queue the invalidation messages as a change in the given transaction.
+ */
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+							 XLogRecPtr lsn, int nmsgs,
+							 SharedInvalidationMessage *msgs)
+{
+	MemoryContext oldcontext;
+	ReorderBufferChange *change;
+
+	oldcontext = MemoryContextSwitchTo(rb->context);
+
+	change = ReorderBufferGetChange(rb);
+	change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+	change->data.inval.ninvalidations = nmsgs;
+	change->data.inval.invalidations = (SharedInvalidationMessage *)
+		MemoryContextAlloc(rb->context,
+						   sizeof(SharedInvalidationMessage) * nmsgs);
+	memcpy(change->data.inval.invalidations, msgs,
+		   sizeof(SharedInvalidationMessage) * nmsgs);
+
+	ReorderBufferQueueChange(rb, xid, lsn, change);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Setup the invalidation of the toplevel transaction.
  *
@@ -2234,12 +2275,12 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
  * in the changestream but we don't know which those are.
  */
 static void
-ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
 {
 	int			i;
 
-	for (i = 0; i < txn->ninvalidations; i++)
-		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+	for (i = 0; i < nmsgs; i++)
+		LocalExecuteInvalidationMessage(&msgs[i]);
 }
 
 /*
@@ -2593,6 +2634,24 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				char	   *data;
+				Size		inval_size = sizeof(SharedInvalidationMessage) *
+										change->data.inval.ninvalidations;
+
+				sz += inval_size;
+
+				ReorderBufferSerializeReserve(rb, sz);
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+				memcpy(data, change->data.inval.invalidations, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -2740,6 +2799,11 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 
 				break;
 			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			sz += sizeof(SharedInvalidationMessage) *
+					change->data.inval.ninvalidations;
+			break;
+
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
 				Snapshot	snap;
@@ -3006,6 +3070,20 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					   change->data.msg.message_size);
 				data += change->data.msg.message_size;
 
+				break;
+			}
+		case REORDER_BUFFER_CHANGE_INVALIDATION:
+			{
+				Size	inval_size = sizeof(SharedInvalidationMessage) *
+									change->data.inval.ninvalidations;
+
+				change->data.inval.invalidations =
+						MemoryContextAlloc(rb->context, inval_size);
+
+				/* read the message */
+				memcpy(change->data.inval.invalidations, data, inval_size);
+				data += inval_size;
+
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..cba5b6c64b 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,12 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end
+ *	to support decoding of in-progress transactions.  Previously it was
+ *	enough to log invalidations only at commit, because we only decoded the
+ *	transaction at commit time.  We only need to log catalog cache and
+ *	relcache invalidations; there cannot be any active MVCC scan in logical
+ *	decoding, so we need not log snapshot invalidations.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +110,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +217,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
 	if (transInvalInfo == NULL)
 		return;
 
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
@@ -1501,3 +1513,40 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations()
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage	*invalMessages;
+	int	nmsgs = 0;
+
+	if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+	{
+		ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+								 MakeSharedInvalidMessagesArray);
+		invalMessages = SharedInvalidMessagesArray;
+		nmsgs  = numSharedInvalidMessagesArray;
+		SharedInvalidMessagesArray = NULL;
+		numSharedInvalidMessagesArray = 0;
+	}
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 22bb96ca2a..3f3e137531 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..af35287896 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -57,6 +57,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_UPDATE,
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_MESSAGE,
+	REORDER_BUFFER_CHANGE_INVALIDATION,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
@@ -149,6 +150,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * messages */
+		}			inval;
 	}			data;
 
 	/*
@@ -459,6 +468,8 @@ void		ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr ls
 void		ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										 RelFileNode node, ItemPointerData pt,
 										 CommandId cmin, CommandId cmax, CommandId combocid);
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+								  int nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
 										  Size nmsgs, SharedInvalidationMessage *msgs);
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
-- 
2.23.0
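
To see how the pieces above fit together, here is a minimal sketch of a
decode-side handler consuming the new XLOG_XACT_INVALIDATIONS record (the
function name DecodeInvalidations and its placement are assumptions; the
record layout and the ReorderBufferAddInvalidation prototype are taken from
the patch above):

static void
DecodeInvalidations(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
	XLogReaderState *r = buf->record;
	xl_xact_invalidations *invals;

	invals = (xl_xact_invalidations *) XLogRecGetData(r);

	/* queue the messages in the transaction that emitted them */
	ReorderBufferAddInvalidation(ctx->reorder, XLogRecGetXid(r),
								 buf->origptr,
								 invals->nmsgs, invals->msgs);
}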

v27/v27-0005-Implement-streaming-mode-in-ReorderBuffer.patch

From 97d2a59e201f9cd10c339ff3d72f64531f18d5c4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 1 May 2020 19:56:35 +0530
Subject: [PATCH v27 05/14] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes
we have in memory and invoke the new stream API methods. This happens
in ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().  However, if we have an incomplete toast or
speculative insert we spill to disk, because we cannot generate the
complete tuple to stream.  As soon as we get the complete tuple, we
stream the transaction including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 764 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  31 +
 3 files changed, 757 insertions(+), 76 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index cd406ca4d2..47dc31298d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +382,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -772,6 +786,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -865,6 +911,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to toplevel transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -1024,6 +1073,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1038,6 +1090,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1315,6 +1370,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1340,6 +1404,80 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1491,57 +1629,171 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that
+ * the (sub)transaction might get aborted concurrently.  In such a case, if
+ * the (sub)transaction has catalog updates, we might decode a tuple using
+ * the wrong catalog version.  So, to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction the current change
+ * belongs to.  During a catalog scan we can then check the status of that
+ * xid, and if it has aborted we report a specific error so that we can stop
+ * streaming the current transaction and discard the already streamed
+ * changes.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine: when we decode the abort, we
+ * will stream an abort message to truncate the changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether the xid has aborted; that will happen during catalog access.
+	 * Also, reset the bsysscan flag.
+	 */
+	if (!TransactionIdDidCommit(xid))
+	{
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream, so that we can reuse them when sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.
+ */
+static void
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   Snapshot snapshot_now,
+								   CommandId command_id,
+								   XLogRecPtr last_lsn,
+								   ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+	ReorderBufferToastReset(rb, txn);
+	if (specinsert != NULL)
+		ReorderBufferReturnChange(rb, specinsert);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool	stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1564,21 +1816,44 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
-
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Start stream or begin transaction for the first change in the
+			 * current stream.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+					rb->stream_start(rb, txn, change->lsn);
+				else
+					rb->begin(rb, txn);
+				stream_started = true;
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1655,7 +1930,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1695,7 +1971,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1753,7 +2029,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1762,10 +2041,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1796,7 +2072,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1858,14 +2133,34 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes: if we have sent start/begin, call
+		 * the stream_stop callback for a streaming transaction, and the
+		 * commit callback otherwise.
+		 */
+		if (stream_started)
+		{
+			if (streaming)
+				rb->stream_stop(rb, txn, prev_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+			stream_started = false;
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we copied it.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1884,14 +2179,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1911,17 +2219,122 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/* Reset the CheckXidAlive */
+		if (streaming)
+			CheckXidAlive = InvalidTransactionId;
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If the error code is ERRCODE_TRANSACTION_ROLLBACK, that means we
+		 * have detected a concurrent abort of the (sub)transaction we are
+		 * streaming.  So just do the cleanup and return gracefully.
+		 * Otherwise, re-throw the error.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * We can only get this error in streaming mode, because only in
+			 * streaming mode do we send in-progress transactions.
+			 */
+			Assert(streaming);
 
-		PG_RE_THROW();
+			/*
+			 * In the TRY block we only stop the stream after we have sent
+			 * all the changes.  So if we have detected a concurrent abort,
+			 * the stream should not have been stopped yet.
+			 */
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+
+			/* Handle the concurrent abort. */
+			ReorderBufferHandleConcurrentAbort(rb, txn, snapshot_now,
+											   command_id, prev_lsn,
+											   specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.  We iterate over the top and
+ * subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * 'streamed' way, that is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1946,6 +2359,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2015,6 +2435,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2150,8 +2577,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2159,6 +2595,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2170,19 +2607,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2211,6 +2657,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2295,6 +2742,16 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark the toplevel transaction as having catalog changes too if one of
+	 * its children has, so that ReorderBufferBuildTupleCidHash can
+	 * conveniently check just the toplevel transaction and decide whether we
+	 * need to build the hash table or not.  In non-streaming mode we mark
+	 * the toplevel transaction in DecodeCommit, as we only stream on commit.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2398,6 +2855,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming, so their size is always 0).
+ * But here we can simply iterate over the limited number of toplevel
+ * transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
 * Check whether the logical_decoding_work_mem limit was reached, and if so,
 * pick the largest (sub)transaction, one at a time, to evict and spill its changes to
@@ -2430,11 +2919,38 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStream(rb))
+		{
+			/*
+			 * Pick the largest toplevel transaction and evict it from memory
+			 * by streaming the already decoded part.
+			 */
+			txn = ReorderBufferLargestTopTXN(rb);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2750,6 +3266,102 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which attempts to
+ * stream it again before the commit)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3868,6 +4480,16 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases we assume the CID is from the future command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 65814af9f5..b3e2b3f64b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +191,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -224,6 +243,11 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	final_lsn;
 
+	/*
+	 * Toplevel transaction for this subxact (NULL for top-level).
+	 */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN pointing to the end of the commit record + 1.
 	 */
@@ -254,6 +278,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0
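
The streaming path above is gated by ReorderBufferCanStream(), i.e. by
ctx->streaming, which is only enabled when the output plugin provides the
stream callbacks. A rough sketch of such a registration (the my_* handlers
are hypothetical; the callback member names follow the stream API added by
this patch series):

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* regular (commit-time) callbacks */
	cb->startup_cb = my_startup;
	cb->begin_cb = my_begin;
	cb->change_cb = my_change;
	cb->commit_cb = my_commit;
	cb->shutdown_cb = my_shutdown;

	/* streaming of large in-progress transactions */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;
}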

v27/v27-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch

From be7449ec75c6776e5ebc1b5237cfb88de7d5a193 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v27 04/14] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of this sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 50cfd6fa47..ab689f8d19 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94eb37d48d..2d77107c4f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * tableam level API but this is called from many places so we need to
+	 * ensure it here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set, then set a flag to indicate that a system
+	 * table scan is in progress.  See detailed comments at snapmgr.c where
+	 * these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted.  We can't directly use
+ * TransactionIdDidAbort because, after a crash, such a transaction might
+ * not have been marked as aborted.  See detailed comments at snapmgr.c
+ * where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..2f52b407c6 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..9f1ecd123f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index eb18739c36..2b7d3df617 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0
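
To make the documentation change above concrete, here is a minimal sketch of
how an output plugin should read a (user) catalog table, via the systable_*
APIs, so that a concurrent abort of the streamed transaction surfaces as
ERRCODE_TRANSACTION_ROLLBACK instead of a decode against the wrong catalog
version (the helper name and the sequential-scan form are illustrative):

static void
scan_user_catalog(Oid relid)
{
	Relation	rel;
	SysScanDesc scan;
	HeapTuple	tup;

	rel = table_open(relid, AccessShareLock);

	/* systable_getnext() errors out on a concurrent abort of CheckXidAlive */
	scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
	while (HeapTupleIsValid(tup = systable_getnext(scan)))
	{
		/* ... inspect the tuple ... */
	}
	systable_endscan(scan);

	table_close(rel, AccessShareLock);
}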

#365Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#364)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jun 12, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

- Currently, while reading/writing the streaming/subxact files we are
reporting a wait event, for example
'pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);', but
BufFileWrite/BufFileRead internally report the read/write wait event.
So I think we can avoid reporting that?

Yes, we can avoid that. No other place using BufFileRead does any
such reporting.

Basically, this part still needs work; once we get consensus, I can
remove those extra wait events from the patch.

Okay, feel free to send an updated patch with the above change.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#366Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#365)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jun 12, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 12, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

- Currently, while reading/writing the streaming/subxact files we are
reporting a wait event, for example
'pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);', but
BufFileWrite/BufFileRead internally report the read/write wait event.
So I think we can avoid reporting that?

Yes, we can avoid that. No other place using BufFileRead does any
such reporting.

I agree.

Basically, this part still needs work; once we get consensus, I can
remove those extra wait events from the patch.

Okay, feel free to send an updated patch with the above change.

Sure, I will do that in the next patch set.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#367Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#366)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jun 15, 2020 at 9:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Jun 12, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Basically, this part is still
I have to work upon, once we get the consensus then I can remove those
extra wait event from the patch.

Okay, feel free to send an updated patch with the above change.

Sure, I will do that in the next patch set.

I have few more comments on the patch
0013-Change-buffile-interface-required-for-streaming-.patch:

1.
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are read-only if the flag is set and are
+ * automatically closed at the end of the transaction but are not deleted on
+ * close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)

No need to say "are read-only if the flag is set". I don't see any
flag passed to the function, so that part of the comment doesn't seem
appropriate.

2.
@@ -68,7 +68,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
}

  /* Register our cleanup callback. */
- on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ if (seg)
+ on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
 }

Add a comment atop the function to explain when we don't want to
register the dsm detach stuff?
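
Something along these lines, perhaps (the wording is only a
suggestion):

/*
 * Callers may pass seg = NULL for a fileset that is not associated
 * with a DSM segment, e.g. a backend-local fileset whose lifetime is
 * managed explicitly by the caller.  In that case no on-detach
 * cleanup callback is registered and the caller is responsible for
 * deleting the files.
 */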

3.
+ */
+ newFile = file->numFiles - 1;
+ newOffset = FileSize(file->files[file->numFiles - 1]);
  break;

FileSize can return a negative length to indicate failure, which we
should handle; see other places in the code where FileSize is used.
But I have another question here: why do we need to implement
SEEK_END at all? How do other usages of the BufFile interface take
care of this? I see an API, BufFileTell, which can give the current
read/write location in the file; isn't that sufficient for your
usage? Also, how was this handled in the patch before the BufFile
usage?
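
A sketch of the missing FileSize check, assuming the usual ereport
style (the message wording is illustrative):

newFile = file->numFiles - 1;
newOffset = FileSize(file->files[file->numFiles - 1]);
if (newOffset < 0)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not determine size of temporary file \"%s\": %m",
                    FilePathName(file->files[file->numFiles - 1]))));
break;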

4.
+ /* Loop over all the  files upto the fileno which we want to truncate. */
+ for (i = file->numFiles - 1; i >= fileno; i--)

"the files", extra space in the above part of the comment.

5.
+ /*
+ * Except the fileno,  we can directly delete other files.

Before 'we', there is extra space.

6.
+ else
+ {
+ FileTruncate(file->files[i], offset, WAIT_EVENT_BUFFILE_READ);
+ newOffset = offset;
+ }

The wait event passed here doesn't seem to be appropriate. You might
want to introduce a new wait event WAIT_EVENT_BUFFILE_TRUNCATE. Also,
the error handling for FileTruncate is missing.
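
Roughly, with the suggested new wait event (note that
WAIT_EVENT_BUFFILE_TRUNCATE does not exist yet and would have to be
added alongside the other BufFile wait events):

else
{
    if (FileTruncate(file->files[i], offset,
                     WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not truncate file \"%s\": %m",
                        FilePathName(file->files[i]))));
    newOffset = offset;
}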

7.
+ if ((i != fileno || offset == 0) && fileno != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ SharedFileSetDelete(file->fileset, segment_name, true);
+ newFile--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+ }

Similar to the previous comment, I think we should handle the failure
of SharedFileSetDelete.

8. I think the comments related to BufFile shared API usage need to be
expanded in the code to explain the new usage. For ex., see the below
comments atop buffile.c
* BufFile supports temporary files that can be made read-only and shared with
* other backends, as infrastructure for parallel execution. Such files need
* to be created as a member of a SharedFileSet that all participants are
* attached to.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#368Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#367)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have few more comments on the patch
0013-Change-buffile-interface-required-for-streaming-.patch:

Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru:
1.
The subxact file is only create if there
+ * are any suxact info under this xid.
+ */
+typedef struct StreamXidHash

Let's slightly reword that part of the comment as "The subxact file
is created iff there is any subxact info under this xid."

2.
@@ -710,6 +740,9 @@ apply_handle_stream_stop(StringInfo s)
subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
stream_close_file();

+ /* Commit the per-stream transaction */
+ CommitTransactionCommand();

Before calling commit, ensure that we are in a valid transaction. I
think we can have an Assert for IsTransactionState().
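
That is, something like:

/* We must be inside a valid transaction at this point */
Assert(IsTransactionState());

/* Commit the per-stream transaction */
CommitTransactionCommand();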

3.
@@ -761,11 +791,13 @@ apply_handle_stream_abort(StringInfo s)

int64 i;
int64 subidx;
- int fd;
+ BufFile *fd;
bool found = false;
char path[MAXPGPATH];
+ StreamXidHash *ent;

subidx = -1;
+ ensure_transaction();
subxact_info_read(MyLogicalRepWorker->subid, xid);

Why call ensure_transaction here? Is there any reason that we won't
have a valid transaction by now? If not, then it's better to have an
Assert for IsTransactionState().

4.
- if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+ if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
  {
- int save_errno = errno;
+ int save_errno = errno;
- CloseTransientFile(fd);
+ BufFileClose(fd);

On error, won't these files be closed automatically? If so, why do
we need to close them here and before the other errors?

5.
if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
{
int save_errno = errno;

BufFileClose(fd);
errno = save_errno;
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not read file \"%s\": %m",

Can we change the error message to "could not read from streaming
transactions file .." or something like that? Similarly, we can
change the message for a failure to read the changes file.

6.
if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
{
int save_errno = errno;

BufFileClose(fd);
errno = save_errno;
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not write to file \"%s\": %m",

Similar to the previous comment, can we change it to "could not
write to streaming transactions file ..."?

7.
@@ -2855,17 +2844,32 @@ stream_open_file(Oid subid, TransactionId xid,
bool first_segment)
  * for writing, in append mode.
  */
  if (first_segment)
- flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
- else
- flags = (O_WRONLY | O_APPEND | PG_BINARY);
+ {
+ /*
+ * Shared fileset handle must be allocated in the persistent context.
+ */
+ SharedFileSet *fileset =
+ MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
- stream_fd = OpenTransientFile(path, flags);
+ PrepareTempTablespaces();
+ SharedFileSetInit(fileset, NULL);

Why are we calling PrepareTempTablespaces here? It is already called
in SharedFileSetInit.

8.
+ /*
+ * Start a transaction on stream start, this transaction will be committed
+ * on the stream stop.  We need the transaction for handling the buffile,
+ * used for serializing the streaming data and subxact info.
+ */
+ ensure_transaction();

I think we need this for PrepareTempTablespaces to set the temp
tablespaces. Also, isn't it required for cleanup of buffile resources
at transaction end? Are there any other reasons for it as well? The
comment should be a bit clearer about why we need a transaction here.

9.
* Open a file for streamed changes from a toplevel transaction identified
* by stream_xid (global variable). If it's the first chunk of streamed
* changes for this transaction, perform cleanup by removing existing
* files after a possible previous crash.
..
stream_open_file(Oid subid, TransactionId xid, bool first_segment)

The above comment atop stream_open_file needs to be updated for the
new implementation.

10.
* enabled. This context is reeset on each stream stop.
*/
LogicalStreamingContext = AllocSetContextCreate(ApplyContext,

/reeset/reset

11.
stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
{
..
+ /* No entry created for this xid so simply return. */
+ if (ent == NULL)
+ return;
..
}

Is there any reason or scenario where this ent can be NULL? If not,
it would be better to have an Assert for it.

12.
subxact_info_write(Oid subid, TransactionId xid)
{
..
+ /*
+ * If there is no subtransaction then nothing to do,  but if already have
+ * subxact file then delete that.
+ */
+ if (nsubxacts == 0)
  {
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create file \"%s\": %m",
- path)));
+ if (ent->subxact_fileset)
+ {
+ cleanup_subxact_info();
+ BufFileDeleteShared(ent->subxact_fileset, path);
+ ent->subxact_fileset = NULL;
..
}

Here, don't we need to free the subxact_fileset before setting it to NULL?
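
I.e., something like:

if (ent->subxact_fileset)
{
    cleanup_subxact_info();
    BufFileDeleteShared(ent->subxact_fileset, path);
    pfree(ent->subxact_fileset);
    ent->subxact_fileset = NULL;
}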

13.
+ /*
+ * Scan complete hash and delete the underlying files for the the xids.
+ * Also delete the memory for the shared file sets.
+ */

/the the/the. Instead of "delete the memory", it would be better to
say "release the memory".

14.
+ /*
+ * We might not have created the suxact fileset if there is no sub
+ * transaction.
+ */

/suxact/subxact

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#369Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#357)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think one of the usages we still need is in ReorderBufferForget,
because it can be called when we skip processing the txn. See the
comments in DecodeCommit where we call this function. If I am
correct, we probably need to collect all invalidations in
ReorderBufferTxn, as we are collecting tuplecids, and use them here.
We can do the same during processing of XLOG_XACT_INVALIDATIONS.

One more point related to this is that after this patch series, we
need to consider executing all invalidations during transaction
abort, because it is possible that due to memory overflow we have
processed some of the messages which also contain a few
XACT_INVALIDATION messages; so, to avoid cache pollution, we need to
execute all of them on abort. We also do a similar thing in
Rollback/Rollback To Savepoint, see AtEOXact_Inval and
AtEOSubXact_Inval.

I have analyzed this further and I think there is a problem with
that approach. If, instead of keeping each invalidation as an
individual change, we combine them in ReorderBufferTxn's invalidation
list, then what happens if the (sub)transaction is aborted?
Basically, in that case, we will end up executing all those
invalidations even though we never polluted the cache, because we
never tried to stream the transaction. So this will affect the normal
case where we haven't streamed the transaction, because we would
execute the invalidations logged by aborted transactions every time.
One way out is to build the list at the sub-transaction level and,
just before sending the transaction (on commit), combine all the
(sub)transactions' invalidation lists. But since we already have the
invalidations in the commit record, I think there is no point in
adding this complexity.
But my main worry is about streaming transactions; the problems are:
- Immediately on the arrival of an individual invalidation, we cannot
directly add it to the top-level transaction's invalidation list,
because if the transaction is later aborted before we stream it (or
we stream directly on commit), we will end up with an unnecessarily
long list of invalidations produced by aborted subtransactions.
- If we keep collecting them in the individual subtransaction's
ReorderBufferTxn->invalidations, then the problem is when to merge
them. I think it is a good idea to merge them all as soon as we try
to stream, or on commit (see the sketch below). Since this way of
combining the (sub)transactions' invalidations is required for the
streaming case anyway, we can use it as the common solution, whether
the transaction is streamed due to memory overflow or sent at
commit.
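
To make the second option concrete, here is a hypothetical sketch of
the merge step; merge_subxact_invalidations is an illustrative name
and the allocation context is glossed over, but invalidations and
ninvalidations are the fields this patch series adds to
ReorderBufferTxn:

static void
merge_subxact_invalidations(ReorderBufferTxn *txn,
                            ReorderBufferTxn *subtxn)
{
    Size    sz = sizeof(SharedInvalidationMessage) * subtxn->ninvalidations;

    if (subtxn->ninvalidations == 0)
        return;

    if (txn->ninvalidations == 0)
        txn->invalidations = (SharedInvalidationMessage *) palloc(sz);
    else
        txn->invalidations = (SharedInvalidationMessage *)
            repalloc(txn->invalidations,
                     sizeof(SharedInvalidationMessage) *
                     (txn->ninvalidations + subtxn->ninvalidations));

    /* append the subxact's messages after those collected so far */
    memcpy(txn->invalidations + txn->ninvalidations,
           subtxn->invalidations, sz);
    txn->ninvalidations += subtxn->ninvalidations;
}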

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#370Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#369)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 16, 2020 at 7:49 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think one of the usages we still need is in ReorderBufferForget,
because it can be called when we skip processing the txn. See the
comments in DecodeCommit where we call this function. If I am
correct, we probably need to collect all invalidations in
ReorderBufferTxn, as we are collecting tuplecids, and use them here.
We can do the same during processing of XLOG_XACT_INVALIDATIONS.

One more point related to this is that after this patch series, we
need to consider executing all invalidations during transaction
abort, because it is possible that due to memory overflow we have
processed some of the messages which also contain a few
XACT_INVALIDATION messages; so, to avoid cache pollution, we need to
execute all of them on abort. We also do a similar thing in
Rollback/Rollback To Savepoint, see AtEOXact_Inval and
AtEOSubXact_Inval.

I have analyzed this further and I think there is a problem with
that approach. If, instead of keeping each invalidation as an
individual change, we combine them in ReorderBufferTxn's invalidation
list, then what happens if the (sub)transaction is aborted?
Basically, in that case, we will end up executing all those
invalidations even though we never polluted the cache, because we
never tried to stream the transaction. So this will affect the normal
case where we haven't streamed the transaction, because we would
execute the invalidations logged by aborted transactions every time.
One way out is to build the list at the sub-transaction level and,
just before sending the transaction (on commit), combine all the
(sub)transactions' invalidation lists. But since we already have the
invalidations in the commit record, I think there is no point in
adding this complexity.
But my main worry is about streaming transactions; the problems are:
- Immediately on the arrival of an individual invalidation, we cannot
directly add it to the top-level transaction's invalidation list,
because if the transaction is later aborted before we stream it (or
we stream directly on commit), we will end up with an unnecessarily
long list of invalidations produced by aborted subtransactions.

Is there any problem you see with this, or are you concerned about
efficiency? Please note that we already do something similar in
ReorderBufferForget, and if your concern is efficiency, then that
applies to the existing cases as well. I think we can improve it
later in many ways, one of which you have already suggested; at this
time the main thing is correctness, and aborts are not frequent
enough to worry too much about their performance.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#371Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#370)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jun 17, 2020 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 16, 2020 at 7:49 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think one of the usages we still need is in ReorderBufferForget,
because it can be called when we skip processing the txn. See the
comments in DecodeCommit where we call this function. If I am
correct, we probably need to collect all invalidations in
ReorderBufferTxn, as we are collecting tuplecids, and use them here.
We can do the same during processing of XLOG_XACT_INVALIDATIONS.

One more point related to this is that after this patch series, we
need to consider executing all invalidations during transaction
abort, because it is possible that due to memory overflow we have
processed some of the messages which also contain a few
XACT_INVALIDATION messages; so, to avoid cache pollution, we need to
execute all of them on abort. We also do a similar thing in
Rollback/Rollback To Savepoint, see AtEOXact_Inval and
AtEOSubXact_Inval.

I have analyzed this further and I think there is a problem with
that approach. If, instead of keeping each invalidation as an
individual change, we combine them in ReorderBufferTxn's invalidation
list, then what happens if the (sub)transaction is aborted?
Basically, in that case, we will end up executing all those
invalidations even though we never polluted the cache, because we
never tried to stream the transaction. So this will affect the normal
case where we haven't streamed the transaction, because we would
execute the invalidations logged by aborted transactions every time.
One way out is to build the list at the sub-transaction level and,
just before sending the transaction (on commit), combine all the
(sub)transactions' invalidation lists. But since we already have the
invalidations in the commit record, I think there is no point in
adding this complexity.
But my main worry is about streaming transactions; the problems are:
- Immediately on the arrival of an individual invalidation, we cannot
directly add it to the top-level transaction's invalidation list,
because if the transaction is later aborted before we stream it (or
we stream directly on commit), we will end up with an unnecessarily
long list of invalidations produced by aborted subtransactions.

Is there any problem you see with this, or are you concerned about
efficiency? Please note that we already do something similar in
ReorderBufferForget, and if your concern is efficiency, then that
applies to the existing cases as well. I think we can improve it
later in many ways, one of which you have already suggested; at this
time the main thing is correctness, and aborts are not frequent
enough to worry too much about their performance.

As of now, I don't see a problem; I was just concerned about
processing more invalidation messages in the aborted cases compared
to the current code, even when streaming is off or the transaction is
never streamed because the memory limit is not crossed. But I agree
that this happens only on abort, so I will work on it, and later
maybe we can test the performance.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#372Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#368)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 16, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have a few more comments on the patch
0013-Change-buffile-interface-required-for-streaming-.patch:

Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru:

changes_filename(char *path, Oid subid, TransactionId xid)
 {
- char tempdirpath[MAXPGPATH];
-
- TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
- /*
- * We might need to create the tablespace's tempfile directory, if no
- * one has yet done so.
- */
- if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create directory \"%s\": %m",
- tempdirpath)));
-
- snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
- tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+ snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);

Today, I was studying this change and its impact. Initially, I
thought that because the patch has removed the pgsql_tmp prefix from
the filename, it might create problems if temporary files remain on
disk after a crash. But now that the patch uses the BufFile
interface, that is taken care of internally, because the files get
names like
"base/pgsql_tmp/pgsql_tmp13774.0.sharedfileset/16393-513.changes.0".
Basically, it ensures the file is created under a directory whose
name starts with pgsql_tmp. I have tried crashing the server in a
situation where the temp files remain, and after the restart they are
removed. So it seems okay to generate file names like that, but I
still suggest testing other paths, like backup, where we ignore files
whose names start with PG_TEMP_FILE_PREFIX.
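
For reference, a minimal sketch of the kind of name-based filter such
paths apply (illustrative, not the exact basebackup code):

/* inside a readdir() loop; "de" is the current directory entry */
if (strncmp(de->d_name, PG_TEMP_FILE_PREFIX,
            strlen(PG_TEMP_FILE_PREFIX)) == 0)
    continue;           /* name looks like a temp file/dir, skip it */

Since the SharedFileSet files live under base/pgsql_tmp/..., the
directory-level check should already cover them, but it seems worth
verifying.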

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#373Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#356)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jun 7, 2020 at 5:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Jun 5, 2020 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me know what you think of the changes? If you find them okay,
then feel to include them in the next patch-set.

[1] - /messages/by-id/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com

Thanks for the patch, I will review it and include it in my next version.

I have merged your changes 0002 in this version.

Okay, I have done a review of
0002-Issue-individual-invalidations-with-wal_level-lo.patch, and
below are my comments:

1. I don't think it is a good idea for logical decoding to process
both the new XLOG_XACT_INVALIDATIONS and the existing WAL records for
invalidations, like XLOG_INVALIDATIONS and what we do in DecodeCommit
(see the code in the check "if (parsed->nmsgs > 0)"). I think if that
is required for some particular reason, then we should write detailed
comments about it. I have tried some experiments to see if those are
really required:
a. After applying patch 0002, I tried commenting out the processing
of invalidations via DecodeCommit and found some regression tests
failing, but the reason for the failures was that we were not setting
RBTXN_HAS_CATALOG_CHANGES for the toptxn when a subtxn has catalog
changes; once I did that, all regression tests passed. See the
attached diff patch
(v27-0003-Incremental-patch-for-0002-to-test-removal-of-du) atop the
0002 patch.
b. The processing of invalidations for XLOG_INVALIDATIONS was added
by commit c6ff84b06a for xid-less transactions. See
https://postgr.es/m/CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
to know why it was added. Now, after this patch, we will process the
same invalidations via both XLOG_XACT_INVALIDATIONS and
XLOG_INVALIDATIONS, which doesn't seem warranted. Also, the below
assertion will fail for xid-less transactions (try a CREATE INDEX
CONCURRENTLY statement):
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ Assert(TransactionIdIsValid(xid));

I feel we don't need the processing of XLOG_INVALIDATIONS in logical
decoding after this patch, but to prove that we first need to write a
test case that requires XLOG_INVALIDATIONS on HEAD, as commit
c6ff84b06a doesn't add one. I think we need two code paths for
XLOG_XACT_INVALIDATIONS: if the record is for a xid-less transaction,
execute the actions immediately, as we do when processing
XLOG_INVALIDATIONS; otherwise, do what the patch currently does. If
the above point (b) is correct, I am not sure it is a good idea to
use RM_XACT_ID as the resource manager for this WAL record in
LogLogicalInvalidations; what do you think?
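
A rough sketch of the two code paths being suggested, reusing the
names from the snippet above (xl_xact_invalidations and
ReorderBufferAddInvalidation come from the patch; the rest exists in
core); treat it as illustrative only:

case XLOG_XACT_INVALIDATIONS:
    {
        TransactionId xid;
        xl_xact_invalidations *invals;

        xid = XLogRecGetXid(r);
        invals = (xl_xact_invalidations *) XLogRecGetData(r);

        if (!TransactionIdIsValid(xid))
            /*
             * xid-less record (e.g. CREATE INDEX CONCURRENTLY):
             * execute the actions immediately, as is done today for
             * XLOG_INVALIDATIONS.
             */
            ReorderBufferImmediateInvalidation(ctx->reorder,
                                               invals->nmsgs,
                                               invals->msgs);
        else
            /* otherwise queue them, as the patch currently does */
            ReorderBufferAddInvalidation(ctx->reorder, xid,
                                         buf->origptr,
                                         invals->nmsgs, invals->msgs);
    }
    break;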

I think one of the usages we still need is in ReorderBufferForget,
because it can be called when we skip processing the txn. See the
comments in DecodeCommit where we call this function. If I am
correct, we probably need to collect all invalidations in
ReorderBufferTxn, as we are collecting tuplecids, and use them here.
We can do the same during processing of XLOG_XACT_INVALIDATIONS.

I had also thought a bit about removing the logging of invalidations
at commit time altogether, but it seems hot-standby processing is
somewhat tightly coupled with the existing WAL logging. See
xact_redo_commit (the comment atop the call to
ProcessCommittedInvalidationMessages); it says we need to maintain
the order in which we process invalidations. If we can later find a
way to avoid that, we can probably remove it, but for now maybe we
can live with it.

Yes, I have made the changes. Basically, now I am only using
XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
So whenever we get a new set of XLOG_XACT_INVALIDATIONS, we directly
append it to txn->invalidations. I have tested the
XLOG_INVALIDATIONS part, but while sending this mail I realized that
we could write an automated test for it. I will work on that soon.

2.
+ /* not expected, but print something anyway */
+ else if (msg->id == SHAREDINVALSMGR_ID)
+ appendStringInfoString(buf, " smgr");
+ /* not expected, but print something anyway */
+ else if (msg->id == SHAREDINVALRELMAP_ID)

I think the above comment is not valid after we started logging at CCI.

Yup, fixed.

3.
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ Assert(TransactionIdIsValid(xid));
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Here, it should check !ctx->forward as we do in DecodeCommit; do we
have any reason for not doing so? We can test this once by changing it.

Yeah, it should have this check.

Mostly this version contains changes in 0002; apart from that, we
needed some changes in 0005 and 0006 to rebase on 0002, and there is
one bug fix in 0005: txn->snapshot_now was not getting set to NULL
after being freed, so it was getting double-freed. I have also
removed the extra wait events from 0014, as BufFile already logs the
wait events internally, and made some changes because the
BufFileWrite interface changed in recent commits.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v28.tar (application/x-tar)
v28/v28-0009-Enable-streaming-for-all-subscription-TAP-tests.patch:

From 968d1b2e3331596287094f673d99db766c9b2735 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v28 09/14] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318fc7c..6f7bedc130 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f871a..4ba80869b9 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

v28/v28-0014-Worker-tempfile-use-the-shared-buffile-infrastru.patch:

From b1fcf35b029f6d0c4de1e63ee415893d25113bc2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 16:42:07 +0530
Subject: [PATCH v28 14/14] Worker tempfile use the shared buffile
 infrastructure

To be merged with 0008; kept separate to make the review easy.
---
 src/backend/replication/logical/worker.c | 580 +++++++++++------------
 1 file changed, 268 insertions(+), 312 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index d2d9469999..28aad1d6ff 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -56,6 +56,7 @@
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -85,6 +86,7 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -123,10 +125,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a xid we create this entry in the
+ * xidhash and we also create the streaming file and store the fileset handle.
+ * So that on the subsequent stream for the xid we can search the entry in the
+ * hash and get the fileset handle.  The subxact file is only create if there
+ * are any suxact info under this xid.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
-static MemoryContext LogicalStreamingContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -136,15 +154,26 @@ bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
 /* fields valid only when processing streamed transaction */
-bool	in_streamed_transaction = false;
+bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
-static int	stream_fd = -1;
+
+/*
+ * Hash table for storing the streaming xid information along with shared file
+ * set for streaming and subxact files.  On every stream start we need to open
+ * the xid's files and for that we need the shared file set handle.  So storing
+ * it in xid hash make it faster to search.
+ */
+static HTAB *xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
 
 typedef struct SubXactInfo
 {
-	TransactionId xid;						/* XID of the subxact */
-	off_t           offset;					/* offset in the file */
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
 } SubXactInfo;
 
 static uint32 nsubxacts = 0;
@@ -171,13 +200,6 @@ static void stream_open_file(Oid subid, TransactionId xid, bool first);
 static void stream_write_change(char action, StringInfo s);
 static void stream_close_file(void);
 
-/*
- * Array of serialized XIDs.
- */
-static int	nxids = 0;
-static int	maxnxids = 0;
-static TransactionId	*xids = NULL;
-
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
@@ -275,7 +297,7 @@ handle_streamed_transaction(const char action, StringInfo s)
 	if (!in_streamed_transaction)
 		return false;
 
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 	Assert(TransactionIdIsValid(stream_xid));
 
 	/*
@@ -666,31 +688,39 @@ static void
 apply_handle_stream_start(StringInfo s)
 {
 	bool		first_segment;
+	HASHCTL		hash_ctl;
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * Start a transaction on stream start, this transaction will be committed
+	 * on the stream stop.  We need the transaction for handling the buffile,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
 	/* notify handle methods we're processing a remote transaction */
 	in_streamed_transaction = true;
 
 	/* extract XID of the top-level transaction */
 	stream_xid = logicalrep_read_stream_start(s, &first_segment);
 
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
 	/* open the spool file for this transaction */
 	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
 
-	/*
-	 * if this is not the first segment, open existing file
-	 *
-	 * XXX Note that the cleanup is performed by stream_open_file.
-	 */
+	/* if this is not the first segment, open existing file */
 	if (!first_segment)
-	{
-		MemoryContext oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
-
-		/* Read the subxacts info in per-stream context. */
 		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
-		MemoryContextSwitchTo(oldctx);
-	}
 
 	pgstat_report_activity(STATE_RUNNING, NULL);
 }
@@ -710,6 +740,9 @@ apply_handle_stream_stop(StringInfo s)
 	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
 	stream_close_file();
 
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
 	in_streamed_transaction = false;
 
 	/* Reset per-stream context */
@@ -736,10 +769,7 @@ apply_handle_stream_abort(StringInfo s)
 	 * just delete the files with serialized info.
 	 */
 	if (xid == subxid)
-	{
 		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
-		return;
-	}
 	else
 	{
 		/*
@@ -761,11 +791,13 @@ apply_handle_stream_abort(StringInfo s)
 
 		int64		i;
 		int64		subidx;
-		int			fd;
+		BufFile    *fd;
 		bool		found = false;
 		char		path[MAXPGPATH];
+		StreamXidHash *ent;
 
 		subidx = -1;
+		ensure_transaction();
 		subxact_info_read(MyLogicalRepWorker->subid, xid);
 
 		/* XXX optimize the search by bsearch on sorted data */
@@ -787,33 +819,32 @@ apply_handle_stream_abort(StringInfo s)
 		{
 			/* Cleanup the subxact info */
 			cleanup_subxact_info();
+			CommitTransactionCommand();
 			return;
 		}
 
 		Assert((subidx >= 0) && (subidx < nsubxacts));
 
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
 		changes_filename(path, MyLogicalRepWorker->subid, xid);
-		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
-		if (fd < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not open file \"%s\": %m",
-							path)));
-		}
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
 
-		/* OK, truncate the file at the right offset. */
-		if (ftruncate(fd, subxacts[subidx].offset))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not truncate file \"%s\": %m", path)));
-		CloseTransientFile(fd);
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
 
 		/* discard the subxacts added later */
 		nsubxacts = subidx;
 
 		/* write the updated subxact list */
 		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
 	}
 }
 
@@ -823,16 +854,16 @@ apply_handle_stream_abort(StringInfo s)
 static void
 apply_handle_stream_commit(StringInfo s)
 {
-	int			fd;
 	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
-
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
+	bool		found;
 	LogicalRepCommitData commit_data;
-
+	StreamXidHash *ent;
 	MemoryContext oldcxt;
+	BufFile    *fd;
 
 	Assert(!in_streamed_transaction);
 
@@ -840,25 +871,21 @@ apply_handle_stream_commit(StringInfo s)
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
 
-	/* open the spool file for the committed transaction */
-	changes_filename(path, MyLogicalRepWorker->subid, xid);
-
 	elog(DEBUG1, "replaying changes from file '%s'", path);
 
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-	}
-
 	ensure_transaction();
-
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	buffer = palloc(8192);
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
 	initStringInfo(&s2);
 
 	MemoryContextSwitchTo(oldcxt);
@@ -881,9 +908,7 @@ apply_handle_stream_commit(StringInfo s)
 		int			len;
 
 		/* read length of the on-disk record */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		nbytes = read(fd, &len, sizeof(len));
-		pgstat_report_wait_end();
+		nbytes = BufFileRead(fd, &len, sizeof(len));
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -894,7 +919,7 @@ apply_handle_stream_commit(StringInfo s)
 		{
 			int			save_errno = errno;
 
-			CloseTransientFile(fd);
+			BufFileClose(fd);
 			errno = save_errno;
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -908,19 +933,17 @@ apply_handle_stream_commit(StringInfo s)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		if (read(fd, buffer, len) != len)
+		if (BufFileRead(fd, buffer, len) != len)
 		{
 			int			save_errno = errno;
 
-			CloseTransientFile(fd);
+			BufFileClose(fd);
 			errno = save_errno;
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read file: %m")));
 			return;
 		}
-		pgstat_report_wait_end();
 
 		/* copy the buffer to the stringinfo and call apply_dispatch */
 		resetStringInfo(&s2);
@@ -948,15 +971,11 @@ apply_handle_stream_commit(StringInfo s)
 		 */
 		send_feedback(InvalidXLogRecPtr, false, false);
 	}
-
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	BufFileClose(fd);
 
 	/*
-	 * Update origin state so we can restart streaming from correct
-	 * position in case of crash.
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
 	 */
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
@@ -1946,12 +1965,39 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 static void
 worker_onexit(int code, Datum arg)
 {
-	int	i;
+	HASH_SEQ_STATUS status;
+	StreamXidHash *ent;
+	char		path[MAXPGPATH];
+
+	/* nothing to clean */
+	if (xidhash == NULL)
+		return;
+
+	/*
+	 * Scan complete hash and delete the underlying files for the the xids.
+	 * Also delete the memory for the shared file sets.
+	 */
+	hash_seq_init(&status, xidhash);
+	while ((ent = (StreamXidHash *) hash_seq_search(&status)) != NULL)
+	{
+		changes_filename(path, MyLogicalRepWorker->subid, ent->xid);
+		BufFileDeleteShared(ent->stream_fileset, path);
+		pfree(ent->stream_fileset);
 
-	elog(LOG, "cleanup files for %d transactions", nxids);
+		/*
+		 * We might not have created the suxact fileset if there is no sub
+		 * transaction.
+		 */
+		if (ent->subxact_fileset)
+		{
+			subxact_filename(path, MyLogicalRepWorker->subid, ent->xid);
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+		}
+	}
 
-	for (i = nxids-1; i >= 0; i--)
-		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+	/* Remove the xid hash */
+	hash_destroy(xidhash);
 }
 
 /*
@@ -1976,7 +2022,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	 */
 	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
 													"LogicalStreamingContext",
-													 ALLOCSET_DEFAULT_SIZES);
+													ALLOCSET_DEFAULT_SIZES);
 
 	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
 	before_shmem_exit(worker_onexit, (Datum) 0);
@@ -2085,7 +2131,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -2441,64 +2487,60 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 static void
 subxact_info_write(Oid subid, TransactionId xid)
 {
-	int			fd;
 	char		path[MAXPGPATH];
+	bool		found;
 	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
 
 	Assert(TransactionIdIsValid(xid));
 
 	subxact_filename(path, subid, xid);
 
-	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	len = sizeof(SubXactInfo) * nsubxacts;
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must found the entry for its top transaction by this time */
+	Assert(found);
 
-	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	/*
+	 * If there is no subtransaction then nothing to do,  but if already have
+	 * subxact file then delete that.
+	 */
+	if (nsubxacts == 0)
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			ent->subxact_fileset = NULL;
+		}
 		return;
 	}
 
-	if ((len > 0) && (write(fd, subxacts, len) != len))
+	/*
+	 * Create the subxact file if it not already created, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
 	{
-		int			save_errno = errno;
+		ent->subxact_fileset =
+			MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
 
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
-		return;
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
 	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
 
-	pgstat_report_wait_end();
+	len = sizeof(SubXactInfo) * nsubxacts;
 
-	/*
-	 * We don't need to fsync or anything, as we'll recreate the files after a
-	 * crash from scratch. So just close the file.
-	 */
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+
+	BufFileWrite(fd, subxacts, len);
+	BufFileClose(fd);
 
 	/*
 	 * But we free the memory allocated for subxact info. There might be one
@@ -2513,41 +2555,45 @@ subxact_info_write(Oid subid, TransactionId xid)
  *	  Restore information about subxacts of a streamed transaction.
  *
  * Read information about subxacts into the global variables.
- *
- * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
  */
 static void
 subxact_info_read(Oid subid, TransactionId xid)
 {
-	int			fd;
 	char		path[MAXPGPATH];
+	bool		found;
 	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
 
 	Assert(TransactionIdIsValid(xid));
 	Assert(!subxacts);
 	Assert(nsubxacts == 0);
 	Assert(nsubxacts_max == 0);
 
-	subxact_filename(path, subid, xid);
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
 
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
 		return;
-	}
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
 
 	/* read number of subxact items */
-	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
 	{
 		int			save_errno = errno;
 
-		CloseTransientFile(fd);
+		BufFileClose(fd);
 		errno = save_errno;
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -2556,29 +2602,27 @@ subxact_info_read(Oid subid, TransactionId xid)
 		return;
 	}
 
-	pgstat_report_wait_end();
-
 	len = sizeof(SubXactInfo) * nsubxacts;
 
 	/* we keep the maximum as a power of 2 */
 	nsubxacts_max = 1 << my_log2(nsubxacts);
 
 	/*
-	 * Let the caller decide which memory context it will be allocated.
-	 * Ideally, during stream start it will be allocated in the
-	 * LogicalStreamingContext which will be reset on stream stop, and
-	 * during the stream abort we need this memory only for short term so
-	 * it will be allocated in ApplyMessageContext.
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need this information for the whole duration of the stream, so that we
+	 * can keep adding subtransaction info to it.  On stream stop we flush
+	 * this information to the subxact file and reset the logical streaming
+	 * context.
 	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
 	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
-
-	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
 	{
 		int			save_errno = errno;
 
-		CloseTransientFile(fd);
+		BufFileClose(fd);
 		errno = save_errno;
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -2587,12 +2631,7 @@ subxact_info_read(Oid subid, TransactionId xid)
 		return;
 	}
 
-	pgstat_report_wait_end();
-
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	BufFileClose(fd);
 }
 
 /*
@@ -2606,7 +2645,7 @@ subxact_info_add(TransactionId xid)
 
 	/* We must have a valid top level stream xid and a stream fd. */
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd >= 0);
+	Assert(stream_fd != NULL);
 
 	/*
 	 * If the XID matches the toplevel transaction, we don't want to add it.
@@ -2658,7 +2697,13 @@ subxact_info_add(TransactionId xid)
 	}
 
 	subxacts[nsubxacts].xid = xid;
-	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	/*
+	 * Get the current offset in the stream file and store it as the offset
+	 * of this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
 
 	nsubxacts++;
 }
@@ -2667,44 +2712,14 @@ subxact_info_add(TransactionId xid)
 static void
 subxact_filename(char *path, Oid subid, TransactionId xid)
 {
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 */
-	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m",
-						tempdirpath)));
-
-	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
-			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
 }
 
 /* format filename for file containing serialized changes */
-static void
+static inline void
 changes_filename(char *path, Oid subid, TransactionId xid)
 {
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 */
-	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m",
-						tempdirpath)));
-
-	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
-			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
 }
 
 /*
@@ -2721,60 +2736,31 @@ changes_filename(char *path, Oid subid, TransactionId xid)
 static void
 stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
 {
-	int			i;
 	char		path[MAXPGPATH];
-	bool		found = false;
+	StreamXidHash *ent;
 
-	subxact_filename(path, subid, xid);
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
 
-	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
+	/* If no entry was created for this xid, simply return. */
+	if (ent == NULL)
+		return;
 
+	/* Delete the change file and release the stream fileset memory */
 	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
 
-	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
-
-	/*
-	 * Cleanup the XID from the array - find the XID in the array and
-	 * remove it by shifting all the remaining elements. The array is
-	 * bound to be fairly small (maximum number of in-progress xacts,
-	 * so max_connections + max_prepared_transactions) so simply loop
-	 * through the array and find index of the XID. Then move the rest
-	 * of the array by one element to the left.
-	 *
-	 * Notice we also call this from stream_open_file for first segment
-	 * of each transaction, to deal with possible left-overs after a
-	 * crash, so it's entirely possible not to find the XID in the
-	 * array here. In that case we don't remove anything.
-	 *
-	 * XXX Perhaps it'd be better to handle this automatically after a
-	 * restart, instead of doing it over and over for each transaction.
-	 */
-	for (i = 0; i < nxids; i++)
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
 	{
-		if (xids[i] == xid)
-		{
-			found = true;
-			break;
-		}
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
 	}
-
-	if (!found)
-		return;
-
-	/*
-	 * Move the last entry from the array to the place. We don't keep
-	 * the streamed transactions sorted or anything - we only expect
-	 * a few of them in progress (max_connections + max_prepared_xacts)
-	 * so linear search is just fine.
-	 */
-	xids[i] = xids[nxids-1];
-	nxids--;
 }
 
 /*
@@ -2793,79 +2779,62 @@ static void
 stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 {
 	char		path[MAXPGPATH];
-	int			flags;
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
 
 	Assert(in_streamed_transaction);
 	Assert(OidIsValid(subid));
 	Assert(TransactionIdIsValid(xid));
-	Assert(stream_fd == -1);
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER | HASH_FIND,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context so that
+	 * they stay available until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
 
 	/*
-	 * If this is the first segment for this transaction, try removing
-	 * existing files (if there are any, possibly after a crash).
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
 	 */
 	if (first_segment)
 	{
-		MemoryContext	oldcxt;
-
-		/* XXX make sure there are no previous files for this transaction */
-		stream_cleanup_files(subid, xid, true);
-
-		/* Need to allocate this in permanent context */
-		oldcxt = MemoryContextSwitchTo(ApplyContext);
-
 		/*
-		 * We need to remember the XIDs we spilled to files, so that we can
-		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
-		 *
-		 * The number of XIDs we may need to track is fairly small, because
-		 * we can only stream toplevel xacts (so limited by max_connections
-		 * and max_prepared_transactions), and we only stream the large ones.
-		 * So we simply keep the XIDs in an unsorted array. If the number of
-		 * xacts gets large for some reason (e.g. very high max_connections),
-		 * a more elaborate approach might be better - e.g. sorted array, to
-		 * speed-up the lookups.
+		 * The shared fileset handle must be allocated in a persistent context.
 		 */
-		if (nxids == maxnxids)	/* array of XIDs is full */
-		{
-			if (!xids)
-			{
-				maxnxids = 64;
-				xids = palloc(maxnxids * sizeof(TransactionId));
-			}
-			else
-			{
-				maxnxids = 2 * maxnxids;
-				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
-			}
-		}
+		SharedFileSet *fileset =
+		MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
 
-		xids[nxids++] = xid;
+		PrepareTempTablespaces();
+		SharedFileSetInit(fileset, NULL);
+		stream_fd = BufFileCreateShared(fileset, path);
 
-		MemoryContextSwitchTo(oldcxt);
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
 	}
-
-	changes_filename(path, subid, xid);
-
-	elog(DEBUG1, "opening file '%s' for streamed changes", path);
-
-	/*
-	 * If this is the first streamed segment, the file must not exist, so
-	 * make sure we're the ones creating it. Otherwise just open the file
-	 * for writing, in append mode.
-	 */
-	if (first_segment)
-		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
 	else
-		flags = (O_WRONLY | O_APPEND | PG_BINARY);
-
-	stream_fd = OpenTransientFile(path, flags);
-
-	if (stream_fd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
+	{
+		/*
+		 * Open the file and seek to the end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+	MemoryContextSwitchTo(oldcxt);
 }
 
 /*
@@ -2880,12 +2849,12 @@ stream_close_file(void)
 {
 	Assert(in_streamed_transaction);
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 
-	CloseTransientFile(stream_fd);
+	BufFileClose(stream_fd);
 
 	stream_xid = InvalidTransactionId;
-	stream_fd = -1;
+	stream_fd = NULL;
 }
 
 /*
@@ -2907,34 +2876,21 @@ stream_write_change(char action, StringInfo s)
 
 	Assert(in_streamed_transaction);
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
-
 	/* first write the size */
-	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
+	BufFileWrite(stream_fd, &len, sizeof(len));
 
 	/* then the action */
-	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
+	BufFileWrite(stream_fd, &action, sizeof(action));
 
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
-	if (write(stream_fd, &s->data[s->cursor], len) != len)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
-
-	pgstat_report_wait_end();
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
 }
 
 /*
-- 
2.23.0

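A note for reviewers on the subxact file format used above: it is deliberately
trivial -- a fixed-size count followed by a flat array, written and read back
with single BufFile calls. A minimal reader sketch (error handling and memory
contexts omitted; it assumes nsubxacts is a uint32 as on the write side, and
that SubXactInfo is the {xid, fileno, offset} triple implied by the
BufFileTell() call in subxact_info_add):

/* sketch only; assumes storage/buffile.h and the usual palloc machinery */
typedef struct SubXactInfo
{
	TransactionId xid;		/* XID of the subtransaction */
	int			fileno;		/* buffile number of its first change */
	off_t		offset;		/* offset of its first change in that file */
} SubXactInfo;

static void
read_subxact_file(BufFile *fd, uint32 *nsubxacts, SubXactInfo **subxacts)
{
	/* the count comes first, as one fixed-size field ... */
	BufFileRead(fd, nsubxacts, sizeof(uint32));

	/* ... followed by the whole SubXactInfo array in a single read */
	*subxacts = palloc(*nsubxacts * sizeof(SubXactInfo));
	BufFileRead(fd, *subxacts, *nsubxacts * sizeof(SubXactInfo));
}

Keeping the format this simple is what lets subxact_info_write() simply
rewrite the file contents on each stream stop instead of updating in place.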
v28/v28-0012-Add-streaming-option-in-pg_dump.patch
From b8d589497cc84736059e0b44a2f266fb7ffd3f84 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v28 12/14] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index a41a3db876..d0fb24e5f8 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb3a9..af64270c55 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0

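The net effect is that the streaming option now round-trips through dump and
restore: for a subscription whose substream column is not 'f', the emitted
statement would look roughly like the following (object names made up):

    CREATE SUBSCRIPTION sub1 CONNECTION 'host=... dbname=...' PUBLICATION pub1
        WITH (connect = false, slot_name = 'sub1', streaming = on);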
v28/v28-0001-Immediately-WAL-log-subtransaction-and-top-level.patch
From d14573820273ded01a10c4f0f7c1cec752a1d653 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v28 01/14] Immediately WAL-log subtransaction and top-level
 XID association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we now also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead), but only when
wal_level=logical. We cannot remove the existing XLOG_XACT_ASSIGNMENT
record, as it is still required to avoid overflowing the hot standby
snapshot.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 ++++++++++-
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 44 +++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd30b62d36..04fd5ca870 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5118,6 +5120,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6020,3 +6023,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL-log the top-level XID for an
+ * operation in a subtransaction.  We require that for logical decoding; see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09e..c526bb1928 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4f46..a757baccfc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1197,6 +1197,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1235,6 +1236,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..0c0c371739 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 88025b1cc2..22bb96ca2a 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 347a38f57c..a5468c1037 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b0f2a6ed43..b976882229 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -304,6 +306,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0

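To visualize the change to the record format: the new chunk emitted by
XLogRecordAssemble() above is just a one-byte block ID followed by the raw
top-level XID, i.e.

    +---------------------------+---------------------------------+
    | XLR_BLOCK_ID_TOPLEVEL_XID | top-level TransactionId         |
    | (1 byte, value 252)       | (sizeof(TransactionId) bytes)   |
    +---------------------------+---------------------------------+

It is written at most once per record, after the replication-origin chunk (if
any) and before the main data, and only for the first record WAL-logged by a
given subtransaction under wal_level=logical. On the decoding side,
XLogRecGetTopXid() returns InvalidTransactionId whenever the chunk is absent,
which is what lets LogicalDecodingProcessRecord() do the assignment check
unconditionally for every record.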
v28/v28-0006-Bugfix-handling-of-incomplete-toast-spec-insert.patch
From f00a49fee0d859742e636b3d0e815711ab323a1a Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Wed, 17 Jun 2020 18:22:35 +0530
Subject: [PATCH v28 06/14] Bugfix handling of incomplete toast/spec insert

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 335 ++++++++++++++----
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  47 ++-
 5 files changed, 328 insertions(+), 75 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 287a185d9c..95dec05047 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3cbbf589ed..4d3c6f8f28 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 709f5f1d41..a5cb827b18 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -237,7 +252,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool partial_truncate);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -641,14 +657,91 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 	return txn;
 }
 
+/*
+ * Handle an incomplete tuple during streaming.  If streaming is enabled we
+ * may need to stream an in-progress transaction, but sometimes we get
+ * incomplete changes that cannot be streamed until the matching complete
+ * change arrives, e.g. a toast-table insert without the main-table insert.
+ * So this function remembers the LSN of the last complete change, and the
+ * size of the changes up to that LSN, so that when we need to stream we can
+ * stream only up to the last complete LSN.
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert, Size total_size)
+{
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * If this is the first incomplete change, remember the size of the
+	 * complete changes accumulated so far.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		(toast_insert || IsSpecInsert(change->action)))
+		toptxn->complete_size = total_size;
+
+	/*
+	 * If this is a toast insert, set the corresponding bit.  Both inserts
+	 * and updates on the main table perform inserts into the toast table,
+	 * and as explained in the function header we cannot stream toast-only
+	 * changes.  So we set the flag whenever we see a toast insert, and clear
+	 * it again on the next insert or update on the main table.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec-insert bit whenever we get a speculative insert, to
+	 * indicate a partial tuple, and clear it again on the speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If no change is incomplete after applying this one, record this LSN
+	 * as the last complete LSN.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)))
+	{
+		toptxn->last_complete_lsn = change->lsn;
+
+		/*
+		 * If the transaction is serialized and the changes in the top-level
+		 * transaction are now complete, stream the transaction immediately.
+		 * We don't wait for the memory limit to be hit again because, in
+		 * streaming mode, a serialized transaction means we already reached
+		 * the memory limit earlier but could not stream it then due to an
+		 * incomplete tuple; so we stream it as soon as the tuple becomes
+		 * complete.  Moreover, if we didn't stream the serialized changes
+		 * now and later got more incomplete changes in this transaction, we
+		 * would have no way to partially truncate the serialized changes.
+		 */
+		if (rbtxn_is_serialized(txn))
+			ReorderBufferStreamTXN(rb, toptxn);
+	}
+}
+
 /*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
+	Size	total_size = 0;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
@@ -660,9 +753,28 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * Get the total size of the top-level transaction before accounting for
+	 * the current change, so that if this change is incomplete we know the
+	 * size prior to it.  That is used to update the size of the complete
+	 * changes in the top-level transaction for streaming.
+	 */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			total_size = txn->toptxn->total_size;
+		else
+			total_size = txn->total_size;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* Handle the incomplete tuple, if streaming is enabled */
+	if (ReorderBufferCanStream(rb))
+		ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert,
+										   total_size);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -692,7 +804,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1405,11 +1517,45 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 /*
  * Discard changes from a transaction (and subtransactions), after streaming
  * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ * If partial_truncate is false we truncate the transaction completely,
+ * otherwise we truncate only up to last_complete_lsn.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 bool partial_truncate)
 {
 	dlist_mutable_iter iter;
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * A serialized transaction should never be partially truncated, because
+	 * once serialized, it is streamed as soon as its changes become complete.
+	 */
+	Assert(!(rbtxn_is_serialized(txn) && partial_truncate));
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1426,7 +1572,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, partial_truncate);
 	}
 
 	/* cleanup changes in the toplevel txn */
@@ -1436,30 +1582,19 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
+		/* We have truncated up to the last complete LSN, so stop. */
+		if (partial_truncate && (change->lsn > toptxn->last_complete_lsn))
+		{
+			/* The transaction must have incomplete changes. */
+			Assert(rbtxn_has_incomplete_tuple(toptxn));
+			break;
+		}
+
 		/* remove the change from it's containing list */
 		dlist_delete(&change->node);
-
 		ReorderBufferReturnChange(rb, change);
 	}
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The toplevel transaction, identified by (toptxn==NULL), is marked
-	 * as streamed always, even if it does not contain any changes (that
-	 * is, when all the changes are in subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts
-	 * for XIDs the downstream is not aware of. And of course, it always
-	 * knows about the toplevel xact (we send the XID in all messages),
-	 * but we never stream XIDs of empty subxacts.
-	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
 	 * any memory. We could also keep the hash table and update it with
@@ -1471,9 +1606,39 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
-	/* also reset the number of entries in the transaction */
-	txn->nentries_mem = 0;
-	txn->nentries = 0;
+	/*
+	 * Adjust nentries/nentries_mem based on the changes processed.  See
+	 * comments where nprocessed is declared.
+	 */
+	if (partial_truncate)
+	{
+		txn->nentries -= txn->nprocessed;
+		txn->nentries_mem -= txn->nprocessed;
+	}
+	else
+	{
+		txn->nentries = 0;
+		txn->nentries_mem = 0;
+	}
+	txn->nprocessed = 0;
+
+	/*
+	 * If this is a top-level transaction, we can reset last_complete_lsn
+	 * and complete_size, because by now we have streamed all the changes
+	 * up to last_complete_lsn.
+	 */
+	if (partial_truncate && (txn->toptxn == NULL))
+	{
+		toptxn->last_complete_lsn = InvalidXLogRecPtr;
+		toptxn->complete_size = 0;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
 }
 
 /*
@@ -1760,7 +1925,7 @@ ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
 								   ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed. */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Stop the stream. */
 	rb->stream_stop(rb, txn, last_lsn);
@@ -1792,6 +1957,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 	ReorderBufferChange *volatile specinsert = NULL;
 	volatile bool	stream_started = false;
+	volatile bool	partial_truncate = false;
+
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1814,6 +1981,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
+		ReorderBufferTXN *curtxn;
 
 		if (using_subtxn)
 			BeginInternalSubTransaction(streaming? "stream" : "replay");
@@ -1850,7 +2018,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			/* Set the xid for concurrent abort check. */
 			if (streaming)
-				SetupCheckXidLive(change->txn->xid);
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
 
 			switch (change->action)
 			{
@@ -2106,6 +2277,27 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
 			}
+
+			if (streaming)
+			{
+				/*
+				 * Increment the nprocessed count.  See the detailed comment
+				 * on its usage in the ReorderBufferTXN structure.
+				 */
+				curtxn->nprocessed++;
+
+				/*
+				 * If the transaction contains an incomplete tuple and this is
+				 * the last complete change, stop further processing of the
+				 * transaction and set the partial-truncate flag.
+				 */
+				if (rbtxn_has_incomplete_tuple(txn) &&
+					prev_lsn == txn->last_complete_lsn)
+				{
+					partial_truncate = true;
+					break;
+				}
+			}
 		}
 
 		/*
@@ -2125,7 +2317,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * Done with current changes, call stream_stop callback for streaming
-		 * transaction, commit callback otherwise.  If we have sent
+		 * transaction, commit callback otherwise.  Only if we have sent
 		 * start/begin.
 		 */
 		if (stream_started)
@@ -2176,7 +2368,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, partial_truncate);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2510,7 +2702,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2559,7 +2751,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2582,6 +2774,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2596,8 +2789,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2605,12 +2803,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2853,18 +3059,28 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
+		Size	size = 0;
+		Size	largest_size = 0;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/*
+		 * If this transaction has some incomplete changes, only consider
+		 * the size up to the last complete LSN.
+		 */
+		if (rbtxn_has_incomplete_tuple(txn))
+			size = txn->complete_size;
+		else
+			size = txn->total_size;
+
+		/* If the current transaction is larger, remember it. */
+		if ((largest == NULL || size > largest_size) && size > 0)
+		{
 			largest = txn;
+			largest_size = size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2902,27 +3118,22 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		 * Pick the largest transaction (or subtransaction) and evict it from
 		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		if (ReorderBufferCanStream(rb))
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
 		{
-			/*
-			* Pick the largest toplevel transaction and evict it from memory by
-			* streaming the already decoded part.
-			*/
-			txn = ReorderBufferLargestTopTXN(rb);
-
 			/* we know there has to be one, because the size is not zero */
 			Assert(txn && !txn->toptxn);
-			Assert(txn->size > 0);
-			Assert(rb->size >= txn->size);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
 		{
 			/*
-			* Pick the largest transaction (or subtransaction) and evict it from
-			* memory by serializing it to disk.
-			*/
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
 			txn = ReorderBufferLargestTXN(rb);
 
 			/* we know there has to be one, because the size is not zero */
@@ -2931,14 +3142,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(rb->size >= txn->size);
 
 			ReorderBufferSerializeTXN(rb, txn);
-		}
 
-		/*
-		 * After eviction, the transaction should have no entries in memory,
-		 * and should use 0 bytes for changes.
-		 */
-		Assert(txn->size == 0);
-		Assert(txn->nentries_mem == 0);
+			/*
+			 * After eviction, the transaction should have no entries in memory, and
+			 * should use 0 bytes for changes.
+			 */
+			Assert(txn->size == 0);
+			Assert(txn->nentries_mem == 0);
+		}
 	}
 
 	/* We must be under the memory limit now. */
@@ -3320,10 +3531,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
-
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
 }
 
 /*
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index c38f7345b9..6e3b24a801 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +192,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert, without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -198,10 +220,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -347,6 +365,23 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of the top-level transaction including subtransactions. */
+	Size		total_size;
+
+	/* Size of the complete changes. */
+	Size		complete_size;
+
+	/* LSN of the last complete change. */
+	XLogRecPtr	last_complete_lsn;
+
+	/*
+	 * Number of changes processed.  This is used to keep track of changes
+	 * that remain to be streamed.  As of now, this can happen either due to
+	 * toast tuples or speculative insertions, where we need to wait for
+	 * multiple changes before we can send them.
+	 */
+	uint64		nprocessed;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -534,7 +569,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool incomplete_data);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0

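A made-up walk-through of the flag machinery in
ReorderBufferHandleIncompleteTuple() may help (the LSNs are invented):

    change at 0/100: INSERT into toast table -> RBTXN_HAS_TOAST_INSERT set,
                                                complete_size frozen at the
                                                pre-change total
    change at 0/200: INSERT into toast table -> still incomplete
    change at 0/300: INSERT into main table  -> toast flag cleared, tuple
                                                complete, last_complete_lsn
                                                set to 0/300

If the memory limit is hit between 0/100 and 0/300, ReorderBufferLargestTopTXN()
considers only complete_size for this transaction, and ReorderBufferProcessTXN()
streams and truncates only up to last_complete_lsn, keeping the trailing toast
chunks in memory (or on disk) until the tuple completes.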
v28/v28-0011-Provide-new-api-to-get-the-streaming-changes.patch
From 6742cf370cdb756c6af77b3d1b0f4f95ff6e8fee Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v28 11/14] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 doc/src/sgml/test-decoding.sgml               | 22 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..eed6e9d134 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3db900d2e6..5e223f87f1 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1243,6 +1243,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e848..70c28ffa91 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes, disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7869f721da..875e0bef28 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10115,6 +10115,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0

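To make the behavior of the new SQL entry point explicit: the streaming
decision is a logical AND of what the output plugin can do and what the
caller asked for. A condensed sketch (assuming, as elsewhere in the series,
that ctx->streaming was initialized to true only when the plugin registered
all the stream_* callbacks):

    /*
     * plugin supports streaming | caller passed streaming | result
     * --------------------------+-------------------------+---------------
     * true                      | true                    | stream changes
     * true                      | false                   | spill to disk
     * false                     | true or false           | spill to disk
     */
    ctx->streaming &= streaming;

So the new function can only ever narrow the behavior; it can never force
streaming onto a plugin that does not support it.
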
v28/v28-0005-Implement-streaming-mode-in-ReorderBuffer.patch

From 723c24dbd42725251bb830e2a4fea3c62f5287c1 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Wed, 17 Jun 2020 18:20:30 +0530
Subject: [PATCH v28 05/14] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we
have in memory and invoke the new stream API methods. This happens in
ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().  However, if we have an incomplete toast tuple
or a speculative insert, we spill to disk because we cannot generate
the complete tuple and stream it.  As soon as we get the complete
tuple, we stream the transaction including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 755 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  26 +
 3 files changed, 743 insertions(+), 76 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 364a5bba6d..709f5f1d41 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +382,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -767,6 +781,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -1022,6 +1071,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1036,6 +1088,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1313,6 +1368,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1338,6 +1402,80 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn == NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1489,57 +1627,171 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In that case, if the
+ * (sub)transaction has made catalog changes, we might decode tuples using the
+ * wrong catalog version.  To detect a concurrent abort, we set CheckXidAlive
+ * to the xid of the (sub)transaction the current change belongs to.  During a
+ * catalog scan we can then check the status of that xid, and if it has
+ * aborted, report a specific error so that we can stop streaming the current
+ * transaction and discard the changes streamed so far.  We might have already
+ * streamed some of the changes for the aborted (sub)transaction, but that is
+ * fine because when we decode the abort we will stream an abort message to
+ * truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive then
+	 * there is nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether the xid has aborted; that will happen during catalog access.
+	 * Also, reset the bsysscan flag.
+	 */
+	if (!TransactionIdDidCommit(xid))
+	{
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse them while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.
+ */
+static void
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   Snapshot snapshot_now,
+								   CommandId command_id,
+								   XLogRecPtr last_lsn,
+								   ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+	ReorderBufferToastReset(rb, txn);
+	if (specinsert != NULL)
+		ReorderBufferReturnChange(rb, specinsert);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool	stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1562,21 +1814,44 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
-
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Start stream or begin transaction for the first change in the
+			 * current stream.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+					rb->stream_start(rb, txn, change->lsn);
+				else
+					rb->begin(rb, txn);
+				stream_started = true;
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1653,7 +1928,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1693,7 +1969,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1751,7 +2027,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1760,10 +2039,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1794,7 +2070,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1848,14 +2123,34 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes, so call the stream_stop callback
+		 * for a streaming transaction, or the commit callback otherwise,
+		 * but only if we actually sent a stream_start/begin.
+		 */
+		if (stream_started)
+		{
+			if (streaming)
+				rb->stream_stop(rb, txn, prev_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+			stream_started = false;
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1873,14 +2168,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1899,17 +2207,122 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/* Reset the CheckXidAlive */
+		if (streaming)
+			CheckXidAlive = InvalidTransactionId;
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If the error code is ERRCODE_TRANSACTION_ROLLBACK, that means we
+		 * have detected a concurrent abort of the (sub)transaction we are
+		 * streaming.  So just do the cleanup and return gracefully.
+		 * Otherwise, re-throw the error.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * We can only get this error in streaming mode, because only
+			 * then do we send in-progress transactions.
+			 */
+			Assert(streaming);
+
+			/*
+			 * In the TRY block we only stop the stream after we have sent
+			 * all the changes.  So if we have detected a concurrent abort,
+			 * the stream must not have been stopped yet.
+			 */
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
 
-		PG_RE_THROW();
+			/* Handle the concurrent abort. */
+			ReorderBufferHandleConcurrentAbort(rb, txn, snapshot_now,
+											   command_id, prev_lsn,
+											   specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.  We iterate over the top and
+ * subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1934,6 +2347,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2003,6 +2423,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2136,8 +2563,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2145,6 +2581,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2156,19 +2593,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2389,6 +2836,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming, so their size is always 0).
+ * But we can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction  at-a-time to evict and spill its changes to
@@ -2421,11 +2900,38 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStream(rb))
+		{
+			/*
+			 * Pick the largest toplevel transaction and evict it from memory
+			 * by streaming the already decoded part.
+			 */
+			txn = ReorderBufferLargestTopTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2723,6 +3229,103 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which attempts to
+ * stream it again before the commit)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/*
+	 * Decode the changes and send them to the output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3822,6 +4425,16 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases we assume the CID is from the future command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 24b4dd65d6..c38f7345b9 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -170,6 +170,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -189,6 +190,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -256,6 +275,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0

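To summarize the eviction policy added by 0005, the core of
ReorderBufferCheckMemoryLimit() now behaves roughly like this (a simplified
sketch of the patched function, with the post-eviction sanity checks elided):

    while (rb->size >= logical_decoding_work_mem * 1024L)
    {
        ReorderBufferTXN *txn;

        if (ReorderBufferCanStream(rb))
        {
            /* stream the largest toplevel transaction downstream */
            txn = ReorderBufferLargestTopTXN(rb);
            ReorderBufferStreamTXN(rb, txn);
        }
        else
        {
            /* no streaming support, spill the largest (sub)transaction */
            txn = ReorderBufferLargestTXN(rb);
            ReorderBufferSerializeTXN(rb, txn);
        }

        /* either way, the evicted transaction keeps no changes in memory */
        Assert(txn->nentries_mem == 0);
    }
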
v28/v28-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch

From dc0f94a442ff80513ae4cdec873483635eedbdf4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v28 04/14] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such an error, the
decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 50cfd6fa47..ab689f8d19 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 537913d1bb..287a185d9c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * tableam level API but this is called from many places so we need to
+	 * ensure it here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted. We can't directly use
+ * TransactionIdDidAbort because, after a crash, such a transaction might not
+ * have been marked as aborted.  See detailed comments at snapmgr.c where the
+ * variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..2f52b407c6 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..9f1ecd123f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for concurrent
+ * aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index eb18739c36..2b7d3df617 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0

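To recap the mechanism in 0004: before decoding a change of an in-progress
transaction, the reorder buffer sets CheckXidAlive to that (sub)transaction's
xid; systable_beginscan() then sets bsysscan = true, so the tableam/heap
entry points can reject direct scans that would bypass the abort check; and
after each catalog tuple is fetched, the scan layer re-runs the check below
(reproduced from genam.c in the patch above):

    if (TransactionIdIsValid(CheckXidAlive) &&
        !TransactionIdIsInProgress(CheckXidAlive) &&
        !TransactionIdDidCommit(CheckXidAlive))
        ereport(ERROR,
                (errcode(ERRCODE_TRANSACTION_ROLLBACK),
                 errmsg("transaction aborted during system catalog scan")));

The decoding side (patch 0005) catches ERRCODE_TRANSACTION_ROLLBACK in
ReorderBufferProcessTXN() and ends the stream gracefully instead of
re-throwing.
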
v28/v28-0013-Change-buffile-interface-required-for-streaming-.patch

From 258ac14f39cc3e8e4afad57d9e8e3a8bb50c6be8 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 16:40:25 +0530
Subject: [PATCH v28 13/14] Change buffile interface required for streaming
 transaction

Implement BufFileTruncateShared, and add SEEK_END support to BufFileSeek.
Also add an option to provide a mode while opening shared buffiles,
instead of always opening them in read-only mode.
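
As a concrete example of the truncate semantics (assuming the usual 1GB
MAX_PHYSICAL_FILESIZE segments): for a BufFile with segments 0..2,
BufFileTruncateShared(file, 1, 100) deletes segment 2 and truncates
segment 1 to 100 bytes, while BufFileTruncateShared(file, 1, 0) deletes
both segments 2 and 1. A hypothetical caller that rewinds a shared file
to a remembered position (saved_fileno/saved_offset are illustrative
names) would look like:

    /* open read-write, which the new mode argument now allows */
    BufFile    *file = BufFileOpenShared(fileset, name, O_RDWR);

    /* ... write some data, remember a position, write more ... */

    /* discard everything after the remembered position */
    BufFileTruncateShared(file, saved_fileno, saved_offset);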
---
 src/backend/storage/file/buffile.c        | 53 ++++++++++++++++++++---
 src/backend/storage/file/fd.c             | 10 ++---
 src/backend/storage/file/sharedfileset.c  |  7 +--
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  3 +-
 8 files changed, 66 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349b69..be16cf7e36 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -277,7 +277,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +301,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +321,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY);
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -666,11 +666,15 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The size of the last file gives us the end offset of that
+			 * file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +842,40 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over the files, from the last one down to the given fileno. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files beyond the given fileno can simply be deleted.  The fileno
+		 * file itself can also be deleted if the offset is 0, unless it is
+		 * the first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			SharedFileSetDelete(file->fileset, segment_name, true);
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			FileTruncate(file->files[i], offset, WAIT_EVENT_BUFFILE_READ);
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 7dc6dd2f15..10591fee18 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1741,18 +1741,18 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are opened according to the given mode
+ * (read-only or read-write), and are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index f7206c9175..4b39d91320 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -68,7 +68,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
 }
 
 /*
@@ -131,13 +132,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59c50..788815cdab 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a4303b..b83fb50dac 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752bab0d..fc34c49522 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d7df..e209f047e8 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf077e5..b2f4ba4bd8 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,7 +37,8 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
-- 
2.23.0

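To illustrate the extended interface, here is a minimal usage sketch
(hypothetical caller: fileset is an existing SharedFileSet, data/len and
fileno/offset are caller state, and the file name is made up):

#include <fcntl.h>

/* reopen a shared BufFile for writing; previously this was always O_RDONLY */
BufFile    *file = BufFileOpenShared(fileset, "xid-513-changes", O_RDWR);

/* position at the current end of the file before appending more changes */
if (BufFileSeek(file, 0, 0, SEEK_END) != 0)
	elog(ERROR, "could not seek to end of temporary file");

BufFileWrite(file, data, len);

/* on subtransaction abort, discard everything past (fileno, offset) */
BufFileTruncateShared(file, fileno, offset);

This is the pattern the streaming apply worker needs: append changes to a
per-transaction file as they arrive, and truncate the file back on a
subtransaction abort.
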
v28/v28-0010-Add-TAP-test-for-streaming-vs.-DDL.patch
From 6f2150e4fbf087f5b0084b984909d5634f27e5f7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v28 10/14] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v28/v28-0003-Extend-the-output-plugin-API-with-stream-methods.patch
From cd1a20719d767ad898a0bc69f3ed960973339662 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v28 03/14] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 213 +++++++++++++
 src/backend/replication/logical/logical.c | 365 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 811 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93cf6b..50cfd6fa47 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +869,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and one optional callback
+    (<function>stream_message_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to the spill-to-disk behavior, streaming is triggered when the
+    total amount of changes decoded from the WAL (for all in-progress
+    transactions) exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting.  At that point the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.  However, in some
+    cases we still have to spill to disk even if streaming is enabled,
+    because crossing the memory limit does not guarantee that a complete
+    tuple has been decoded (e.g. the TOAST table insert may have been
+    decoded but not yet the main table insert).
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be3b0..26d461effb 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, the change/commit/abort/start/stop callbacks
+	 * are required; the message and truncate callbacks are optional, as
+	 * for regular output plugins. We consider streaming enabled when at
+	 * least one of the methods is defined, to easily detect missing ones.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional, so
+	 * we do not fail with an ERROR when they are missing; the wrappers
+	 * simply do nothing. We must still set all the ReorderBuffer
+	 * callbacks, otherwise the calls from there would crash (we don't
+	 * want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,321 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to the remote node from an
+ * in-progress transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to the remote node from an
+ * in-progress transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if the transaction is streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from an in-progress
+ * transaction to a remote node (may be called repeatedly, if the
+ * transaction is streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9cd645d0ec..24b4dd65d6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -356,6 +356,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
 struct ReorderBuffer
 {
 	/*
@@ -394,6 +442,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0

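To make the required/optional split concrete, here is a sketch of how an
output plugin wires up the new callbacks in its _PG_output_plugin_init (the
my_* handlers are hypothetical; compare the test_decoding changes above):

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* existing (non-streaming) callbacks */
	cb->begin_cb = my_begin;
	cb->change_cb = my_change;
	cb->commit_cb = my_commit;

	/* streaming: these five are required once any stream callback is set */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_change_cb = my_stream_change;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;

	/* optional: stream_message_cb and stream_truncate_cb may stay NULL */
}
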
v28/v28-0008-Add-support-for-streaming-to-built-in-replicatio.patch
From c16b8321fef712705b3331f818dc41eacfd33f11 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 15:34:29 +0530
Subject: [PATCH v28 08/14] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information
(e.g. the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so there is nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   11 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1012 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 +++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 20 files changed, 2019 insertions(+), 40 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index c24ace14d1..d8de56c928 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 5bbc165f70..c25b7c5962 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026187..9065a1be1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c022597bc0..a55ccc0c03 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4138,6 +4138,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..83d0642cf3 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
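+/*
+ * Write STREAM START to the output stream.
+ */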
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
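+/*
+ * Read STREAM START from the input stream, returning the toplevel XID and
+ * setting *first_segment to whether this is the first segment for the xact.
+ */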
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
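+/*
+ * Write STREAM END to the output stream.
+ */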
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
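+/*
+ * Write STREAM COMMIT of a (previously streamed) transaction to the output
+ * stream, including the commit LSN, end LSN and commit timestamp.
+ */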
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
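+/*
+ * Read STREAM COMMIT from the input stream, filling in commit_data and
+ * returning the toplevel XID.
+ */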
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
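+/*
+ * Write STREAM ABORT to the output stream. For an abort of the toplevel
+ * transaction, subxid is the same as the toplevel XID.
+ */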
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
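+/*
+ * Read STREAM ABORT from the input stream.
+ */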
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..d2d9469999 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, the apply logic has to handle
+ * aborts of both the toplevel transaction and subtransactions. This is
+ * achieved by tracking the file offset of each subtransaction's first
+ * change, which is then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
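+ * For example (assuming the default temp tablespace), the changes of a
+ * toplevel transaction with XID 1234, applied for subscription 16384 by a
+ * worker with PID 9999, would be spooled into pgsql_tmp9999-16384-1234.changes,
+ * with subxact offsets kept in a matching .subxacts file. (The XID, OID and
+ * PID values here are hypothetical.)
+ *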
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -100,6 +124,7 @@ typedef struct SlotErrCallbackArg
 } SlotErrCallbackArg;
 
 static MemoryContext ApplyMessageContext = NULL;
+static MemoryContext LogicalStreamingContext = NULL;
 MemoryContext ApplyContext = NULL;
 
 WalReceiverConn *wrconn = NULL;
@@ -110,12 +135,58 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;						/* XID of the subxact */
+	off_t		offset;					/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of XIDs of transactions whose changes we have spooled to files.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because apply_handle_stream_commit calls it */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +258,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the change to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +659,326 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify the apply handlers we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the subxact info serialized
+	 * at the previous stream stop.
+	 *
+	 * XXX Note that the cleanup of stale files is performed by
+	 * stream_open_file.
+	 */
+	if (!first_segment)
+	{
+		MemoryContext oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+
+		/* Read the subxacts info in per-stream context. */
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+		MemoryContextSwitchTo(oldctx);
+	}
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Serialize information about subxacts for the toplevel transaction,
+	 * then close the file with the serialized changes.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		int			fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty subtransaction, we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
+		if (fd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+		}
+
+		/* OK, truncate the file at the right offset. */
+		if (ftruncate(fd, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+		CloseTransientFile(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +992,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1010,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1049,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1167,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1312,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1685,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1826,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1938,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1970,17 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode
+	 * is enabled.  The context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													 ALLOCSET_DEFAULT_SIZES);
+
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2429,529 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Let the caller decide which memory context this gets allocated in: at
+	 * stream start it should be the LogicalStreamingContext (which is reset
+	 * on stream stop), while during stream abort the memory is only needed
+	 * short-term, so ApplyMessageContext is used.
+	 */
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd >= 0);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen
+	 * in the last call, so make sure to ignore it (this change simply comes
+	 * later in the file, so there's nothing to track).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - don't report an error for a missing file if the flag is true.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by moving the last element into its place. The array is
+	 * bound to be fairly small (the maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions), so simply loop
+	 * through the array and find the index of the XID.
+	 *
+	 * Notice we also call this from stream_open_file for the first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry from the array into the vacated place. We don't
+	 * keep the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		/* Need to allocate this in permanent context */
+		oldcxt = MemoryContextSwitchTo(ApplyContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * Each change is serialized with a length (covering the action code and the
+ * message contents, but not the length field itself), an action code
+ * (identifying the message type) and the message contents (without the
+ * subxact TransactionId value).
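+ *
+ * The resulting on-disk record is thus (in native byte order):
+ *
+ *	int		len		size of action + data (the length field itself excluded)
+ *	char	action	message type ('I', 'U', 'D', 'T', 'R' or 'Y')
+ *	...				remaining message contents, without the subxact XID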
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3117,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3118..1509f9b826 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit time,
+ * and with streamed transactions the commit order may be different from the
+ * order the transactions are sent in.  Also, the (sub)transactions might get
+ * aborted, so we need to send the schema for each (sub)transaction so that
+ * we don't lose the schema information on abort.  For handling this, we
+ * maintain a list of XIDs (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +373,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may not be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +423,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +465,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +486,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +518,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +562,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +582,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +607,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +639,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +719,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
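+/*
+ * Notify downstream we're starting to stream a chunk of a transaction, and
+ * whether this is the first chunk for the transaction.
+ */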
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're now streaming a chunk of a transaction */
+	in_streaming = true;
+}
+
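+/*
+ * Notify downstream we've reached the end of the current streamed chunk.
+ */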
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +840,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * Check whether the schema was already sent for this relation in the given
+ * streamed transaction. We expect a relatively small number of streamed
+ * transactions, so a simple linear search of the list is fine.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Remember that we have already sent the schema of the relation in the
+ * given streamed transaction, by adding its XID to the rel sync entry.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -753,11 +1002,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 06e4955de7..5f74ca1eed 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -157,6 +157,12 @@ create_logical_replication_slot(char *name, char *plugin,
 											   .segment_close = wal_segment_close),
 									NULL, NULL, NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d0c0674848..ffc3d50081 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..6352ff945a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -981,7 +981,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ac1acbb27e..95132062c6 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction containing DDL, subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check changes from rolled-back subtransactions are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0
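
The catalog change above can be sanity-checked with plain SQL once the
patch is applied (illustrative only; the subscription-level syntax for
toggling streaming is not shown in this excerpt):

    SELECT subname, subenabled, substream
    FROM pg_subscription;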

v28/v28-0007-Track-statistics-for-streaming.patch

From ed1ae17c2d4a92d01f80a94b6fe8e127120496c9 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 15:26:18 +0530
Subject: [PATCH v28 07/14] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 33 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 12 +++++++
 src/backend/replication/walsender.c           | 32 +++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dfa9d0d641..8a40639e39 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2498,6 +2498,39 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Amount of decoded transaction data spilled to disk.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_txns</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of in-progress transactions streamed to the subscriber after
+       the memory used by logical decoding exceeds
+       <literal>logical_decoding_work_mem</literal>.  Streaming only works
+       with toplevel transactions (subtransactions can't be streamed
+       independently), so the counter is not incremented for subtransactions.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times in-progress transactions were streamed to the subscriber.
+       Transactions may get streamed repeatedly, and this counter gets incremented
+       on every such invocation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_bytes</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Amount of decoded in-progress transaction data streamed to the subscriber.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5314e9348f..3db900d2e6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index a5cb827b18..c97d758d09 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -348,6 +348,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3526,6 +3530,14 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->snapshot_now = NULL;
 	}
 
+	/* Update the stream statistics. */
+	rb->streamCount += 1;
+	rb->streamBytes += (rbtxn_has_incomplete_tuple(txn)) ?
+						txn->complete_size : txn->total_size;
+
+	/* Don't count the transaction again if it has already been streamed. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	/*
 	 * Access the main routine to decode the changes and send to output plugin.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e2477c47e0..d0c0674848 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1349,7 +1349,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1370,7 +1370,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that were spilled to disk or
+	 * streamed to the subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2421,6 +2422,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3256,7 +3260,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3314,6 +3318,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3339,6 +3346,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3441,6 +3451,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3683,11 +3698,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 {
 	ReorderBuffer *rb = ctx->reorder;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockAcquire(&MyWalSnd->mutex);
 	MyWalSnd->spillTxns = rb->spillTxns;
+
+	/*
+	 * Update the streaming statistics under the same mutex as the spill
+	 * statistics, so that readers see a consistent snapshot.
+	 */
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 61f2c2f5b4..7869f721da 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6e3b24a801..7b6b08d058 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -546,15 +546,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b813e32215..cf22f8a038 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0
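
With this patch applied, the new counters show up next to the existing
spill statistics. An illustrative query against the publisher, while a
walsender is active:

    SELECT application_name,
           spill_txns, spill_count, spill_bytes,
           stream_txns, stream_count, stream_bytes
    FROM pg_stat_replication;

Non-zero stream_* values mean that in-progress transactions exceeded
logical_decoding_work_mem and were streamed to the subscriber instead of
accumulating in memory.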

v28/v28-0002-Issue-individual-invalidations-with-wal_level-lo.patch

From b9bc13cd3e2a37005eeb827f8fbc0d9f7d6263c7 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v28 02/14] Issue individual invalidations with
 wal_level=logical.

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record type
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of the commit record, or executed immediately during decoding
and not added to the reorderbuffer at all).

LogStandbyInvalidations accumulates all the invalidations in memory
and writes them only once at commit time, which may reduce the
performance impact by amortizing the overhead and deduplicating the
invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c        | 40 +++++++++++++
 src/backend/access/transam/xact.c             |  7 +++
 src/backend/replication/logical/decode.c      | 60 +++++++++++--------
 .../replication/logical/reorderbuffer.c       | 54 +++++++++++++----
 src/backend/utils/cache/inval.c               | 57 ++++++++++++++++++
 src/include/access/xact.h                     | 13 +++-
 src/include/replication/reorderbuffer.h       | 11 ++++
 7 files changed, 206 insertions(+), 36 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce75565f..404d988625 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+			appendStringInfo(buf, " snapshot %u", msg->sn.relId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 04fd5ca870..72efa3c1b3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6020,6 +6020,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371739..3cbbf589ed 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * If the invalidations belong to a transaction, append them
+				 * to that transaction's invalidation list.  Otherwise,
+				 * execute them immediately (xid-less transaction).
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
-
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We now WAL-log command-level invalidations as
+			 * XLOG_XACT_INVALIDATIONS, so there is no need to handle
+			 * XLOG_INVALIDATIONS here again.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 642a1c767f..364a5bba6d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -860,6 +860,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -1824,7 +1827,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 					break;
-
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
 		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
 										   txn->invalidations);
-	else
-		Assert(txn->ninvalidations == 0);
 
 	/* remove potential on-disk data, and deallocate */
 	ReorderBufferCleanupTXN(rb, txn);
@@ -2216,17 +2216,40 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that
+	 * we can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/*
+	 * If there are no invalidations for this transaction yet, allocate the
+	 * array and copy the messages.  Otherwise, enlarge the array and append
+	 * the new messages to the existing ones.
+	 */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2254,6 +2277,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark the top-level transaction as having catalog changes too if one of
+	 * its children has, so that ReorderBufferBuildTupleCidHash can check just
+	 * the top-level transaction to decide whether to build the hash table.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..d81999747a 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,10 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end
+ *	to support decoding of in-progress transactions.  Earlier it was enough
+ *	to log invalidations only at commit time, because we only decoded the
+ *	transaction at commit.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +108,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +215,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1094,6 +1101,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1513,48 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+static void
+LogLogicalInvalidations()
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+										MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+							nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 22bb96ca2a..3f3e137531 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..9cd645d0ec 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -149,6 +149,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * messages */
+		}			inval;
 	}			data;
 
 	/*
@@ -220,6 +228,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
-- 
2.23.0
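
To make the effect of this patch concrete, consider an illustrative
transaction mixing DDL and DML (assuming wal_level=logical and a table
like the test_tab used in the TAP tests). Each invalidating command now
emits an XLOG_XACT_INVALIDATIONS record at command end, while the commit
record continues to carry the full set of invalidations as before:

    BEGIN;
    ALTER TABLE test_tab ADD COLUMN c int;   -- XLOG_XACT_INVALIDATIONS at command end
    INSERT INTO test_tab VALUES (10, 'x', 1);
    ALTER TABLE test_tab ADD COLUMN d int;   -- another XLOG_XACT_INVALIDATIONS record
    COMMIT;                                  -- commit record still includes all invalidations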

#374Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#357)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think one of the usages we still need is in ReorderBufferForget
because it can be called when we skip processing the txn. See the
comments in DecodeCommit where we call this function. If I am
correct, we probably need to collect all invalidations in
ReorderBufferTXN, as we are collecting tuplecids, and use them here.
We can do the same during processing of XLOG_XACT_INVALIDATIONS.

One more point related to this is that after this patch series, we
need to consider executing all invalidation during transaction abort.
Because it is possible that due to memory overflow, we have processed
some of the messages which also contain a few XACT_INVALIDATION
messages, so to avoid cache pollution, we need to execute all of them
in abort. We also do the similar thing in Rollback/Rollback To
Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval.

Yes, we need to do that. So now we are collecting all the
invalidations under txn->invalidations, so they get executed on
abort.
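
For example (illustrative SQL only, not part of the patch), this is the
kind of transaction that makes abort-time execution necessary:

    BEGIN;
    ALTER TABLE test_tab ADD COLUMN c int;  -- queues catcache/relcache invalidations
    -- large enough to exceed logical_decoding_work_mem, so some changes
    -- (and invalidation messages) are processed before the abort
    INSERT INTO test_tab SELECT i, md5(i::text), -i
      FROM generate_series(10000, 100000) s(i);
    ROLLBACK;  -- the decoder must execute the collected invalidations
               -- here to avoid polluting the caches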

Few other comments on
0002-Issue-individual-invalidations-with-wal_level-lo.patch
---------------------------------------------------------------------------------------------------------------
1.
+ if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+ {
+ ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+ MakeSharedInvalidMessagesArray);
+ invalMessages = SharedInvalidMessagesArray;
+ nmsgs  = numSharedInvalidMessagesArray;
+ SharedInvalidMessagesArray = NULL;
+ numSharedInvalidMessagesArray = 0;

a. Immediately after ProcessInvalidationMessagesMulti, isn't it better
to have an Assertion like Assert(!(numSharedInvalidMessagesArray > 0
&& SharedInvalidMessagesArray == NULL));?

Done

b. Why check "if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)" is
required? If you see xactGetCommittedInvalidationMessages where we do
something similar, we only check for valid value of transInvalInfo and
here we check the same in the caller of LogLogicalInvalidations, isn't
that sufficient? If that is sufficient, we can either have the same
check here or have an Assert for the same.

I have put the same check here.

2.
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
if (transInvalInfo == NULL)
return;

+ if (XLogLogicalInfoActive())
+ LogLogicalInvalidations();
+
ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
LocalExecuteInvalidationMessage);
Generally, we WAL-log an action after performing it, but here you are
writing WAL first. Is there any specific reason? If so, can we write
a comment about the same?

Yeah, there is no reason for the same so moved it down.
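
So the resulting order in CommandEndInvalidationMessages is (a sketch
based on the quoted hunk, with the logging moved after the processing):

	if (transInvalInfo == NULL)
		return;

	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
								LocalExecuteInvalidationMessage);

	/* WAL-log the per-command invalidations for wal_level=logical. */
	if (XLogLogicalInfoActive())
		LogLogicalInvalidations();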

3.
+ * When wal_level=logical, write invalidations into WAL at each command end to
+ * support the decoding of the in-progress transaction.  As of now it was
+ * enough to log invalidation only at commit because we are only decoding the
+ * transaction at the commit time.   We only need to log the catalog cache and
+ * relcache invalidation.  There can not be any active MVCC scan in logical
+ * decoding so we don't need to log the snapshot invalidation.

I think this comment doesn't hold good after we have changed the patch
to LOG invalidations at the time of CCI.

Right, modified.

4.
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations()

Add the function name atop this function's comment to match the
style of other nearby functions. How about modifying it to
something like: "Emit WAL for invalidations. This is currently only
used for logging invalidations at the command end."

Done

5.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */

I don't think we need to do anything about relcacheInitFileInval.
This is used to remove the stale files (RELCACHE_INIT_FILENAME) that
have obsolete information about relcache. The walsender process that
is doing decoding doesn't require us to do anything about this. Also,
if you see before this patch, we don't do anything about relcache
files during decoding of invalidation messages. In short, I think we
can remove this comment unless you see some use of it.

Now, we have removed the Invalidation change itself so this comment is gone.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#375Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#367)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jun 15, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 15, 2020 at 9:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Jun 12, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Basically, this part I still
have to work on; once we get the consensus I can remove those
extra wait events from the patch.

Okay, feel free to send an updated patch with the above change.

Sure, I will do that in the next patch set.

I have a few more comments on the patch
0013-Change-buffile-interface-required-for-streaming-.patch:

1.
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are read-only if the flag is set and are
+ * automatically closed at the end of the transaction but are not deleted on
+ * close.
*/
File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)

No need to say "are read-only if the flag is set". I don't see any
flag passed to the function, so that part of the comment doesn't seem
appropriate.

Done

2.
@@ -68,7 +68,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
}

/* Register our cleanup callback. */
- on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ if (seg)
+ on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
}

Add comments atop function to explain when we don't want to register
the dsm detach stuff?

Done. I am planning to work on a cleaner function for on_proc_exit,
as we discussed offlist. I will work on this in the next version.
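
For reference, the guarded registration in SharedFileSetInit, with a
comment along the lines the review asks for (a sketch, not the final
wording):

	/*
	 * Register our cleanup callback only when a DSM segment is
	 * provided.  For backend-local filesets (seg == NULL) the caller is
	 * responsible for the cleanup, e.g. from an on_proc_exit callback.
	 */
	if (seg)
		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));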

3.
+ */
+ newFile = file->numFiles - 1;
+ newOffset = FileSize(file->files[file->numFiles - 1]);
break;

FileSize can return negative lengths to indicate failure which we
should handle.

Done
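
A sketch of the SEEK_END branch with the failure check added (the
error message wording is illustrative, following other FileSize
callers):

	newFile = file->numFiles - 1;
	newOffset = FileSize(file->files[file->numFiles - 1]);
	if (newOffset < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not determine size of temporary file \"%s\": %m",
						FilePathName(file->files[file->numFiles - 1]))));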

See other places in the code where FileSize is used?

But I have another question here, which is why we need to implement
SEEK_END. How do other usages of the BufFile interface take care of
this? I see an API BufFileTell which can give the current read/write
location in the file; isn't that sufficient for your usage? Also, how
was this handled in the patch before the BufFile usage?

So far we have never supported opening the file in write mode; we only
create it in write mode. So while we have created the file and it is
still open, we can always use BufFileTell, which tells us the current
end location of the file. But once we close and reopen it, the position
is always set to the start of the file, as per the current use case. We
need a way to jump to the end of the last file in order to append to it.
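
To illustrate the difference, a small sketch (assuming the BufFileSeek
SEEK_END support added in 0013; stream_fd is the worker's shared
BufFile):

	int			fileno;
	off_t		offset;

	/* While the file created by BufFileCreateShared is still open, the
	 * current position is its end, so BufFileTell is sufficient. */
	BufFileTell(stream_fd, &fileno, &offset);

	/* After reopening with BufFileOpenShared, the position is the start
	 * of the file, so to append we must explicitly seek to the end. */
	if (BufFileSeek(stream_fd, 0, 0, SEEK_END) != 0)
		elog(ERROR, "could not seek to end of streaming file");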

4.
+ /* Loop over all the  files upto the fileno which we want to truncate. */
+ for (i = file->numFiles - 1; i >= fileno; i--)

"the files", extra space in the above part of the comment.

Fixed

5.
+ /*
+ * Except the fileno,  we can directly delete other files.

Before 'we', there is extra space.

Done.

6.
+ else
+ {
+ FileTruncate(file->files[i], offset, WAIT_EVENT_BUFFILE_READ);
+ newOffset = offset;
+ }

The wait event passed here doesn't seem to be appropriate. You might
want to introduce a new wait event WAIT_EVENT_BUFFILE_TRUNCATE. Also,
the error handling for FileTruncate is missing.

Done
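
With the new wait event and the error check, the truncate call would
look roughly like this (WAIT_EVENT_BUFFILE_TRUNCATE is the name
suggested above, so treat it as an assumption):

	else
	{
		if (FileTruncate(file->files[i], offset,
						 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not truncate file \"%s\": %m",
							FilePathName(file->files[i]))));
		newOffset = offset;
	}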

7.
+ if ((i != fileno || offset == 0) && fileno != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ SharedFileSetDelete(file->fileset, segment_name, true);
+ newFile--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+ }

Similar to the previous comment, I think we should handle the failure
of SharedFileSetDelete.

8. I think the comments related to BufFile shared API usage need to be
expanded in the code to explain the new usage. For ex., see the below
comments atop buffile.c
* BufFile supports temporary files that can be made read-only and shared with
* other backends, as infrastructure for parallel execution. Such files need
* to be created as a member of a SharedFileSet that all participants are
* attached to.

Other fixes (raised offlist by my colleague Neha Sharma):
1. In BufFileTruncateShared, the files were not closed before being
deleted. (0013)
2. In apply_handle_stream_commit, the file name in the debug message
was printed before the name was populated. (0014)
3. On concurrent abort we were truncating all the changes, including
some incomplete ones, so later when we got the complete changes we no
longer had the earlier ones. E.g., if we had a specinsert in the last
stream and deleted that change on concurrent abort detection, we would
later get a spec_confirm without the spec_insert. We could have simply
avoided deleting all the changes, but I think the better fix is that
once we detect the concurrent abort for a transaction, there is no need
to collect its changes at all, so we simply avoid that. I have put in
that fix; see the sketch below. (0006)
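
The core of that 0006 fix is to remember the concurrent abort on the
transaction and bail out early when queuing further changes; it mirrors
this hunk from the attached patch:

	/*
	 * While streaming the last changes we have detected that the
	 * transaction is aborted.  So there is no point in collecting
	 * further changes for it.
	 */
	if (txn->concurrent_abort)
		return;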

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v29.tar (application/x-tar)
v29/v29-0014-Worker-tempfile-use-the-shared-buffile-infrastru.patch:

From 7f4f7cf5a08d44e83b0165854e173cfe8a84b329 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 16:42:07 +0530
Subject: [PATCH v29 14/14] Worker tempfile use the shared buffile
 infrastructure

To be merged with 0008; kept separate to make the review easier.
---
 src/backend/replication/logical/worker.c | 621 ++++++++++-------------
 1 file changed, 275 insertions(+), 346 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index d2d9469999..14e057cfff 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -56,6 +56,7 @@
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -85,6 +86,7 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -123,10 +125,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see an xid we create this entry in the
+ * xidhash, and we also create the streaming file and store the fileset
+ * handle, so that on a subsequent stream for the same xid we can look up the
+ * entry in the hash and get the fileset handle.  The subxact file is created
+ * iff there is any subxact info under this xid.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
-static MemoryContext LogicalStreamingContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -136,15 +154,26 @@ bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
 /* fields valid only when processing streamed transaction */
-bool	in_streamed_transaction = false;
+bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
-static int	stream_fd = -1;
+
+/*
+ * Hash table for storing the streaming xid information along with the shared
+ * file sets for the streaming and subxact files.  On every stream start we
+ * need to open the xid's files, and for that we need the shared file set
+ * handle, so storing it in the xid hash makes the lookup faster.
+ */
+static HTAB *xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
 
 typedef struct SubXactInfo
 {
-	TransactionId xid;						/* XID of the subxact */
-	off_t           offset;					/* offset in the file */
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
 } SubXactInfo;
 
 static uint32 nsubxacts = 0;
@@ -171,13 +200,6 @@ static void stream_open_file(Oid subid, TransactionId xid, bool first);
 static void stream_write_change(char action, StringInfo s);
 static void stream_close_file(void);
 
-/*
- * Array of serialized XIDs.
- */
-static int	nxids = 0;
-static int	maxnxids = 0;
-static TransactionId	*xids = NULL;
-
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
@@ -275,7 +297,7 @@ handle_streamed_transaction(const char action, StringInfo s)
 	if (!in_streamed_transaction)
 		return false;
 
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 	Assert(TransactionIdIsValid(stream_xid));
 
 	/*
@@ -666,31 +688,39 @@ static void
 apply_handle_stream_start(StringInfo s)
 {
 	bool		first_segment;
+	HASHCTL		hash_ctl;
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the buffile,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
 	/* notify handle methods we're processing a remote transaction */
 	in_streamed_transaction = true;
 
 	/* extract XID of the top-level transaction */
 	stream_xid = logicalrep_read_stream_start(s, &first_segment);
 
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
 	/* open the spool file for this transaction */
 	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
 
-	/*
-	 * if this is not the first segment, open existing file
-	 *
-	 * XXX Note that the cleanup is performed by stream_open_file.
-	 */
+	/* if this is not the first segment, open existing file */
 	if (!first_segment)
-	{
-		MemoryContext oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
-
-		/* Read the subxacts info in per-stream context. */
 		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
-		MemoryContextSwitchTo(oldctx);
-	}
 
 	pgstat_report_activity(STATE_RUNNING, NULL);
 }
@@ -710,6 +740,12 @@ apply_handle_stream_stop(StringInfo s)
 	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
 	stream_close_file();
 
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
 	in_streamed_transaction = false;
 
 	/* Reset per-stream context */
@@ -736,10 +772,7 @@ apply_handle_stream_abort(StringInfo s)
 	 * just delete the files with serialized info.
 	 */
 	if (xid == subxid)
-	{
 		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
-		return;
-	}
 	else
 	{
 		/*
@@ -761,11 +794,13 @@ apply_handle_stream_abort(StringInfo s)
 
 		int64		i;
 		int64		subidx;
-		int			fd;
+		BufFile    *fd;
 		bool		found = false;
 		char		path[MAXPGPATH];
+		StreamXidHash *ent;
 
 		subidx = -1;
+		ensure_transaction();
 		subxact_info_read(MyLogicalRepWorker->subid, xid);
 
 		/* XXX optimize the search by bsearch on sorted data */
@@ -787,33 +822,32 @@ apply_handle_stream_abort(StringInfo s)
 		{
 			/* Cleanup the subxact info */
 			cleanup_subxact_info();
+			CommitTransactionCommand();
 			return;
 		}
 
 		Assert((subidx >= 0) && (subidx < nsubxacts));
 
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
 		changes_filename(path, MyLogicalRepWorker->subid, xid);
-		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
-		if (fd < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not open file \"%s\": %m",
-							path)));
-		}
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
 
-		/* OK, truncate the file at the right offset. */
-		if (ftruncate(fd, subxacts[subidx].offset))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not truncate file \"%s\": %m", path)));
-		CloseTransientFile(fd);
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
 
 		/* discard the subxacts added later */
 		nsubxacts = subidx;
 
 		/* write the updated subxact list */
 		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
 	}
 }
 
@@ -823,16 +857,16 @@ apply_handle_stream_abort(StringInfo s)
 static void
 apply_handle_stream_commit(StringInfo s)
 {
-	int			fd;
 	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
-
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
+	bool		found;
 	LogicalRepCommitData commit_data;
-
+	StreamXidHash *ent;
 	MemoryContext oldcxt;
+	BufFile    *fd;
 
 	Assert(!in_streamed_transaction);
 
@@ -840,25 +874,20 @@ apply_handle_stream_commit(StringInfo s)
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
 
-	/* open the spool file for the committed transaction */
-	changes_filename(path, MyLogicalRepWorker->subid, xid);
-
-	elog(DEBUG1, "replaying changes from file '%s'", path);
-
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-	}
-
 	ensure_transaction();
-
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	buffer = palloc(8192);
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
 	initStringInfo(&s2);
 
 	MemoryContextSwitchTo(oldcxt);
@@ -881,9 +910,7 @@ apply_handle_stream_commit(StringInfo s)
 		int			len;
 
 		/* read length of the on-disk record */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		nbytes = read(fd, &len, sizeof(len));
-		pgstat_report_wait_end();
+		nbytes = BufFileRead(fd, &len, sizeof(len));
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -891,16 +918,9 @@ apply_handle_stream_commit(StringInfo s)
 
 		/* do we have a correct length? */
 		if (nbytes != sizeof(len))
-		{
-			int			save_errno = errno;
-
-			CloseTransientFile(fd);
-			errno = save_errno;
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not read file: %m")));
-			return;
-		}
+					 errmsg("could not read from streaming transaction's changes file: %m")));
 
 		Assert(len > 0);
 
@@ -908,19 +928,10 @@ apply_handle_stream_commit(StringInfo s)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		if (read(fd, buffer, len) != len)
-		{
-			int			save_errno = errno;
-
-			CloseTransientFile(fd);
-			errno = save_errno;
+		if (BufFileRead(fd, buffer, len) != len)
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not read file: %m")));
-			return;
-		}
-		pgstat_report_wait_end();
+					 errmsg("could not read from streaming transaction's changes file: %m")));
 
 		/* copy the buffer to the stringinfo and call apply_dispatch */
 		resetStringInfo(&s2);
@@ -948,15 +959,11 @@ apply_handle_stream_commit(StringInfo s)
 		 */
 		send_feedback(InvalidXLogRecPtr, false, false);
 	}
-
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	BufFileClose(fd);
 
 	/*
-	 * Update origin state so we can restart streaming from correct
-	 * position in case of crash.
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
 	 */
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
@@ -1946,12 +1953,39 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 static void
 worker_onexit(int code, Datum arg)
 {
-	int	i;
+	HASH_SEQ_STATUS status;
+	StreamXidHash *ent;
+	char		path[MAXPGPATH];
+
+	/* nothing to clean */
+	if (xidhash == NULL)
+		return;
+
+	/*
+	 * Scan complete hash and delete the underlying files for the xids.
+	 * Also release the memory for the shared file sets.
+	 */
+	hash_seq_init(&status, xidhash);
+	while ((ent = (StreamXidHash *) hash_seq_search(&status)) != NULL)
+	{
+		changes_filename(path, MyLogicalRepWorker->subid, ent->xid);
+		BufFileDeleteShared(ent->stream_fileset, path);
+		pfree(ent->stream_fileset);
 
-	elog(LOG, "cleanup files for %d transactions", nxids);
+		/*
+		 * We might not have created the subxact fileset if there is no sub
+		 * transaction.
+		 */
+		if (ent->subxact_fileset)
+		{
+			subxact_filename(path, MyLogicalRepWorker->subid, ent->xid);
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+		}
+	}
 
-	for (i = nxids-1; i >= 0; i--)
-		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+	/* Remove the xid hash */
+	hash_destroy(xidhash);
 }
 
 /*
@@ -1972,11 +2006,11 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 
 	/*
 	 * This memory context used for per stream data when streaming mode is
-	 * enabled.  This context is reeset on each stream stop.
+	 * enabled.  This context is reset on each stream stop.
 	 */
 	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
 													"LogicalStreamingContext",
-													 ALLOCSET_DEFAULT_SIZES);
+													ALLOCSET_DEFAULT_SIZES);
 
 	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
 	before_shmem_exit(worker_onexit, (Datum) 0);
@@ -2085,7 +2119,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -2441,64 +2475,62 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 static void
 subxact_info_write(Oid subid, TransactionId xid)
 {
-	int			fd;
 	char		path[MAXPGPATH];
+	bool		found;
 	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
 
 	Assert(TransactionIdIsValid(xid));
 
 	subxact_filename(path, subid, xid);
 
-	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	len = sizeof(SubXactInfo) * nsubxacts;
-
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top transaction by this time */
+	Assert(found);
 
-	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	/*
+	 * If there is no subtransaction then there is nothing to do, but if we
+	 * already have a subxact file then delete it.
+	 */
+	if (nsubxacts == 0)
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
 		return;
 	}
 
-	if ((len > 0) && (write(fd, subxacts, len) != len))
+	/*
+	 * Create the subxact file if it is not already created, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
 	{
-		int			save_errno = errno;
+		ent->subxact_fileset =
+			MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
 
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
-		return;
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
 	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
 
-	pgstat_report_wait_end();
+	len = sizeof(SubXactInfo) * nsubxacts;
 
-	/*
-	 * We don't need to fsync or anything, as we'll recreate the files after a
-	 * crash from scratch. So just close the file.
-	 */
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
 
 	/*
 	 * But we free the memory allocated for subxact info. There might be one
@@ -2513,50 +2545,45 @@ subxact_info_write(Oid subid, TransactionId xid)
  *	  Restore information about subxacts of a streamed transaction.
  *
  * Read information about subxacts into the global variables.
- *
- * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
  */
 static void
 subxact_info_read(Oid subid, TransactionId xid)
 {
-	int			fd;
 	char		path[MAXPGPATH];
+	bool		found;
 	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
 
 	Assert(TransactionIdIsValid(xid));
 	Assert(!subxacts);
 	Assert(nsubxacts == 0);
 	Assert(nsubxacts_max == 0);
 
-	subxact_filename(path, subid, xid);
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
 
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
+	/* We must have created the entry at the first stream of this xid. */
+	Assert(found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
 		return;
-	}
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+	subxact_filename(path, subid, xid);
 
-	/* read number of subxact items */
-	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
-	{
-		int			save_errno = errno;
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
 
-		CloseTransientFile(fd);
-		errno = save_errno;
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
 						path)));
-		return;
-	}
-
-	pgstat_report_wait_end();
 
 	len = sizeof(SubXactInfo) * nsubxacts;
 
@@ -2564,35 +2591,23 @@ subxact_info_read(Oid subid, TransactionId xid)
 	nsubxacts_max = 1 << my_log2(nsubxacts);
 
 	/*
-	 * Let the caller decide which memory context it will be allocated.
-	 * Ideally, during stream start it will be allocated in the
-	 * LogicalStreamingContext which will be reset on stream stop, and
-	 * during the stream abort we need this memory only for short term so
-	 * it will be allocated in ApplyMessageContext.
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need this information for the complete stream so that we can add the
+	 * subtransaction info to it.  On stream stop we will flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
 	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
 	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
-
-	if ((len > 0) && ((read(fd, subxacts, len)) != len))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
 						path)));
-		return;
-	}
-
-	pgstat_report_wait_end();
 
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	BufFileClose(fd);
 }
 
 /*
@@ -2606,7 +2621,7 @@ subxact_info_add(TransactionId xid)
 
 	/* We must have a valid top level stream xid and a stream fd. */
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd >= 0);
+	Assert(stream_fd != NULL);
 
 	/*
 	 * If the XID matches the toplevel transaction, we don't want to add it.
@@ -2658,7 +2673,13 @@ subxact_info_add(TransactionId xid)
 	}
 
 	subxacts[nsubxacts].xid = xid;
-	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	/*
+	 * Get the current offset of the stream file and store it as the offset
+	 * of this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
 
 	nsubxacts++;
 }
@@ -2667,44 +2688,14 @@ subxact_info_add(TransactionId xid)
 static void
 subxact_filename(char *path, Oid subid, TransactionId xid)
 {
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 */
-	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m",
-						tempdirpath)));
-
-	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
-			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
 }
 
 /* format filename for file containing serialized changes */
-static void
+static inline void
 changes_filename(char *path, Oid subid, TransactionId xid)
 {
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 */
-	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m",
-						tempdirpath)));
-
-	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
-			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
 }
 
 /*
@@ -2721,60 +2712,29 @@ changes_filename(char *path, Oid subid, TransactionId xid)
 static void
 stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
 {
-	int			i;
 	char		path[MAXPGPATH];
-	bool		found = false;
+	StreamXidHash *ent;
 
-	subxact_filename(path, subid, xid);
-
-	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
 
+	/* Delete the change file and release the stream fileset memory */
 	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
 
-	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
-
-	/*
-	 * Cleanup the XID from the array - find the XID in the array and
-	 * remove it by shifting all the remaining elements. The array is
-	 * bound to be fairly small (maximum number of in-progress xacts,
-	 * so max_connections + max_prepared_transactions) so simply loop
-	 * through the array and find index of the XID. Then move the rest
-	 * of the array by one element to the left.
-	 *
-	 * Notice we also call this from stream_open_file for first segment
-	 * of each transaction, to deal with possible left-overs after a
-	 * crash, so it's entirely possible not to find the XID in the
-	 * array here. In that case we don't remove anything.
-	 *
-	 * XXX Perhaps it'd be better to handle this automatically after a
-	 * restart, instead of doing it over and over for each transaction.
-	 */
-	for (i = 0; i < nxids; i++)
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
 	{
-		if (xids[i] == xid)
-		{
-			found = true;
-			break;
-		}
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
 	}
-
-	if (!found)
-		return;
-
-	/*
-	 * Move the last entry from the array to the place. We don't keep
-	 * the streamed transactions sorted or anything - we only expect
-	 * a few of them in progress (max_connections + max_prepared_xacts)
-	 * so linear search is just fine.
-	 */
-	xids[i] = xids[nxids-1];
-	nxids--;
 }
 
 /*
@@ -2783,8 +2743,8 @@ stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
  *
  * Open a file for streamed changes from a toplevel transaction identified
  * by stream_xid (global variable). If it's the first chunk of streamed
- * changes for this transaction, perform cleanup by removing existing
- * files after a possible previous crash.
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
  *
  * This can only be called at the beginning of a "streaming" block, i.e.
  * between stream_start/stream_stop messages from the upstream.
@@ -2793,79 +2753,61 @@ static void
 stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 {
 	char		path[MAXPGPATH];
-	int			flags;
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
 
 	Assert(in_streamed_transaction);
 	Assert(OidIsValid(subid));
 	Assert(TransactionIdIsValid(xid));
-	Assert(stream_fd == -1);
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
 
 	/*
-	 * If this is the first segment for this transaction, try removing
-	 * existing files (if there are any, possibly after a crash).
+	 * Create/open the buffiles under the logical streaming context so that we
+	 * have those files until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
 	 */
 	if (first_segment)
 	{
-		MemoryContext	oldcxt;
-
-		/* XXX make sure there are no previous files for this transaction */
-		stream_cleanup_files(subid, xid, true);
-
-		/* Need to allocate this in permanent context */
-		oldcxt = MemoryContextSwitchTo(ApplyContext);
-
 		/*
-		 * We need to remember the XIDs we spilled to files, so that we can
-		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
-		 *
-		 * The number of XIDs we may need to track is fairly small, because
-		 * we can only stream toplevel xacts (so limited by max_connections
-		 * and max_prepared_transactions), and we only stream the large ones.
-		 * So we simply keep the XIDs in an unsorted array. If the number of
-		 * xacts gets large for some reason (e.g. very high max_connections),
-		 * a more elaborate approach might be better - e.g. sorted array, to
-		 * speed-up the lookups.
+		 * Shared fileset handle must be allocated in the persistent context.
 		 */
-		if (nxids == maxnxids)	/* array of XIDs is full */
-		{
-			if (!xids)
-			{
-				maxnxids = 64;
-				xids = palloc(maxnxids * sizeof(TransactionId));
-			}
-			else
-			{
-				maxnxids = 2 * maxnxids;
-				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
-			}
-		}
+		SharedFileSet *fileset =
+		MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
 
-		xids[nxids++] = xid;
+		SharedFileSetInit(fileset, NULL);
+		stream_fd = BufFileCreateShared(fileset, path);
 
-		MemoryContextSwitchTo(oldcxt);
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
 	}
-
-	changes_filename(path, subid, xid);
-
-	elog(DEBUG1, "opening file '%s' for streamed changes", path);
-
-	/*
-	 * If this is the first streamed segment, the file must not exist, so
-	 * make sure we're the ones creating it. Otherwise just open the file
-	 * for writing, in append mode.
-	 */
-	if (first_segment)
-		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
 	else
-		flags = (O_WRONLY | O_APPEND | PG_BINARY);
-
-	stream_fd = OpenTransientFile(path, flags);
-
-	if (stream_fd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
+	{
+		/*
+		 * Open the file and seek to the end of the file because we always
+		 * append to the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+	MemoryContextSwitchTo(oldcxt);
 }
 
 /*
@@ -2880,12 +2822,12 @@ stream_close_file(void)
 {
 	Assert(in_streamed_transaction);
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 
-	CloseTransientFile(stream_fd);
+	BufFileClose(stream_fd);
 
 	stream_xid = InvalidTransactionId;
-	stream_fd = -1;
+	stream_fd = NULL;
 }
 
 /*
@@ -2907,34 +2849,21 @@ stream_write_change(char action, StringInfo s)
 
 	Assert(in_streamed_transaction);
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
-
 	/* first write the size */
-	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
+	BufFileWrite(stream_fd, &len, sizeof(len));
 
 	/* then the action */
-	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
+	BufFileWrite(stream_fd, &action, sizeof(action));
 
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
-	if (write(stream_fd, &s->data[s->cursor], len) != len)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
-
-	pgstat_report_wait_end();
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
 }
 
 /*
-- 
2.23.0

v29/v29-0006-Bugfix-handling-of-incomplete-toast-spec-insert.patch:

From a122e19aca7aa3e57a20a4e429406c826c27ab42 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Wed, 17 Jun 2020 18:22:35 +0530
Subject: [PATCH v29 06/14] Bugfix handling of incomplete toast/spec insert

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 343 ++++++++++++++----
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  50 ++-
 5 files changed, 339 insertions(+), 75 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 287a185d9c..95dec05047 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3cbbf589ed..4d3c6f8f28 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 709f5f1d41..8aae7e9f76 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -237,7 +252,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool partial_truncate);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -641,17 +657,102 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 	return txn;
 }
 
+/*
+ * Handle incomplete tuples during streaming.  If streaming is enabled then
+ * we might need to stream an in-progress transaction.  The problem is that
+ * sometimes we get incomplete changes which we cannot stream until we get
+ * the complete change, e.g. a toast table insert without the main table
+ * insert.  So this function remembers the lsn of the last complete change
+ * and the size of the complete changes up to that lsn, so that if we need
+ * to stream we can stream only up to the last complete lsn.
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert, Size total_size)
+{
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * If this is the first incomplete change then remember the size of the
+	 * complete changes so far.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		(toast_insert || IsSpecInsert(change->action)))
+		toptxn->complete_size = total_size;
+
+	/*
+	 * If this is a toast insert then set the corresponding bit.  Basically,
+	 * both update and insert may insert into the toast table.  And as
+	 * explained in the function header, we cannot stream toast changes on
+	 * their own.  So whenever we get a toast insert we set the flag, and we
+	 * clear it whenever we get the next insert or update on the main table.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial tuple and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If we don't have any incomplete change after this change then set this
+	 * LSN as the last complete lsn.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)))
+	{
+		toptxn->last_complete_lsn = change->lsn;
+
+		/*
+		 * If the transaction is serialized and the changes are complete in
+		 * the top-level transaction, then immediately stream the transaction.
+		 * The reason for not waiting until the memory limit is hit is that in
+		 * streaming mode, if the transaction got serialized, that means we
+		 * already reached the memory limit but at that time could not stream
+		 * it due to an incomplete tuple, so now we stream it as soon as the
+		 * tuple is complete.  Also, if we don't stream the serialized changes
+		 * and we get some more incomplete changes in this transaction, we
+		 * don't have a way to partly truncate the serialized changes.
+		 */
+		if (rbtxn_is_serialized(txn))
+			ReorderBufferStreamTXN(rb, toptxn);
+	}
+}
+
 /*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
+	Size	total_size = 0;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the last changes we have detected that the transaction
+	 * is aborted.  So there is no point in collecting further changes for the
+	 * transaction.
+	 */
+	if (txn->concurrent_abort)
+		return;
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -660,9 +761,28 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * Get the total size of the top transaction before updating the size for
+	 * the current change, so that if this is an incomplete tuple we know the
+	 * size prior to this change.  That will be used for updating the size of
+	 * the complete changes in the top transaction for streaming.
+	 */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			total_size = txn->toptxn->total_size;
+		else
+			total_size = txn->total_size;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* Handle the incomplete tuple, if streaming is enabled */
+	if (ReorderBufferCanStream(rb))
+		ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert,
+										   total_size);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -692,7 +812,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1405,11 +1525,45 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 /*
  * Discard changes from a transaction (and subtransactions), after streaming
  * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ * If partial_truncate is false we completely truncate the transaction,
+ * otherwise we truncate up to last_complete_lsn.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 bool partial_truncate)
 {
 	dlist_mutable_iter iter;
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * A serialized transaction should never be partly truncated, because if
+	 * it is serialized then we stream it as soon as its changes are complete.
+	 */
+	Assert(!(rbtxn_is_serialized(txn) && partial_truncate));
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1426,7 +1580,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, partial_truncate);
 	}
 
 	/* cleanup changes in the toplevel txn */
@@ -1436,30 +1590,19 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
+		/* We have truncated up to the last complete lsn, so stop. */
+		if (partial_truncate && (change->lsn > toptxn->last_complete_lsn))
+		{
+			/* The transaction must have incomplete changes. */
+			Assert(rbtxn_has_incomplete_tuple(toptxn));
+			break;
+		}
+
 		/* remove the change from it's containing list */
 		dlist_delete(&change->node);
-
 		ReorderBufferReturnChange(rb, change);
 	}
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The toplevel transaction, identified by (toptxn==NULL), is marked
-	 * as streamed always, even if it does not contain any changes (that
-	 * is, when all the changes are in subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts
-	 * for XIDs the downstream is not aware of. And of course, it always
-	 * knows about the toplevel xact (we send the XID in all messages),
-	 * but we never stream XIDs of empty subxacts.
-	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
 	 * any memory. We could also keep the hash table and update it with
@@ -1471,9 +1614,39 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
-	/* also reset the number of entries in the transaction */
-	txn->nentries_mem = 0;
-	txn->nentries = 0;
+	/*
+	 * Adjust nentries/nentries_mem based on the changes processed.  See
+	 * comments where nprocessed is declared.
+	 */
+	if (partial_truncate)
+	{
+		txn->nentries -= txn->nprocessed;
+		txn->nentries_mem -= txn->nprocessed;
+	}
+	else
+	{
+		txn->nentries = 0;
+		txn->nentries_mem = 0;
+	}
+	txn->nprocessed = 0;
+
+	/*
+	 * If this is a top transaction then we can reset last_complete_lsn and
+	 * complete_size, because by now we would have streamed all the changes
+	 * up to last_complete_lsn.
+	 */
+	if (partial_truncate && (txn->toptxn == NULL))
+	{
+		toptxn->last_complete_lsn = InvalidXLogRecPtr;
+		toptxn->complete_size = 0;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
 }
 
 /*
@@ -1760,7 +1933,7 @@ ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
 								   ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed. */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Stop the stream. */
 	rb->stream_stop(rb, txn, last_lsn);
@@ -1792,6 +1965,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 	ReorderBufferChange *volatile specinsert = NULL;
 	volatile bool	stream_started = false;
+	volatile bool	partial_truncate = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1850,7 +2025,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			/* Set the xid for concurrent abort check. */
 			if (streaming)
-				SetupCheckXidLive(change->txn->xid);
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
 
 			switch (change->action)
 			{
@@ -2106,6 +2284,27 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
 			}
+
+			if (streaming)
+			{
+				/*
+				 * Increment the nprocessed count.  See the detailed comment
+				 * for its usage in the ReorderBufferTXN structure.
+				 */
+				curtxn->nprocessed++;
+
+				/*
+				 * If the transaction contains an incomplete tuple and this is
+				 * the last complete change, then stop further processing of
+				 * the transaction and set the partial truncate flag to true.
+				 */
+				if (rbtxn_has_incomplete_tuple(txn) &&
+					prev_lsn == txn->last_complete_lsn)
+				{
+					partial_truncate = true;
+					break;
+				}
+			}
 		}
 
 		/*
@@ -2125,7 +2324,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * Done with current changes, call stream_stop callback for streaming
-		 * transaction, commit callback otherwise.  If we have sent
+		 * transaction, commit callback otherwise.  Only If we have sent
 		 * start/begin.
 		 */
 		if (stream_started)
@@ -2176,7 +2375,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, partial_truncate);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2236,6 +2435,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
+			curtxn->concurrent_abort = true;
 
 			/* Handle the concurrent abort. */
 			ReorderBufferHandleConcurrentAbort(rb, txn, snapshot_now,
@@ -2510,7 +2710,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2559,7 +2759,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2582,6 +2782,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2596,8 +2797,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2605,12 +2811,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2853,18 +3067,29 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+	Size		largest_size = 0;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
+		Size		size;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/*
+		 * If this transaction has incomplete changes, only consider the
+		 * size up to the last complete lsn.
+		 */
+		if (rbtxn_has_incomplete_tuple(txn))
+			size = txn->complete_size;
+		else
+			size = txn->total_size;
+
+		/* If the current transaction is larger, remember it. */
+		if ((largest == NULL || size > largest_size) && size > 0)
+		{
 			largest = txn;
+			largest_size = size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2902,27 +3126,22 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		 * Pick the largest transaction (or subtransaction) and evict it from
 		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		if (ReorderBufferCanStream(rb))
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
 		{
-			/*
-			* Pick the largest toplevel transaction and evict it from memory by
-			* streaming the already decoded part.
-			*/
-			txn = ReorderBufferLargestTopTXN(rb);
-
 			/* we know there has to be one, because the size is not zero */
 			Assert(txn && !txn->toptxn);
-			Assert(txn->size > 0);
-			Assert(rb->size >= txn->size);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
 		{
 			/*
-			* Pick the largest transaction (or subtransaction) and evict it from
-			* memory by serializing it to disk.
-			*/
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
 			txn = ReorderBufferLargestTXN(rb);
 
 			/* we know there has to be one, because the size is not zero */
@@ -2931,14 +3150,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(rb->size >= txn->size);
 
 			ReorderBufferSerializeTXN(rb, txn);
-		}
 
-		/*
-		 * After eviction, the transaction should have no entries in memory,
-		 * and should use 0 bytes for changes.
-		 */
-		Assert(txn->size == 0);
-		Assert(txn->nentries_mem == 0);
+			/*
+			 * After eviction, the transaction should have no entries in memory, and
+			 * should use 0 bytes for changes.
+			 */
+			Assert(txn->size == 0);
+			Assert(txn->nentries_mem == 0);
+		}
 	}
 
 	/* We must be under the memory limit now. */
@@ -3320,10 +3539,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
-
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
 }
 
 /*
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index c38f7345b9..1556e6fa00 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -171,6 +171,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -190,6 +192,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes include a toast insert, without the main insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes include a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -198,10 +220,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -347,6 +365,26 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* Size of the changes up to the last complete change. */
+	Size		complete_size;
+
+	/* LSN of the last complete change. */
+	XLogRecPtr	last_complete_lsn;
+
+	/*
+	 * Number of changes processed.  This is used to keep track of changes
+	 * that remain to be streamed.  As of now, changes can remain due to toast
+	 * tuples or speculative insertions, where we must wait for multiple
+	 * changes before we can send them.
+	 */
+	uint64		nprocessed;
+
+	/* If we have detected a concurrent abort, ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -534,7 +572,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool incomplete_data);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0
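
To make the accounting changes above easier to follow, here is a minimal
standalone sketch of the scheme: every change is charged to its own
(sub)transaction, to the buffer as a whole, and to the enclosing top-level
transaction's total_size, which is what eviction later compares.  The types
and names below are simplified stand-ins, not the actual PostgreSQL
structures.

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct Txn
{
	struct Txn *toptxn;			/* NULL for a top-level transaction */
	size_t		size;			/* changes owned by this (sub)xact */
	size_t		total_size;		/* top-level only: includes subxacts */
} Txn;

typedef struct Buffer
{
	size_t		size;			/* all changes across all transactions */
} Buffer;

static void
change_memory_update(Buffer *rb, Txn *txn, size_t sz, int addition)
{
	/* charge the top-level transaction, whether txn is top-level or not */
	Txn		   *toptxn = txn->toptxn ? txn->toptxn : txn;

	if (addition)
	{
		txn->size += sz;
		rb->size += sz;
		toptxn->total_size += sz;
	}
	else
	{
		assert(rb->size >= sz && txn->size >= sz);
		txn->size -= sz;
		rb->size -= sz;
		toptxn->total_size -= sz;
	}
	assert(txn->size <= rb->size);
}

int
main(void)
{
	Buffer		rb = {0};
	Txn			top = {NULL, 0, 0};
	Txn			sub = {&top, 0, 0};

	change_memory_update(&rb, &sub, 100, 1);
	change_memory_update(&rb, &top, 50, 1);

	/* prints rb=150 top.total=150 sub.size=100 */
	printf("rb=%zu top.total=%zu sub.size=%zu\n",
		   rb.size, top.total_size, sub.size);
	return 0;
}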

v29/v29-0012-Add-streaming-option-in-pg_dump.patch

From 5b5102266116c79f4a858f39e3861798cc610ce8 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v29 12/14] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index a41a3db876..d0fb24e5f8 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBufferStr(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb3a9..af64270c55 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0
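
The pg_dump change is mechanical, but the string handling is worth a note:
catalog booleans arrive from libpq as the strings "t"/"f", so the code
compares against "f" and emits the WITH option only for the non-default
value.  A tiny sketch of that pattern follows; the buffer handling here is
a plain-C simplification of pg_dump's PQExpBuffer usage.

#include <stdio.h>
#include <string.h>

static void
append_subscription_options(char *buf, size_t buflen,
							const char *substream, const char *subsynccommit)
{
	/* emit options only when they differ from the defaults */
	if (strcmp(substream, "f") != 0)
		strncat(buf, ", streaming = on", buflen - strlen(buf) - 1);
	if (strcmp(subsynccommit, "off") != 0)
	{
		strncat(buf, ", synchronous_commit = ", buflen - strlen(buf) - 1);
		strncat(buf, subsynccommit, buflen - strlen(buf) - 1);
	}
}

int
main(void)
{
	char		options[128] = "";

	append_subscription_options(options, sizeof(options), "t", "off");
	/* prints: CREATE SUBSCRIPTION sub1 ... WITH (connect = false, streaming = on); */
	printf("CREATE SUBSCRIPTION sub1 ... WITH (connect = false%s);\n",
		   options);
	return 0;
}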

v29/v29-0009-Enable-streaming-for-all-subscription-TAP-tests.patch

From 0b26ed072709cc4f8e17a9df97d1bcc048e8e524 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v29 09/14] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318fc7c..6f7bedc130 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f871a..4ba80869b9 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0
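
The test changes are intentionally uniform: every CREATE SUBSCRIPTION in the
suite gains WITH (streaming = on), so all existing replication scenarios run
through the streaming code path as well.  For anyone driving this from C
rather than the TAP framework, the equivalent is a single statement over
libpq (a sketch; the host and subscription names are placeholders):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	/* same statement the tests issue via safe_psql() */
	res = PQexec(conn,
				 "CREATE SUBSCRIPTION tap_sub "
				 "CONNECTION 'host=publisher dbname=postgres' "
				 "PUBLICATION tap_pub WITH (streaming = on)");
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
		fprintf(stderr, "CREATE SUBSCRIPTION failed: %s",
				PQerrorMessage(conn));

	PQclear(res);
	PQfinish(conn);
	return 0;
}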

v29/v29-0001-Immediately-WAL-log-subtransaction-and-top-level.patch

From a69048829e2d4773cc0ba939e2211101ad79c665 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v29 01/14] Immediately WAL-log subtransaction and top-level
 XID association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead) only when wal_level=logical.
We cannot remove the existing XLOG_XACT_ASSIGNMENT record, as that is
required to avoid overflow of the hot standby snapshot.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 ++++++++++-
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 44 +++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 905dc7d8d3..a93fb8a4f0 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5120,6 +5122,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6022,3 +6025,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL log the top-level XID for
+ * operation in a subtransaction.  We require that for logical decoding, see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09e..c526bb1928 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4f46..a757baccfc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1197,6 +1197,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1235,6 +1236,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..0c0c371739 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign the subxact to the toplevel xact while processing
+			 * each record, if required, so we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index db191879b9..aef8555367 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 347a38f57c..a5468c1037 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b0f2a6ed43..b976882229 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -304,6 +306,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0
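
The mechanism above is easiest to see as a small state machine: a
subtransaction's first WAL record after it gets an XID carries the
top-level XID piggy-backed in the record header, and every later record
carries nothing extra.  On the decode side, LogicalDecodingProcessRecord
calls ReorderBufferAssignChild whenever XLogRecGetTopXid() returns a valid
XID.  A toy model of the insert-side logic (stand-in types; the real code
keeps the flag in TransactionStateData):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

typedef struct SubXactState
{
	TransactionId xid;			/* this subxact's XID, 0 if unassigned */
	TransactionId top_xid;		/* XID of the top-level transaction */
	bool		assigned;		/* already piggy-backed on a record? */
} SubXactState;

/* mirrors IsSubTransactionAssignmentPending(), minus the global state */
static bool
assignment_pending(const SubXactState *s, bool wal_level_logical)
{
	return wal_level_logical && s->xid != 0 && !s->assigned;
}

/* emit one WAL record; attach the top-level XID the first time only */
static void
emit_record(SubXactState *s, const char *desc)
{
	if (assignment_pending(s, true))
	{
		printf("record %-10s carries toplevel_xid=%u\n",
			   desc, (unsigned) s->top_xid);
		s->assigned = true;		/* MarkSubTransactionAssigned() */
	}
	else
		printf("record %-10s (no assignment needed)\n", desc);
}

int
main(void)
{
	SubXactState sub = {1001, 1000, false};

	emit_record(&sub, "INSERT");	/* first record: includes 1000 */
	emit_record(&sub, "UPDATE");	/* later records: nothing extra */
	return 0;
}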

v29/v29-0013-Change-buffile-interface-required-for-streaming-.patch

From 982d3fc0f173c313abb4f489a34b3f7077ec4f3a Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 16:40:25 +0530
Subject: [PATCH v29 13/14] Change buffile interface required for streaming
 transaction

Implement BufFileTruncate and SEEK_END support in BufFileSeek.  Also add
an option to provide a mode when opening shared BufFiles, instead of
always opening them in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 81 ++++++++++++++++++++---
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 21 ++++--
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  3 +-
 10 files changed, 103 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a55ccc0c03..a9fbe41f8e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349b69..bde6fa1ef3 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,12 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.  The
+ * BufFile infrastructure can also be used by a single backend when the
+ * files need to survive across transactions and must be opened and closed
+ * multiple times.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +279,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +303,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +323,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -666,11 +668,21 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The size of the last file gives us the end offset of the
+			 * BufFile.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +851,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop from the last file down to the fileno we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files other than the fileno file can be deleted directly.  If the
+		 * offset is 0, the fileno file can be deleted as well, unless it is
+		 * the first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 7dc6dd2f15..060811ca78 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1741,18 +1741,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index f7206c9175..c81d298fc3 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -34,16 +34,22 @@ static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name)
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
  *
  * Under the covers the set is one or more directories which will eventually
  * be deleted when there are no backends attached.
+ *
+ * We can also use this interface if the temporary files are used only by
+ * one backend but need to be opened and closed multiple times and the
+ * underlying files need to survive across transactions.  In such cases,
+ * pass NULL for the dsm segment, so that the files are deleted on
+ * process exit.
  */
 void
 SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
@@ -68,7 +74,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
 }
 
 /*
@@ -131,13 +138,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59c50..788815cdab 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a4303b..b83fb50dac 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6352ff945a..0dfbac46b4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752bab0d..fc34c49522 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d7df..e209f047e8 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf077e5..b2f4ba4bd8 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,7 +37,8 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
-- 
2.23.0
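
The trickiest part of this patch is BufFileTruncateShared: a BufFile is a
chain of physical segments of MAX_PHYSICAL_FILESIZE bytes each, so
truncating to (fileno, offset) means deleting whole segments past the
target and shortening the target segment itself.  A simplified model of
that loop (plain longs instead of File handles; segment 0 is always kept):

#include <stdio.h>

#define MAX_PHYSICAL_FILESIZE (1024L * 1024 * 1024)

/* seglen[i] is the byte length of physical segment i */
static void
truncate_segments(long *seglen, int *nsegs, int fileno, long offset)
{
	int			i;

	for (i = *nsegs - 1; i >= fileno; i--)
	{
		/*
		 * Segments past the target, and the target itself when offset is
		 * 0, can be removed outright (but always keep segment 0).
		 */
		if ((i != fileno || offset == 0) && i != 0)
		{
			seglen[i] = 0;		/* "delete" the segment */
			(*nsegs)--;
		}
		else
			seglen[i] = offset; /* shorten the target segment */
	}
}

int
main(void)
{
	long		segs[4] = {MAX_PHYSICAL_FILESIZE, MAX_PHYSICAL_FILESIZE,
						   MAX_PHYSICAL_FILESIZE, 4096};
	int			nsegs = 4;

	truncate_segments(segs, &nsegs, 1, 100);
	/* prints: segments=2, seg1=100 bytes */
	printf("segments=%d, seg1=%ld bytes\n", nsegs, segs[1]);
	return 0;
}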

v29/v29-0003-Extend-the-output-plugin-API-with-stream-methods.patch

From 3feb04ea6bcc1f0308db32eef6a6349de94eea22 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v29 03/14] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 213 +++++++++++++
 src/backend/replication/logical/logical.c | 365 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 811 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93cf6b..50cfd6fa47 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +869,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, and some of the
+    transactions may get aborted, as sketched below.
+   </para>
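+
+   <para>
+    For example, a transaction that is ultimately rolled back might be
+    streamed and then aborted, along these lines:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of a block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+stream_stop_cb(...);    &lt;-- end of the block of changes
+
+stream_abort_cb(...);   &lt;-- abort of the streamed transaction
+</programlisting>
+   </para>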
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting. At that point the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.  However, in some
+    cases we still have to spill to disk even if streaming is enabled,
+    because we may cross the memory limit before the complete tuple has been
+    decoded, e.g. having decoded only the toast-table insert but not yet the
+    insert into the main table.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be3b0..26d461effb 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require the start/stop/change/commit/abort
+	 * callbacks. The message and truncate callbacks are optional, similar
+	 * to regular output plugins. We enable streaming when at least one of
+	 * the stream methods is defined, so that missing (required) methods
+	 * can be identified easily.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional, so
+	 * we do not fail with ERROR when they are missing; the wrappers simply
+	 * do nothing in that case. We must still set the ReorderBuffer
+	 * callbacks to something, otherwise the calls from there would crash
+	 * (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,321 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9cd645d0ec..24b4dd65d6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -356,6 +356,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
 struct ReorderBuffer
 {
 	/*
@@ -394,6 +442,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0

v29/v29-0005-Implement-streaming-mode-in-ReorderBuffer.patch:

From bb2d712eaf798edfdb507acd3acb7d5dd1057fcc Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Wed, 17 Jun 2020 18:20:30 +0530
Subject: [PATCH v29 05/14] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we
have in memory and invoke the new stream API methods. This happens in
ReorderBufferStreamTXN(), using roughly the same logic as
ReorderBufferCommit().  However, if we have an incomplete toast or
speculative insert, we sometimes spill to disk anyway, because we
cannot yet assemble the complete tuple to stream.  As soon as we get
the complete tuple, we stream the transaction, including the
serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in the WAL right away, and
thanks to logging the invalidation messages.

It also adds a ReorderBufferTXN pointer in two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to the toplevel xact (from a subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 755 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  26 +
 3 files changed, 743 insertions(+), 76 deletions(-)
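
Taken together, the memory-limit handling this patch arrives at looks
roughly like the following sketch (an illustration, not the patch code;
ReorderBufferSerializeTXN is the pre-existing spill-to-disk routine, and
the real loop also falls back to spilling in the incomplete-tuple case
mentioned above):

static void
ReorderBufferCheckMemoryLimit_sketch(ReorderBuffer *rb)
{
	/* logical_decoding_work_mem is in kilobytes */
	while (rb->size >= logical_decoding_work_mem * 1024L)
	{
		ReorderBufferTXN *txn;

		if (ReorderBufferCanStream(rb))
		{
			/* stream the largest toplevel transaction */
			txn = ReorderBufferLargestTopTXN(rb);
			ReorderBufferStreamTXN(rb, txn);
		}
		else
		{
			/* spill the largest (sub)transaction to disk */
			txn = ReorderBufferLargestTXN(rb);
			ReorderBufferSerializeTXN(rb, txn);
		}
	}
}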

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 364a5bba6d..709f5f1d41 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +382,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -767,6 +781,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -863,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	/* set the reference to top-level transaction */
 	subtxn->toptxn = txn;
 
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -1022,6 +1071,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1036,6 +1088,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1313,6 +1368,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Clean up the snapshot from the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1338,6 +1402,80 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they were originally created inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
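
(The rbtxn_is_streamed() test used in these hunks is presumably a simple
flag check added to reorderbuffer.h, along the lines of this sketch.)

/* Presumed definition, matching the RBTXN_IS_STREAMED flag set above. */
#define rbtxn_is_streamed(txn) \
	(((txn)->txn_flags & RBTXN_IS_STREAMED) != 0)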
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1489,57 +1627,171 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that
+ * the (sub)transaction might get aborted concurrently.  In that case, if
+ * the (sub)transaction has catalog updates, we might decode the tuples
+ * using the wrong catalog version.  So, to detect a concurrent abort, we
+ * set CheckXidAlive to the xid of the (sub)transaction that the current
+ * change belongs to.  During catalog scans we can then check the status of
+ * that xid and, if it has aborted, report a specific error, so that we can
+ * stop streaming the current transaction and discard the already streamed
+ * changes.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine, because when we decode the
+ * abort we will stream an abort message to truncate the changes on the
+ * subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the transaction is not committed yet. We don't
+	 * check whether the xid aborted; that will happen during catalog access.
+	 * Also, reset the bsysscan flag.
+	 */
+	if (!TransactionIdDidCommit(xid))
+	{
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
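
The counterpart check happens during catalog access; it is added elsewhere
in this patch series, but presumably amounts to something like this sketch:

/*
 * Sketch of the concurrent-abort check performed during catalog scans;
 * the exact placement and wording in the patch series may differ.
 */
static inline void
CheckConcurrentAbortSketch(void)
{
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}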
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Store the command id and snapshot at the end of the current stream, so
+ * that we can reuse them when sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.
+ */
+static void
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   Snapshot snapshot_now,
+								   CommandId command_id,
+								   XLogRecPtr last_lsn,
+								   ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+	ReorderBufferToastReset(rb, txn);
+	if (specinsert != NULL)
+		ReorderBufferReturnChange(rb, specinsert);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. If streaming is true, the data is sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool	stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1562,21 +1814,44 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
-
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * On the first change of the current run, start the stream
+			 * (when streaming) or begin the transaction (otherwise).
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+					rb->stream_start(rb, txn, change->lsn);
+				else
+					rb->begin(rb, txn);
+				stream_started = true;
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1653,7 +1928,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1693,7 +1969,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1751,7 +2027,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1760,10 +2039,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1794,7 +2070,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1848,14 +2123,34 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes. If we have sent the stream start
+		 * (or begin), call the stream_stop callback for a streaming
+		 * transaction, and the commit callback otherwise.
+		 */
+		if (stream_started)
+		{
+			if (streaming)
+				rb->stream_stop(rb, txn, prev_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+			stream_started = false;
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot, if we have copied it.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1873,14 +2168,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1899,17 +2207,122 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/* Reset the CheckXidAlive */
+		if (streaming)
+			CheckXidAlive = InvalidTransactionId;
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If the error code is ERRCODE_TRANSACTION_ROLLBACK, that means we
+		 * have detected a concurrent abort of the (sub)transaction we are
+		 * streaming.  So just do the cleanup and return gracefully.
+		 * Otherwise, re-throw the error.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * We can get this error only in streaming mode, because only in
+			 * streaming mode do we send in-progress transactions.
+			 */
+			Assert(streaming);
+
+			/*
+			 * In the TRY block we stop the stream only after we have sent
+			 * all the changes.  So if we have detected a concurrent abort,
+			 * the stream cannot have been stopped yet.
+			 */
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
 
-		PG_RE_THROW();
+			/* Handle the concurrent abort. */
+			ReorderBufferHandleConcurrentAbort(rb, txn, snapshot_now,
+											   command_id, prev_lsn,
+											   specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
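
For orientation, ReorderBufferStreamTXN() (declared earlier, defined
further down in this patch) drives the routine above with streaming = true.
Roughly, as a sketch that ignores the subtransaction snapshot handling the
real function also performs:

/*
 * Sketch only; the actual ReorderBufferStreamTXN() in this patch also
 * transfers snapshots from subtransactions and asserts change ordering.
 */
static void
ReorderBufferStreamTXN_sketch(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	Snapshot	snapshot_now;
	CommandId	command_id;

	if (!rbtxn_is_streamed(txn))
	{
		/* first streamed run: start from the base snapshot */
		command_id = FirstCommandId;
		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
											 txn, command_id);
	}
	else
	{
		/* resume from the state saved at the end of the previous run */
		command_id = txn->command_id;
		snapshot_now = txn->snapshot_now;
	}

	/* process available changes, sending them via the stream API */
	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
							command_id, true);
}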
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.  We iterate over the top and
+ * subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1934,6 +2347,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2003,6 +2423,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2136,8 +2563,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we update the counters of the toplevel
+ * transaction instead - we can't stream subtransactions individually
+ * anyway, and we only pick toplevel transactions for eviction, so only
+ * those matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2145,6 +2581,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2156,19 +2593,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
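
To illustrate the accounting above, a worked example (with hypothetical
sizes):

/*
 * Toplevel txn T with subxact S, streaming enabled. Adding a 100-byte
 * change to S is accounted against T, not S:
 *
 *   before:   T.size = 500, S.size = 0, rb->size = 500
 *   add to S: T.size = 600, S.size = 0, rb->size = 600
 *
 * ReorderBufferLargestTopTXN() therefore sees the full footprint on T,
 * while S.size stays 0 whenever streaming is supported.
 */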
 
 /*
@@ -2197,6 +2643,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2389,6 +2836,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming, so their size is always 0).
+ * But we can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
 * pick the largest (sub)transaction at-a-time to evict and spill its changes to
@@ -2421,11 +2900,38 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStream(rb))
+		{
+			/*
+			 * Pick the largest toplevel transaction and evict it from memory
+			 * by streaming the already decoded part.
+			 */
+			txn = ReorderBufferLargestTopTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2723,6 +3229,103 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, in which case the
+ * commit would attempt to stream it again)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there's no need to
+		 * walk through the subxacts again). In fact, we must not do that as
+		 * we may be using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because after the last
+		 * streaming run we might have acquired some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3822,6 +4425,16 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases we assume the CID is from the future command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 24b4dd65d6..c38f7345b9 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -170,6 +170,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -189,6 +190,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -256,6 +275,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0

v29/v29-0008-Add-support-for-streaming-to-built-in-replicatio.patch

From b29b7781b6042a8a9d3e6917770b5380cef1e979 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 15:34:29 +0530
Subject: [PATCH v29 08/14] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, so as to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so there is nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   11 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1012 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 +++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 20 files changed, 2019 insertions(+), 40 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index c24ace14d1..d8de56c928 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 5bbc165f70..c25b7c5962 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026187..9065a1be1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c022597bc0..a55ccc0c03 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4138,6 +4138,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..83d0642cf3 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
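
To summarize the wire format implied by the read/write pairs above (a
reference sketch; byte values per the pq_sendbyte calls):

/*
 * Streaming protocol messages:
 *
 *   'S' STREAM START:  int32 xid, int8 first_segment (1 or 0)
 *   'E' STREAM STOP:   (no payload)
 *   'c' STREAM COMMIT: int32 xid, int8 flags (currently 0),
 *                      int64 commit_lsn, int64 end_lsn, int64 commit_time
 *   'A' STREAM ABORT:  int32 xid, int32 subxid
 *
 * The regular data messages ('I', 'U', 'D', 'R', 'Y', 'T') additionally
 * carry a leading int32 subtransaction XID when sent as part of a stream.
 */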
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..d2d9469999 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions has
+ * to cope with aborts of both the toplevel transaction and subtransactions.
+ * This is achieved by tracking offsets for subtransactions, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.
+ * This is necessary so that different workers processing a remote transaction
+ * with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -100,6 +124,7 @@ typedef struct SlotErrCallbackArg
 } SlotErrCallbackArg;
 
 static MemoryContext ApplyMessageContext = NULL;
+static MemoryContext LogicalStreamingContext = NULL;
 MemoryContext ApplyContext = NULL;
 
 WalReceiverConn *wrconn = NULL;
@@ -110,12 +135,58 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId	xid;				/* XID of the subxact */
+	off_t			offset;				/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * Array of serialized XIDs.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +258,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the change to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +659,326 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, restore the subxact info from the
+	 * previous run.
+	 *
+	 * XXX Note that the cleanup of stale files is performed by
+	 * stream_open_file.
+	 */
+	if (!first_segment)
+	{
+		MemoryContext oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+
+		/* Read the subxacts info in per-stream context. */
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+		MemoryContextSwitchTo(oldctx);
+	}
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+	{
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		int			fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here so just cleanup the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
+		if (fd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+		}
+
+		/* OK, truncate the file at the right offset. */
+		if (ftruncate(fd, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+		CloseTransientFile(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
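
For example (hypothetical XIDs and offsets), aborting a subxact truncates
the spool file back to where that subxact started and discards it together
with all later entries:

/*
 *   subxacts = { {xid=501, offset=0}, {xid=502, offset=8192},
 *                {xid=503, offset=12288} }
 *
 * On STREAM ABORT of subxid 502 we find subidx = 1, ftruncate() the
 * .changes file to 8192 bytes, and set nsubxacts = 1, so only subxact
 * 501 survives (502 and the later 503 are discarded).
 */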
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +992,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1010,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1049,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1167,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1312,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1685,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1826,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1938,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids-1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1970,17 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode
+	 * is enabled.  This context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													 ALLOCSET_DEFAULT_SIZES);
+
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2429,529 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
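
The resulting file layout is deliberately simple (a sketch matching the
two writes above):

/*
 * Layout of the .subxacts file:
 *
 *   uint32      nsubxacts;             number of entries
 *   SubXactInfo subxacts[nsubxacts];   {xid, offset} pairs, in the
 *                                      order the subxacts were seen
 */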
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Let the caller decide in which memory context this will be allocated.
+	 * Ideally, during stream start it is allocated in the
+	 * LogicalStreamingContext, which is reset on stream stop; during stream
+	 * abort we need this memory only short-term, so it is allocated in the
+	 * ApplyMessageContext.
+	 */
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd >= 0);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - don't report an error for a missing file if the flag is true.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by replacing it with the last element. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions), so simply loop
+	 * through the array and find the index of the XID.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry of the array into its place. We don't keep
+	 * the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids-1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		/* Need to allocate this in permanent context */
+		oldcxt = MemoryContextSwitchTo(ApplyContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. a sorted array, to
+		 * speed up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: length (not counting the
+ * length field itself), action code (identifying the message type), and
+ * the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
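For reference, the reader on the apply side simply reverses this format. A
minimal sketch, assuming the layout written above (int length, char action
code, payload); the function name stream_read_change and the error message
are illustrative, not part of the patch:

    static char
    stream_read_change(int fd, StringInfo s)
    {
        int         len;
        char        action;

        /* on-disk size, including the action type character */
        if (read(fd, &len, sizeof(len)) != sizeof(len))
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not read streamed change: %m")));

        /* the action code identifying the message type */
        if (read(fd, &action, sizeof(action)) != sizeof(action))
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not read streamed change: %m")));

        /* the payload is the length minus the action character */
        len -= sizeof(char);

        resetStringInfo(s);
        enlargeStringInfo(s, len + 1);
        s->len = len;

        if (read(fd, s->data, len) != len)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not read streamed change: %m")));

        s->data[len] = '\0';

        return action;
    }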
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3117,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3118..1509f9b826 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order the transactions are sent in.  Also, the (sub)transactions might get
+ * aborted, so we need to send the schema for each (sub)transaction so that
+ * we don't lose the schema information on abort.  To handle this, we
+ * maintain a list of XIDs (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
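With the new option in place, a downstream client requests streaming
explicitly when it starts logical replication. A hypothetical session
against pgoutput might look like this (slot and publication names made up
for illustration):

    START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
        (proto_version '2', publication_names '"tap_pub"', streaming 'on')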
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with a sufficient protocol version, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +373,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's the top-level transaction or not (we have already sent
+	 * that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +423,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +465,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +486,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +518,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +562,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +582,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +607,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +639,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +719,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
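Taken together, these callbacks mean a large transaction is sent as a
sequence of demarcated blocks rather than one monolithic message at commit.
Roughly, the message flow looks like this (a sketch, not literal wire
output):

    stream_start(xid, first_segment = true)
        ... changes (schema messages, insert/update/delete/truncate) ...
    stream_stop()
    stream_start(xid, first_segment = false)
        ... more changes ...
    stream_stop()
    stream_commit(xid)            -- or stream_abort(xid, subxid)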
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +840,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record the given xid in the rel sync entry, marking that we have already
+ * sent the schema of the relation in that streamed transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -753,11 +1002,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 06e4955de7..5f74ca1eed 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -157,6 +157,12 @@ create_logical_replication_slot(char *name, char *plugin,
 											   .segment_close = wal_segment_close),
 									NULL, NULL, NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d0c0674848..ffc3d50081 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
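Assuming the grammar changes elsewhere in the series expose substream as a
subscription option named streaming (the option name and its default are
assumptions here, based on the catalog column), enabling it would look
like:

    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION tap_pub
        WITH (streaming = on);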
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..6352ff945a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -981,7 +981,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ac1acbb27e..95132062c6 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check columns from mid-transaction DDL are populated as expected');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes were not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL, DML, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check streamed transaction with DDL and rollback was applied correctly');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v29/v29-0002-Issue-individual-invalidations-with-wal_level-lo.patch

From 08c58f89132288aaff7fcaf3ad1731b00e983be5 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v29 02/14] Issue individual invalidations with
 wal_level=logical.

When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not
need to be changed.

The individual invalidations are written using a new xlog record type
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).

LogStandbyInvalidations accumulated all the invalidations in memory
and wrote them out only once, at commit time, which reduced the
performance impact by amortizing the overhead and deduplicating the
invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c        | 40 +++++++++++++
 src/backend/access/transam/xact.c             |  7 +++
 src/backend/replication/logical/decode.c      | 60 +++++++++++--------
 .../replication/logical/reorderbuffer.c       | 54 +++++++++++++----
 src/backend/utils/cache/inval.c               | 57 ++++++++++++++++++
 src/include/access/xact.h                     | 13 +++-
 src/include/replication/reorderbuffer.h       | 11 ++++
 7 files changed, 206 insertions(+), 36 deletions(-)
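To illustrate the replay step the commit message describes: the queued
invalidation change is executed when the transaction is replayed, roughly
as sketched below. The helper name ReorderBufferExecuteInvalidations and
the change->data.inval field names are assumptions for illustration, not
necessarily the actual hunk:

    case REORDER_BUFFER_CHANGE_INVALIDATION:
        /* execute the invalidations accumulated for this change */
        ReorderBufferExecuteInvalidations(change->data.inval.ninvalidations,
                                          change->data.inval.invalidations);
        break;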

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce75565f..404d988625 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+			appendStringInfo(buf, " snapshot %u", msg->sn.relId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a93fb8a4f0..d93b40f2f8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6022,6 +6022,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX We ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371739..3cbbf589ed 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * If the invalidations are for a transaction, append them to
+				 * that transaction's invalidation list.  Otherwise (invalid
+				 * xid), execute the invalidations immediately.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
-
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We now WAL-log the command-level invalidations with
+			 * XLOG_XACT_INVALIDATIONS, so we don't need to handle them
+			 * here again.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 642a1c767f..364a5bba6d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -860,6 +860,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -1824,7 +1827,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 					break;
-
 				case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
@@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
 		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
 										   txn->invalidations);
-	else
-		Assert(txn->ninvalidations == 0);
 
 	/* remove potential on-disk data, and deallocate */
 	ReorderBufferCleanupTXN(rb, txn);
@@ -2216,17 +2216,40 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that
+	 * we can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/*
+	 * If there are no invalidations for this transaction yet, allocate the
+	 * array and copy the invalidations into it.  Otherwise, enlarge the
+	 * existing array and append the new invalidations to the ones already
+	 * collected.
+	 */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2254,6 +2277,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark the top-level transaction as having catalog changes too if one
+	 * of its children has, so that ReorderBufferBuildTupleCidHash can
+	 * check just the top-level transaction and decide whether to build
+	 * the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..d81999747a 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,10 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end
+ *	to support decoding of in-progress transactions.  Previously it was
+ *	enough to log invalidations only at commit, because transactions were
+ *	only decoded at commit time.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +108,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
#include "storage/smgr.h"
+#include "storage/standby.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +215,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1094,6 +1101,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1513,48 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+										MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+							nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555367..ac3f5e3b60 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..9cd645d0ec 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -149,6 +149,14 @@ typedef struct ReorderBufferChange
 			CommandId	cmax;
 			CommandId	combocid;
 		}			tuplecid;
+
+		/* Invalidation. */
+		struct
+		{
+			uint32		ninvalidations;		/* Number of messages */
+			SharedInvalidationMessage *invalidations;	/* invalidation
+														 * message */
+		}			inval;
 	}			data;
 
 	/*
@@ -220,6 +228,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
-- 
2.23.0
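
For reference, the decode side consumes these records roughly as in the
sketch below.  This is an illustrative reconstruction, not a hunk from the
patch: the handler name and its call site in decode.c are assumptions,
while xl_xact_invalidations and ReorderBufferAddInvalidations are the ones
defined above.

/*
 * Sketch only: a decode.c-style handler for XLOG_XACT_INVALIDATIONS.
 */
static void
DecodeXactInvalidations(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
	XLogReaderState *r = buf->record;
	xl_xact_invalidations *invals;

	invals = (xl_xact_invalidations *) XLogRecGetData(r);

	/*
	 * Append this command's messages to the transaction.  Because
	 * ReorderBufferAddInvalidations redirects subtransactions to their
	 * top-level transaction and appends rather than replaces, one record
	 * per command end simply accumulates into txn->invalidations.
	 */
	ReorderBufferAddInvalidations(ctx->reorder, XLogRecGetXid(r),
								  buf->origptr, invals->nmsgs,
								  invals->msgs);
}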

v29/v29-0007-Track-statistics-for-streaming.patch:

From 00321fbd62d3a0a27321f8eb863699cced05d74e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 15:26:18 +0530
Subject: [PATCH v29 07/14] Track statistics for streaming

---
 doc/src/sgml/monitoring.sgml                  | 33 +++++++++++++++++++
 src/backend/catalog/system_views.sql          |  5 ++-
 .../replication/logical/reorderbuffer.c       | 12 +++++++
 src/backend/replication/walsender.c           | 32 +++++++++++++++---
 src/include/catalog/pg_proc.dat               |  6 ++--
 src/include/replication/reorderbuffer.h       | 13 +++++---
 src/include/replication/walsender_private.h   |  5 +++
 src/test/regress/expected/rules.out           |  7 ++--
 8 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dfa9d0d641..8a40639e39 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2498,6 +2498,39 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Amount of decoded transaction data spilled to disk.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_txns</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of in-progress transactions streamed to subscriber after
+       memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+       Streaming only works with top-level transactions (subtransactions can't
+       be streamed independently), so the counter does not get incremented for
+       subtransactions.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times in-progress transactions were streamed to subscriber.
+       Transactions may get streamed repeatedly, and this counter gets incremented
+       on every such invocation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stream_bytes</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Amount of decoded in-progress transaction data streamed to subscriber.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5314e9348f..3db900d2e6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -788,7 +788,10 @@ CREATE VIEW pg_stat_replication AS
             W.reply_time,
             W.spill_txns,
             W.spill_count,
-            W.spill_bytes
+            W.spill_bytes,
+            W.stream_txns,
+            W.stream_count,
+            W.stream_bytes
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8aae7e9f76..e1b3201daa 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -348,6 +348,10 @@ ReorderBufferAllocate(void)
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
 
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3534,6 +3538,14 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->snapshot_now = NULL;
 	}
 
+	/* Update the stream statistics. */
+	rb->streamCount += 1;
+	rb->streamBytes += (rbtxn_has_incomplete_tuple(txn)) ?
+						txn->complete_size : txn->total_size;
+
+	/* Don't consider already streamed transaction. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	/*
 	 * Access the main routine to decode the changes and send to output plugin.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e2477c47e0..d0c0674848 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1349,7 +1349,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the lag tracker (see XLogSendPhysical),
- * and update the spill statistics.
+ * and update the spill/stream statistics.
  */
 static void
 WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
@@ -1370,7 +1370,8 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	sendTime = now;
 
 	/*
-	 * Update statistics about transactions that spilled to disk.
+	 * Update statistics about transactions that spilled to disk or streamed to
+	 * subscriber (before being committed).
 	 */
 	UpdateSpillStats(ctx);
 }
@@ -2421,6 +2422,9 @@ InitWalSenderSlot(void)
 			walsnd->spillTxns = 0;
 			walsnd->spillCount = 0;
 			walsnd->spillBytes = 0;
+			walsnd->streamTxns = 0;
+			walsnd->streamCount = 0;
+			walsnd->streamBytes = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -3256,7 +3260,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	15
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3314,6 +3318,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int64		spillCount;
 		int64		spillBytes;
 		bool		is_sync_standby;
+		int64		streamTxns;
+		int64		streamCount;
+		int64		streamBytes;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 		int			j;
@@ -3339,6 +3346,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		spillTxns = walsnd->spillTxns;
 		spillCount = walsnd->spillCount;
 		spillBytes = walsnd->spillBytes;
+		streamTxns = walsnd->streamTxns;
+		streamCount = walsnd->streamCount;
+		streamBytes = walsnd->streamBytes;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3441,6 +3451,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[12] = Int64GetDatum(spillTxns);
 			values[13] = Int64GetDatum(spillCount);
 			values[14] = Int64GetDatum(spillBytes);
+
+			/* stream over-sized transactions */
+			values[15] = Int64GetDatum(streamTxns);
+			values[16] = Int64GetDatum(streamCount);
+			values[17] = Int64GetDatum(streamBytes);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3683,11 +3698,18 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
 {
 	ReorderBuffer *rb = ctx->reorder;
 
-	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld",
+	elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
 		 rb,
 		 (long long) rb->spillTxns,
 		 (long long) rb->spillCount,
-		 (long long) rb->spillBytes);
+		 (long long) rb->spillBytes,
+		 (long long) rb->streamTxns,
+		 (long long) rb->streamCount,
+		 (long long) rb->streamBytes);
 
 	SpinLockAcquire(&MyWalSnd->mutex);
	MyWalSnd->spillTxns = rb->spillTxns;
+	/* Update stream statistics under the same mutex as the spill stats. */
+	MyWalSnd->streamTxns = rb->streamTxns;
+	MyWalSnd->streamCount = rb->streamCount;
+	MyWalSnd->streamBytes = rb->streamBytes;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 61f2c2f5b4..7869f721da 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5237,9 +5237,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1556e6fa00..b066202831 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -549,15 +549,20 @@ struct ReorderBuffer
 	Size		size;
 
 	/*
-	 * Statistics about transactions spilled to disk.
+	 * Statistics about transactions streamed or spilled to disk.
 	 *
-	 * A single transaction may be spilled repeatedly, which is why we keep
-	 * two different counters. For spilling, the transaction counter includes
-	 * both toplevel transactions and subtransactions.
+	 * A single transaction may be streamed/spilled repeatedly, which is
+	 * why we keep two different counters. For spilling, the transaction
+	 * counter includes both toplevel transactions and subtransactions.
+	 * For streaming, it only includes toplevel transactions (we never
+	 * stream individual subtransactions).
 	 */
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 734acec2a4..b997d1710e 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -83,6 +83,11 @@ typedef struct WalSnd
 	int64		spillTxns;
 	int64		spillCount;
 	int64		spillBytes;
+
+	/* Statistics for in-progress transactions streamed to subscriber. */
+	int64		streamTxns;
+	int64		streamCount;
+	int64		streamBytes;
 } WalSnd;
 
 extern WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b813e32215..cf22f8a038 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2005,9 +2005,12 @@ pg_stat_replication| SELECT s.pid,
     w.reply_time,
     w.spill_txns,
     w.spill_count,
-    w.spill_bytes
+    w.spill_bytes,
+    w.stream_txns,
+    w.stream_count,
+    w.stream_bytes
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
-- 
2.23.0
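
The two conditional expressions in the ReorderBufferStreamTXN hunk above
pack the accounting rules rather tightly; written out long-hand they amount
to the following (a condensed restatement, with the helper name invented
for illustration):

static void
UpdateStreamStatsSketch(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	/* Every streaming invocation bumps the count... */
	rb->streamCount += 1;

	/*
	 * ...but a transaction is counted only once, on its first stream-out,
	 * while rbtxn_is_streamed(txn) is still false.
	 */
	if (!rbtxn_is_streamed(txn))
		rb->streamTxns += 1;

	/*
	 * If the transaction still has incomplete tuples, only the complete
	 * part is streamed, so count just those bytes.
	 */
	rb->streamBytes += rbtxn_has_incomplete_tuple(txn) ?
		txn->complete_size : txn->total_size;
}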

v29/v29-0011-Provide-new-api-to-get-the-streaming-changes.patch:

From d062dae25d511e8504c45ada7e72b3a6a30ed915 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v29 11/14] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 doc/src/sgml/test-decoding.sgml               | 22 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..eed6e9d134 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3db900d2e6..5e223f87f1 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1243,6 +1243,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e848..70c28ffa91 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes then disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7869f721da..875e0bef28 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10115,6 +10115,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0

v29/v29-0010-Add-TAP-test-for-streaming-vs.-DDL.patch:

From b670a059c23d1e5ea3f25fbb9bab85cd294eacdb Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v29 10/14] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v29/v29-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch:

From 0770e1299eb3baacdcf4debca6f96c5961cf04f5 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v29 04/14] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. The decoding logic on the receipt
of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 50cfd6fa47..ab689f8d19 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 537913d1bb..287a185d9c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * the tableam API level, but heap_getnext is called from many places,
+	 * so we need to ensure it here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted.  We can't directly use
+ * TransactionIdDidAbort, because after a crash such a transaction might
+ * not have been marked as aborted.  See detailed comments at snapmgr.c
+ * where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..2f52b407c6 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..9f1ecd123f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  Such a
+ * transaction can get aborted while decoding is still ongoing, in which
+ * case we skip decoding that particular transaction.  To ensure that, we
+ * check whether CheckXidAlive has aborted after fetching each tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index eb18739c36..2b7d3df617 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0
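
To make the documented rule concrete, here is a minimal sketch of an output
plugin reading a user catalog table under this scheme.  The function and the
plain sequential scan are illustrative; the systable_* calls are the APIs
the patch instruments.

#include "access/genam.h"
#include "access/table.h"

static void
scan_my_user_catalog(Oid relid)
{
	Relation	rel = table_open(relid, AccessShareLock);
	SysScanDesc scan;
	HeapTuple	tuple;

	/* sequential scan, no index; the scan machinery picks the snapshot */
	scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);

	while ((tuple = systable_getnext(scan)) != NULL)
	{
		/*
		 * Inspect the tuple here.  If the transaction being decoded
		 * aborts concurrently, systable_getnext raises an ERROR with
		 * ERRCODE_TRANSACTION_ROLLBACK and decoding of that transaction
		 * is abandoned gracefully.  Direct heap_* or table_* access
		 * would instead trip the "unexpected ... call during logical
		 * decoding" checks added above.
		 */
	}

	systable_endscan(scan);
	table_close(rel, AccessShareLock);
}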

#376Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#368)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 16, 2020 at 2:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have a few more comments on the patch
0013-Change-buffile-interface-required-for-streaming-.patch:

Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru:
1.
The subxact file is only create if there
+ * are any suxact info under this xid.
+ */
+typedef struct StreamXidHash

Let's slightly reword the part of the comment as "The subxact file is
created iff there is any suxact info under this xid."

Done

2.
@@ -710,6 +740,9 @@ apply_handle_stream_stop(StringInfo s)
subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
stream_close_file();

+ /* Commit the per-stream transaction */
+ CommitTransactionCommand();

Before calling commit, ensure that we are in a valid transaction. I
think we can have an Assert for IsTransactionState().

Done

3.
@@ -761,11 +791,13 @@ apply_handle_stream_abort(StringInfo s)

int64 i;
int64 subidx;
- int fd;
+ BufFile *fd;
bool found = false;
char path[MAXPGPATH];
+ StreamXidHash *ent;

subidx = -1;
+ ensure_transaction();
subxact_info_read(MyLogicalRepWorker->subid, xid);

Why call ensure_transaction here?  Is there any reason we won't have a
valid transaction by now?  If not, then it's better to have an Assert
for IsTransactionState().

We only have a transaction open from stream_start to stream_stop, so
at stream_abort we will not have one (see the sketch below).
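
A sketch of the resulting pattern (function names are taken from the patch
under discussion, bodies are elided, so treat this as illustrative rather
than the actual code):

static void
apply_handle_stream_start(StringInfo s)
{
	/* ... read stream_xid, open the changes file ... */

	/* open a transaction to back the BufFile/SharedFileSet machinery */
	ensure_transaction();
}

static void
apply_handle_stream_stop(StringInfo s)
{
	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
	stream_close_file();

	/* commit the per-stream transaction */
	Assert(IsTransactionState());
	CommitTransactionCommand();
}

static void
apply_handle_stream_abort(StringInfo s)
{
	/*
	 * Streams are bracketed by start/stop, so no transaction is open
	 * here; one must be started before touching the subxact BufFile.
	 * (xid is read from the message.)
	 */
	ensure_transaction();
	subxact_info_read(MyLogicalRepWorker->subid, xid);
	/* ... discard the aborted changes and subxact info ... */
}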

4.
- if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+ if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
{
- int save_errno = errno;
+ int save_errno = errno;
- CloseTransientFile(fd);
+ BufFileClose(fd);

On error, won't these files be closed automatically?  If so, why do we
need to close them explicitly here before raising the error?

Yes, that's correct. I have fixed those.

5.
if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
{
int save_errno = errno;

BufFileClose(fd);
errno = save_errno;
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not read file \"%s\": %m",

Can we change the error message to "could not read from streaming
transactions file .." or something like that, and similarly change the
message for a failure to read the changes file?

Done

6.
if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
{
int save_errno = errno;

BufFileClose(fd);
errno = save_errno;
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not write to file \"%s\": %m",

Similar to previous, can we change it to "could not write to streaming
transactions file

BufFileWrite does not return failure anymore.

7.
@@ -2855,17 +2844,32 @@ stream_open_file(Oid subid, TransactionId xid,
bool first_segment)
* for writing, in append mode.
*/
if (first_segment)
- flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
- else
- flags = (O_WRONLY | O_APPEND | PG_BINARY);
+ {
+ /*
+ * Shared fileset handle must be allocated in the persistent context.
+ */
+ SharedFileSet *fileset =
+ MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
- stream_fd = OpenTransientFile(path, flags);
+ PrepareTempTablespaces();
+ SharedFileSetInit(fileset, NULL);

Why are we calling PrepareTempTablespaces here? It is already called
in SharedFileSetInit.

My bad.  First I tried using SharedFileSetInit, but later the code got
changed and I forgot to remove this part.

8.
+ /*
+ * Start a transaction on stream start, this transaction will be committed
+ * on the stream stop.  We need the transaction for handling the buffile,
+ * used for serializing the streaming data and subxact info.
+ */
+ ensure_transaction();

I think we need this for PrepareTempTablespaces to set the
temptablespaces.  Also, isn't it required for cleanup of buffile
resources at the transaction end?  Are there any other reasons for it
as well?  The comment should be a bit clearer about why we need a
transaction here.

I am not sure it makes sense to add a comment here about why buffile
and sharedfileset need a transaction.  Do you think we should instead
add a comment to the buffile/shared fileset API saying that it must be
called under a transaction?
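
For what it's worth, the pattern being discussed looks roughly like this
(simplified from the quoted hunks; error handling and the path/nsubxacts
bookkeeping are elided, so treat it as a sketch):

/* A transaction must be open: the fileset and buffile are registered
 * with the current resource owner, which also guarantees cleanup on
 * error. */
ensure_transaction();

/* The handle must outlive the transaction, hence the persistent context. */
SharedFileSet *fileset =
	MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));

SharedFileSetInit(fileset, NULL);	/* NULL: no DSM segment attached */

BufFile    *fd = BufFileCreateShared(fileset, path);

BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
BufFileClose(fd);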

9.
* Open a file for streamed changes from a toplevel transaction identified
* by stream_xid (global variable). If it's the first chunk of streamed
* changes for this transaction, perform cleanup by removing existing
* files after a possible previous crash.
..
stream_open_file(Oid subid, TransactionId xid, bool first_segment)

The above part comment atop stream_open_file needs to be changed after
new implementation.

Done

10.
* enabled. This context is reeset on each stream stop.
*/
LogicalStreamingContext = AllocSetContextCreate(ApplyContext,

/reeset/reset

Done

11.
stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
{
..
+ /* No entry created for this xid so simply return. */
+ if (ent == NULL)
+ return;
..
}

Is there any reason or scenario where this ent can be NULL? If not,
it will be better to have an Assert for the same.

Right, it should be an Assert; even if all the changes of the top
transaction are ignored, we should still have sent the stream_start.

12.
subxact_info_write(Oid subid, TransactionId xid)
{
..
+ /*
+ * If there is no subtransaction then nothing to do,  but if already have
+ * subxact file then delete that.
+ */
+ if (nsubxacts == 0)
{
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create file \"%s\": %m",
- path)));
+ if (ent->subxact_fileset)
+ {
+ cleanup_subxact_info();
+ BufFileDeleteShared(ent->subxact_fileset, path);
+ ent->subxact_fileset = NULL;
..
}

Here don't we need to free the subxact_fileset before setting it to NULL?

Yes, done

13.
+ /*
+ * Scan complete hash and delete the underlying files for the the xids.
+ * Also delete the memory for the shared file sets.
+ */

/the the/the. Instead of "delete the memory", it would be better to
say "release the memory".

Done

14.
+ /*
+ * We might not have created the suxact fileset if there is no sub
+ * transaction.
+ */

/suxact/subxact

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#377Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#373)
2 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yes, I have made the changes.  Basically, now I am only using
XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
So whenever we get a new set of XLOG_XACT_INVALIDATIONS, we directly
append it to txn->invalidations.  I have tested the XLOG_INVALIDATIONS
part, but while sending this mail I realized that we could write an
automated test for it.

Can you share how you have tested it?

I will work on
that soon.

Cool, I think having a regression test for this will be a good idea.

@@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
txn->invalidations);
- else
- Assert(txn->ninvalidations == 0);

Why is this Assert removed?

Apart from the above, I have made a number of changes in
0002-WAL-Log-invalidations-at-command-end-with-wal_le: removed some
unnecessary changes, edited comments, ran pgindent, and updated the
commit message.  If you are fine with these changes, then do include
them in your next version.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v28-0001-Immediately-WAL-log-subtransaction-and-top-level.amit.patch
From 834be71cba6087a75f1a43456eaf3d5d29973557 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v28 1/2] Immediately WAL-log subtransaction and top-level XID
 association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead), but only when
wal_level=logical.  We cannot remove the existing XLOG_XACT_ASSIGNMENT
WAL record, as that is required for avoiding overflow in the hot
standby snapshot.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 44 ++++++++++++++--------------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 905dc7d..a93fb8a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5120,6 +5122,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6022,3 +6025,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL log the top-level XID for an
+ * operation in a subtransaction.  We require that for logical decoding, see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have an XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f..c526bb1 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4..a757bac 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1197,6 +1197,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1235,6 +1236,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..0c0c371 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index db19187..aef8555 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 347a38f..a5468c1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b0f2a6e..b976882 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -304,6 +306,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

v28-0002-WAL-Log-invalidations-at-command-end-with-wal_le.amit.patch (application/octet-stream)
From cbd4944fe9bdef59bf1254203fd47be30da18b64 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v28 2/2] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end use a new xlog record type
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in the top-level transaction, and
executed during replay.  This obviates the need to decode the
invalidations as part of a commit record.

LogStandbyInvalidations was accumulating all the invalidations in memory
and then writing them only once, at commit time, which may reduce the
performance impact by amortizing the overhead and deduplicating the
invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 40 +++++++++++++++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 54 ++++++++++++++++++-----
 src/backend/utils/cache/inval.c                 | 56 ++++++++++++++++++++++++
 src/include/access/xact.h                       | 13 +++++-
 src/include/replication/reorderbuffer.h         |  3 ++
 7 files changed, 196 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..638efc5 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -20,6 +20,9 @@
 #include "storage/standbydefs.h"
 #include "utils/timestamp.h"
 
+static void xact_desc_invalidations(StringInfo buf,
+									int nmsgs, SharedInvalidationMessage *msgs);
+
 /*
  * Parse the WAL format of an xact commit and abort records into an easier to
  * understand format.
@@ -396,6 +399,12 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		xact_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs);
+	}
 }
 
 const char *
@@ -423,7 +432,38 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
 }
+
+static void
+xact_desc_invalidations(StringInfo buf,
+						int nmsgs, SharedInvalidationMessage *msgs)
+{
+	int			i;
+
+	appendStringInfoString(buf, "; inval msgs:");
+	for (i = 0; i < nmsgs; i++)
+	{
+		SharedInvalidationMessage *msg = &msgs[i];
+
+		if (msg->id >= 0)
+			appendStringInfo(buf, " catcache %d", msg->id);
+		else if (msg->id == SHAREDINVALCATALOG_ID)
+			appendStringInfo(buf, " catalog %u", msg->cat.catId);
+		else if (msg->id == SHAREDINVALRELCACHE_ID)
+			appendStringInfo(buf, " relcache %u", msg->rc.relId);
+		else if (msg->id == SHAREDINVALSMGR_ID)
+			appendStringInfoString(buf, " smgr");
+		else if (msg->id == SHAREDINVALRELMAP_ID)
+			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
+		else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+			appendStringInfo(buf, " snapshot %u", msg->sn.relId);
+		else
+			appendStringInfo(buf, " unrecognized id %d", msg->id);
+	}
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a93fb8a..d93b40f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6022,6 +6022,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX We ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..7153eba 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions;
+				 * otherwise, accumulate them so that they can be processed
+				 * at commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 642a1c7..8f81a60 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -860,6 +860,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2012,8 +2015,6 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
 		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
 										   txn->invalidations);
-	else
-		Assert(txn->ninvalidations == 0);
 
 	/* remove potential on-disk data, and deallocate */
 	ReorderBufferCleanupTXN(rb, txn);
@@ -2205,7 +2206,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases where we skip processing the
+ * transaction (see ReorderBufferForget), we need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2216,17 +2221,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2254,6 +2277,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark the top-level transaction as having catalog changes too if one
+	 * of its children has them, so that ReorderBufferBuildTupleCidHash can
+	 * conveniently check just the top-level transaction and decide whether
+	 * to build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..7d4fd9f 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *	CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +214,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1094,6 +1100,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1512,48 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..ac3f5e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..74ffe78 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
-- 
1.8.3.1

#378Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#377)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yes, I have made the changes. Basically, now I am only using the
XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
are directly appending it to the txn->invalidations. I have tested
the XLOG_INVALIDATIONS part but while sending this mail I realized
that we could write some automated test for the same.

Can you share how you have tested it?

I will work on
that soon.

Cool, I think having a regression test for this will be a good idea.

Other than the above tests, can we somehow verify that the invalidations
generated at commit time are the same as what we generate with this patch?
We have verified this with individual commands, but it would be great if
we could verify it for the regression tests as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#379Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#377)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yes, I have made the changes. Basically, now I am only using the
XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
are directly appending it to the txn->invalidations. I have tested
the XLOG_INVALIDATIONS part but while sending this mail I realized
that we could write some automated test for the same.

Can you share how you have tested it?

I just ran create index concurrently and decoded the changes.

I will work on
that soon.

Cool, I think having a regression test for this will be a good idea.

ok

@@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
txn->invalidations);
- else
- Assert(txn->ninvalidations == 0);

Why this Assert is removed?

Even if the base_snapshot is NULL, now we are collecting the
txn->invalidations. However, we haven't done any activity for that
transaction, so we don't need to execute the invalidations, same as the
code before, but the assert is no longer valid.

Apart from above, I have made a number of changes in
0002-WAL-Log-invalidations-at-command-end-with-wal_le to remove some
unnecessary changes, edited comments, ran pgindent and updated the
commit message. If you are fine with these changes, then do include
them in your next version.

Thanks, I will check those.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#380Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#379)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jun 22, 2020 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yes, I have made the changes. Basically, now I am only using the
XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
are directly appending it to the txn->invalidations. I have tested
the XLOG_INVALIDATIONS part but while sending this mail I realized
that we could write some automated test for the same.

Can you share how you have tested it?

I just ran create index concurrently and decoded the changes.

Hmm, I think that won't reproduce the exact problem. What I wanted
was to run another command after "create index concurrently" that
depends on it, and to see if the decoding fails once the
XLOG_INVALIDATIONS code is removed. Once you get a failure, you can
apply the 0002 patch and see if the test passes.

@@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
txn->invalidations);
- else
- Assert(txn->ninvalidations == 0);

Why this Assert is removed?

Even if the base_snapshot is NULL, now we are collecting the
txn->invalidations.

But there doesn't seem to be any check even before this patch which
directly prohibits accumulating invalidations in DecodeCommit. We
have a check for base_snapshot in ReorderBufferCommit. Did you get any
failure with that check?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#381Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#380)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 22, 2020 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yes, I have made the changes. Basically, now I am only using the
XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
are directly appending it to the txn->invalidations. I have tested
the XLOG_INVALIDATIONS part but while sending this mail I realized
that we could write some automated test for the same.

Can you share how you have tested it?

I just ran create index concurrently and decoded the changes.

Hmm, I think that won't reproduce the exact problem. What I wanted
was to run another command after "create index concurrently" that
depends on it, and to see if the decoding fails once the
XLOG_INVALIDATIONS code is removed. Once you get a failure, you can
apply the 0002 patch and see if the test passes.

Okay, I will test that.

@@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
txn->invalidations);
- else
- Assert(txn->ninvalidations == 0);

Why this Assert is removed?

Even if the base_snapshot is NULL, now we are collecting the
txn->invalidations.

But there doesn't seem to be any check even before this patch which
directly prohibits accumulating invalidations in DecodeCommit. We
have a check for base_snapshot in ReorderBufferCommit. Did you get any
failure with that check?

Because earlier, ReorderBufferForget for the toptxn would be called if
the top transaction was aborted, and in the abort case we do not log
any invalidations, so that count would be 0. However, the same is not
true now.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#382Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#381)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jun 22, 2020 at 6:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

@@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
txn->invalidations);
- else
- Assert(txn->ninvalidations == 0);

Why this Assert is removed?

Even if the base_snapshot is NULL, now we are collecting the
txn->invalidations.

But there doesn't seem to be any check even before this patch which
directly prohibits accumulating invalidations in DecodeCommit. We
have a check for base_snapshot in ReorderBufferCommit. Did you get any
failure with that check?

Because earlier, ReorderBufferForget for the toptxn would be called if
the top transaction was aborted, and in the abort case we do not log
any invalidations, so that count would be 0. However, the same is not
true now.

AFAICS, ReorderBufferForget() is called (via DecodeCommit) only when
we need to skip the transaction. It doesn't seem to be called from the
Abort path (DecodeAbort/ReorderBufferAbort doesn't use
ReorderBufferForget). I am not sure which code path you are referring
to here; can you please share the code flow you have in mind?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#383Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#382)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 23, 2020 at 8:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 22, 2020 at 6:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

@@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
txn->invalidations);
- else
- Assert(txn->ninvalidations == 0);

Why this Assert is removed?

Even if the base_snapshot is NULL, now we are collecting the
txn->invalidations.

But there doesn't seem to be any check even before this patch which
directly prohibits accumulating invalidations in DecodeCommit. We
have a check for base_snapshot in ReorderBufferCommit. Did you get any
failure with that check?

Because earlier, ReorderBufferForget for the toptxn would be called if
the top transaction was aborted, and in the abort case we do not log
any invalidations, so that count would be 0. However, the same is not
true now.

AFAICS, ReorderBufferForget() is called (via DecodeCommit) only when
we need to skip the transaction. It doesn't seem to be called from the
Abort path (DecodeAbort/ReorderBufferAbort doesn't use
ReorderBufferForget). I am not sure which code path you are referring
to here; can you please share the code flow you have in mind?

I think you are right; during some intermediate code change, it
crashed on that assert (I guess I might have been adding invalidations
to the sub-transaction, but I am not sure what that state was), and I
assumed that was the reason, as explained above, but now I see my
assumption was wrong. I will put back that assert. In testing, I could
not hit any case where we reach that assert, even after my changes;
still, I will give it more thought in case our situation somehow
differs from the base code.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#384Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#383)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 23, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 23, 2020 at 8:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 22, 2020 at 6:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

@@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
txn->invalidations);
- else
- Assert(txn->ninvalidations == 0);

Why this Assert is removed?

Even if the base_snapshot is NULL, now we are collecting the
txn->invalidation.

But there doesn't seem to be any check even before this patch which
directly prohibits accumulating invalidations in DecodeCommit. We
have check for base_snapshot in ReorderBufferCommit. Did you get any
failure with that check?

Because earlier ReorderBufferForget for toptxn will be called if the
top transaction is aborted and in abort case, we are not logging any
invalidation so that will be 0. However same is not true now.

AFAICS, ReorderBufferForget() is called (via DecodeCommit) only when
we need to skip the transaction. It doesn't seem to be called from
Abort path (DecodeAbort/ReorderBufferAbort doesn't use
ReorderBufferForget). I am not sure which code path are you referring
here, can you please share the code flow which you are referring to
here.

I think you are right; during some intermediate code change, it
crashed on that assert (I guess I might have been adding invalidations
to the sub-transaction, but I am not sure what that state was), and I
assumed that was the reason, as explained above, but now I see my
assumption was wrong. I will put back that assert. In testing, I could
not hit any case where we reach that assert, even after my changes;
still, I will give it more thought in case our situation somehow
differs from the base code.

Here is the POC patch to discuss the idea of a cleanup of shared
filesets on proc exit. As discussed offlist, here I am maintaining
a list of shared filesets. The first time, when the list is NULL, I am
registering the cleanup function with the on_proc_exit routine. After
that, for each subsequent fileset, I am just appending it to filesetlist.
There is also an interface to unregister the shared fileset from the
cleanup list, and that is done by the caller whenever we are deleting
the shared fileset manually. While explaining it here, I think there
could be one issue: if we delete all the elements, the list will
become NULL, and on the next SharedFileSetInit we will again register
the function. Maybe that is not a problem, but we can avoid registering
multiple times by using some flag in the file
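
In code terms, a minimal sketch of that lifecycle, using the function
names from the attached POC (see the patch below for the real code), is:

/* backend-local use: pass seg == NULL */
SharedFileSetInit(fileset, NULL);
/*
 * The first such call registers SharedFileSetOnProcExit() via
 * on_proc_exit(); every call appends the fileset to the backend-local
 * filesetlist.
 */

/* manual deletion path: drop the entry so the proc-exit callback skips it */
SharedFileSetUnregister(fileset);

/*
 * Otherwise, at process exit, SharedFileSetOnProcExit() walks the
 * remaining filesetlist entries and calls SharedFileSetDeleteAll() on
 * each of them.
 */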

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

poc_shared_fileset_cleanup_on_procexit.patch (application/octet-stream)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 14e057c..53c1fe8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1946,49 +1946,6 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
- * Cleanup function.
- *
- * Called on logical replication worker exit.
- */
-static void
-worker_onexit(int code, Datum arg)
-{
-	HASH_SEQ_STATUS status;
-	StreamXidHash *ent;
-	char		path[MAXPGPATH];
-
-	/* nothing to clean */
-	if (xidhash == NULL)
-		return;
-
-	/*
-	 * Scan complete hash and delete the underlying files for the xids.
-	 * Also release the memory for the shared file sets.
-	 */
-	hash_seq_init(&status, xidhash);
-	while ((ent = (StreamXidHash *) hash_seq_search(&status)) != NULL)
-	{
-		changes_filename(path, MyLogicalRepWorker->subid, ent->xid);
-		BufFileDeleteShared(ent->stream_fileset, path);
-		pfree(ent->stream_fileset);
-
-		/*
-		 * We might not have created the subxact fileset if there is no sub
-		 * transaction.
-		 */
-		if (ent->subxact_fileset)
-		{
-			subxact_filename(path, MyLogicalRepWorker->subid, ent->xid);
-			BufFileDeleteShared(ent->subxact_fileset, path);
-			pfree(ent->subxact_fileset);
-		}
-	}
-
-	/* Remove the xid hash */
-	hash_destroy(xidhash);
-}
-
-/*
  * Apply main loop.
  */
 static void
@@ -2012,9 +1969,6 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 													"LogicalStreamingContext",
 													ALLOCSET_DEFAULT_SIZES);
 
-	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
-	before_shmem_exit(worker_onexit, (Datum) 0);
-
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -2503,6 +2457,7 @@ subxact_info_write(Oid subid, TransactionId xid)
 		{
 			cleanup_subxact_info();
 			BufFileDeleteShared(ent->subxact_fileset, path);
+			SharedFileSetUnregister(ent->subxact_fileset);
 			pfree(ent->subxact_fileset);
 			ent->subxact_fileset = NULL;
 		}
@@ -2515,10 +2470,13 @@ subxact_info_write(Oid subid, TransactionId xid)
 	 */
 	if (ent->subxact_fileset == NULL)
 	{
-		ent->subxact_fileset =
-			MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+		MemoryContext oldctx;
 
+		/* Shared fileset handle must be allocated in the persistent context */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
 		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
 		fd = BufFileCreateShared(ent->subxact_fileset, path);
 	}
 	else
@@ -2726,6 +2684,7 @@ stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
 	/* Delete the change file and release the stream fileset memory */
 	changes_filename(path, subid, xid);
 	BufFileDeleteShared(ent->stream_fileset, path);
+	SharedFileSetUnregister(ent->stream_fileset);
 	pfree(ent->stream_fileset);
 
 	/* Delete the subxact file and release the memory, if it exist */
@@ -2733,6 +2692,7 @@ stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
 	{
 		subxact_filename(path, subid, xid);
 		BufFileDeleteShared(ent->subxact_fileset, path);
+		SharedFileSetUnregister(ent->subxact_fileset);
 		pfree(ent->subxact_fileset);
 	}
 }
@@ -2784,13 +2744,15 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 	 */
 	if (first_segment)
 	{
-		/*
-		 * Shared fileset handle must be allocated in the persistent context.
-		 */
-		SharedFileSet *fileset =
-		MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+		MemoryContext oldctx;
+		SharedFileSet *fileset;
 
+		/* Shared fileset handle must be allocated in the persistent context */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
 		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
 		stream_fd = BufFileCreateShared(fileset, path);
 
 		/* Remember the fileset for the next stream of the same transaction */
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index c81d298..9bfe71c 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -25,10 +25,14 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List * filesetlist = NULL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	/* Register our cleanup callback. */
 	if (seg)
 		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		if (filesetlist == NULL)
+			on_proc_exit(SharedFileSetOnProcExit, 0);
+
+		filesetlist = lcons((void *)fileset, filesetlist);
+	}
 }
 
 /*
@@ -214,6 +225,52 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on process exit.  This walks
+ * the list of all the registered sharedfilesets and deletes the
+ * underlying files.
+ */
+static void
+SharedFileSetOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending  shared fileset entry */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+}
+
+/*
+ * Unregister the shared fileset entry, registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	Assert(filesetlist != NULL);
+
+	/* Loop over all the pending shared fileset entry */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* remove the entry from the list and delete the underlying files */
+		if (input_fileset->number == fileset->number)
+		{
+			SharedFileSetDeleteAll(fileset);
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index b2f4ba4..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -42,5 +42,6 @@ extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
#385Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#384)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Here is the POC patch to discuss the idea of a cleanup of shared
filesets on proc exit. As discussed offlist, here I am maintaining
a list of shared filesets. The first time, when the list is NULL, I am
registering the cleanup function with the on_proc_exit routine. After
that, for each subsequent fileset, I am just appending it to filesetlist.
There is also an interface to unregister the shared fileset from the
cleanup list, and that is done by the caller whenever we are deleting
the shared fileset manually. While explaining it here, I think there
could be one issue: if we delete all the elements, the list will
become NULL, and on the next SharedFileSetInit we will again register
the function. Maybe that is not a problem, but we can avoid registering
multiple times by using some flag in the file

I don't understand what you mean by "using some flag in the file".

Review comments on various patches.

poc_shared_fileset_cleanup_on_procexit
=================================
1.
- ent->subxact_fileset =
- MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+ MemoryContext oldctx;
+ /* Shared fileset handle must be allocated in the persistent context */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+ ent->subxact_fileset = palloc(sizeof(SharedFileSet));
  SharedFileSetInit(ent->subxact_fileset, NULL);
+ MemoryContextSwitchTo(oldctx);
  fd = BufFileCreateShared(ent->subxact_fileset, path);

Why is this change required for this patch, and why do we only cover
SharedFileSetInit in the Apply context and not BufFileCreateShared?
The comment is also not very clear on this point.

2.
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+ bool found = false;
+ ListCell *l;
+
+ Assert(filesetlist != NULL);
+
+ /* Loop over all the pending shared fileset entry */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+ /* remove the entry from the list and delete the underlying files */
+ if (input_fileset->number == fileset->number)
+ {
+ SharedFileSetDeleteAll(fileset);
+ filesetlist = list_delete_cell(filesetlist, l);

Why are we calling SharedFileSetDeleteAll here when in the caller we
have already deleted the fileset as per below code?
BufFileDeleteShared(ent->stream_fileset, path);
+ SharedFileSetUnregister(ent->stream_fileset);

I think it will be good if somehow we can remove the fileset from
filesetlist during BufFileDeleteShared. If that is possible, then we
don't need a separate API for SharedFileSetUnregister.

3.
+static List * filesetlist = NULL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const
char *name);
 static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
  /* Register our cleanup callback. */
  if (seg)
  on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ else
+ {
+ if (filesetlist == NULL)
+ on_proc_exit(SharedFileSetOnProcExit, 0);

We use NIL for list initialization and comparison. See lock_files usage.
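
For example, the same snippet following that convention would be:

static List *filesetlist = NIL;
...
if (filesetlist == NIL)
    on_proc_exit(SharedFileSetOnProcExit, 0);

filesetlist = lcons((void *) fileset, filesetlist);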

4.
+SharedFileSetOnProcExit(int status, Datum arg)
+{
+ ListCell *l;
+
+ /* Loop over all the pending  shared fileset entry */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+ SharedFileSetDeleteAll(fileset);
+ }

We can reset filesetlist to NIL after the for loop, as it will
make the code look cleaner.

Comments on other patches:
=========================
5.

3. On concurrent abort we are truncating all the changes, including
some incomplete changes, so later when we get the complete changes we
don't have the previous ones; e.g., if we had a specinsert in the
last stream and, due to concurrent-abort detection, we delete those
changes, later we will get a spec_confirm without the spec insert. We
could have simply avoided deleting all the changes, but I think the
better fix is: once we detect the concurrent abort for a transaction,
why collect the changes for it at all? We can simply avoid that. So I
have put in that fix. (0006)

On similar lines, I think we need to skip processing the message; see
the else part of the code in ReorderBufferQueueMessage.

6.
In v29-0002-Issue-individual-invalidations-with-wal_level-lo,
xact_desc_invalidations seems to be a subset of
standby_desc_invalidations; can we have common code for them?
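
One possible shape for that common code (a sketch only; the exported
name and its location are assumptions, say in standbydesc.c and exposed
via standbydefs.h, which the 0002 patch already makes xactdesc.c
include; the body reuses the per-message dispatch shown in that patch):

void
standby_desc_invalidations(StringInfo buf,
                           int nmsgs, SharedInvalidationMessage *msgs)
{
    int         i;

    appendStringInfoString(buf, "; inval msgs:");
    for (i = 0; i < nmsgs; i++)
    {
        SharedInvalidationMessage *msg = &msgs[i];

        if (msg->id >= 0)
            appendStringInfo(buf, " catcache %d", msg->id);
        else if (msg->id == SHAREDINVALCATALOG_ID)
            appendStringInfo(buf, " catalog %u", msg->cat.catId);
        else if (msg->id == SHAREDINVALRELCACHE_ID)
            appendStringInfo(buf, " relcache %u", msg->rc.relId);
        else if (msg->id == SHAREDINVALSMGR_ID)
            appendStringInfoString(buf, " smgr");
        else if (msg->id == SHAREDINVALRELMAP_ID)
            appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
        else if (msg->id == SHAREDINVALSNAPSHOT_ID)
            appendStringInfo(buf, " snapshot %u", msg->sn.relId);
        else
            appendStringInfo(buf, " unrecognized id %d", msg->id);
    }
}

Both xact_desc and standby_desc would then call this routine instead of
carrying their own copies of the dispatch.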

7.
I think we can avoid sending v29-0007-Track-statistics-for-streaming
each time; we can do this after the main patch is complete.
Also, we might need to change how and where these stats will be
tracked. See the related discussion [1].

8. In v29-0005-Implement-streaming-mode-in-ReorderBuffer,
* Return oldest transaction in reorderbuffer
@@ -863,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb,
TransactionId xid,
/* set the reference to top-level transaction */
subtxn->toptxn = txn;

+ /* set the reference to toplevel transaction */
+ subtxn->toptxn = txn;
+

There is a double initialization of subtxn->toptxn. You need to
remove this line from the 0005 patch, as we have now added it in an
earlier patch.

9. I think you forgot to update the patch to execute invalidations in
the Abort case, or I might be missing something. I don't see any changes
in ReorderBufferAbort. You have agreed in one of the emails above [2]
about handling the same.

10. In v29-0008-Add-support-for-streaming-to-built-in-replicatio,
 apply_handle_stream_commit(StringInfo s)
 {
 ..
 + /*
 + * send feedback to upstream
 + *
 + * XXX Probably should send a valid LSN. But which one?
 + */
 + send_feedback(InvalidXLogRecPtr, false, false);
 ..
 }

I have given a comment on this code that we don't need this feedback,
and you mentioned on June 02 [3] that you would think about it and let
me know your opinion, but I don't see a response from you yet. Can you
get back to me regarding this point?

11. Add some comments as to why we have used the shared BufFile
interface instead of the temp BufFile interface.
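
For instance, the comment could read something like this (a sketch
only; the reasons are the ones that came up in this thread):

/*
 * We use the shared BufFile (SharedFileSet) interface rather than plain
 * temporary BufFiles because the serialized changes must survive the
 * transaction in which they were written: each stream start/stop runs
 * in its own transaction, while the file must live until the streamed
 * transaction is finally committed or aborted, and temporary BufFiles
 * are cleaned up at the end of the transaction that created them.
 */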

12. In v29-0013-Change-buffile-interface-required-for-streaming,
+ * Initialize a space for temporary files that can be opened other backends.

/opened other backends/opened for access by other backends

[1]: /messages/by-id/CA+fd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg@mail.gmail.com
[2]: /messages/by-id/CAFiTN-t7WZZjFrAjSYj4fu=FZ2JKENN8ZHCUZaw-srnrHMWMrg@mail.gmail.com
[3]: /messages/by-id/CAFiTN-tHpd+zXVemo9WqQUJS50p9m8jD=AWjsugKZQ4F-K8Pbw@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#386Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#376)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jun 22, 2020 at 11:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 16, 2020 at 2:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

8.
+ /*
+ * Start a transaction on stream start, this transaction will be committed
+ * on the stream stop.  We need the transaction for handling the buffile,
+ * used for serializing the streaming data and subxact info.
+ */
+ ensure_transaction();

I think we need this for PrepareTempTablespaces to set the
temptablespaces. Also, isn't it required for a cleanup of buffile
resources at the transaction end? Are there any other reasons for it
as well? The comment should be a bit more clear for why we need a
transaction here.

I am not sure whether it will make sense to add a comment here about
why buffile and sharedfileset need a transaction.

You can say that the usage of the BufFile interface expects us to be
in a transaction for such-and-such reasons....
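
For instance, something along these lines (a sketch; the reasons are
the ones mentioned above):

/*
 * Start a transaction on stream start; it is committed on stream stop.
 * The BufFile infrastructure expects to be used inside a transaction:
 * PrepareTempTablespaces() needs one to resolve the temp_tablespaces
 * GUC, and the underlying file resources are released as part of
 * (sub)transaction cleanup.
 */
ensure_transaction();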

Do you think that we
should add a comment in the buffile/shared fileset API that it should
be called under a transaction?

I am fine with that as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#387Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#385)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Here is the POC patch to discuss the idea of a cleanup of shared
filesets on proc exit. As discussed offlist, here I am maintaining
a list of shared filesets. The first time, when the list is NULL, I am
registering the cleanup function with the on_proc_exit routine. After
that, for each subsequent fileset, I am just appending it to filesetlist.
There is also an interface to unregister the shared fileset from the
cleanup list, and that is done by the caller whenever we are deleting
the shared fileset manually. While explaining it here, I think there
could be one issue: if we delete all the elements, the list will
become NULL, and on the next SharedFileSetInit we will again register
the function. Maybe that is not a problem, but we can avoid registering
multiple times by using some flag in the file

I don't understand what you mean by "using some flag in the file".

Basically, in the POC, as shown in the code snippet below, we register
the on_proc_exit function only if the "filesetlist" is NULL. But, as
described above, if all the items are deleted, the list will become
NULL again. So I suggested that instead of checking whether filesetlist
is NULL, we can have just a boolean variable recording that we have
registered the callback, so that we don't do it again.

@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
  /* Register our cleanup callback. */
  if (seg)
  on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ else
+ {
+ if (filesetlist == NULL)
+ on_proc_exit(SharedFileSetOnProcExit, 0);
+
+ filesetlist = lcons((void *)fileset, filesetlist);
+ }
 }

Review comments on various patches.

poc_shared_fileset_cleanup_on_procexit
=================================
1.
- ent->subxact_fileset =
- MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+ MemoryContext oldctx;
+ /* Shared fileset handle must be allocated in the persistent context */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+ ent->subxact_fileset = palloc(sizeof(SharedFileSet));
SharedFileSetInit(ent->subxact_fileset, NULL);
+ MemoryContextSwitchTo(oldctx);
fd = BufFileCreateShared(ent->subxact_fileset, path);

Why is this change required for this patch, and why do we only cover
SharedFileSetInit in the Apply context and not BufFileCreateShared?
The comment is also not very clear on this point.

Because only the sharedfileset, and the filesetlist which is allocated
under SharedFileSetInit, are required in the permanent context.
BufFileCreateShared only creates the BufFile and VFD, which are
required only within the current stream, so the transaction context is
enough.
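
A sketch of that split, following the POC (here ApplyContext is the
long-lived, per-worker context, and ent/path/fd are the POC's own
variables):

MemoryContext oldctx;

/*
 * Long-lived: the SharedFileSet handle, and the filesetlist entry that
 * SharedFileSetInit() creates, must survive beyond the current stream.
 */
oldctx = MemoryContextSwitchTo(ApplyContext);
ent->subxact_fileset = palloc(sizeof(SharedFileSet));
SharedFileSetInit(ent->subxact_fileset, NULL);
MemoryContextSwitchTo(oldctx);

/*
 * Short-lived: the BufFile and its VFDs are needed only for the current
 * stream, so the transaction-local context is enough.
 */
fd = BufFileCreateShared(ent->subxact_fileset, path);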

2.
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+ bool found = false;
+ ListCell *l;
+
+ Assert(filesetlist != NULL);
+
+ /* Loop over all the pending shared fileset entry */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+ /* remove the entry from the list and delete the underlying files */
+ if (input_fileset->number == fileset->number)
+ {
+ SharedFileSetDeleteAll(fileset);
+ filesetlist = list_delete_cell(filesetlist, l);

Why are we calling SharedFileSetDeleteAll here when in the caller we
have already deleted the fileset as per below code?
BufFileDeleteShared(ent->stream_fileset, path);
+ SharedFileSetUnregister(ent->stream_fileset);

I think it will be good if somehow we can remove the fileset from
filesetlist during BufFileDeleteShared. If that is possible, then we
don't need a separate API for SharedFileSetUnregister.

But the filesetlist is maintained at the sharedfileset level, so even
if we delete from BufFileDeleteShared, we need to call an API from the
sharedfileset layer to unregister the fileset. Am I missing
something?

3.
+static List * filesetlist = NULL;
+
static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetOnProcExit(int status, Datum arg);
static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
tablespace);
static void SharedFilePath(char *path, SharedFileSet *fileset, const
char *name);
static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
/* Register our cleanup callback. */
if (seg)
on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ else
+ {
+ if (filesetlist == NULL)
+ on_proc_exit(SharedFileSetOnProcExit, 0);

We use NIL for list initialization and comparison. See lock_files usage.

Right.

4.
+SharedFileSetOnProcExit(int status, Datum arg)
+{
+ ListCell *l;
+
+ /* Loop over all the pending  shared fileset entry */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+ SharedFileSetDeleteAll(fileset);
+ }

We can reset filesetlist to NIL after the for loop, as it will
make the code look cleaner.

ok

Thanks for your feedback on this. I will reply to other comments separately.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#388Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#387)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jun 24, 2020 at 4:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Here is the POC patch to discuss the idea of a cleanup of shared
filesets on proc exit. As discussed offlist, here I am maintaining
a list of shared filesets. The first time, when the list is NULL, I am
registering the cleanup function with the on_proc_exit routine. After
that, for each subsequent fileset, I am just appending it to filesetlist.
There is also an interface to unregister the shared fileset from the
cleanup list, and that is done by the caller whenever we are deleting
the shared fileset manually. While explaining it here, I think there
could be one issue: if we delete all the elements, the list will
become NULL, and on the next SharedFileSetInit we will again register
the function. Maybe that is not a problem, but we can avoid registering
multiple times by using some flag in the file

I don't understand what you mean by "using some flag in the file".

Basically, in the POC, as shown in the code snippet below, we register
the on_proc_exit function only if the "filesetlist" is NULL. But, as
described above, if all the items are deleted, the list will become
NULL again. So I suggested that instead of checking whether filesetlist
is NULL, we can have just a boolean variable recording that we have
registered the callback, so that we don't do it again.

Check whether there is any precedent for this in the code.

Review comments on various patches.

poc_shared_fileset_cleanup_on_procexit
=================================
1.
- ent->subxact_fileset =
- MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+ MemoryContext oldctx;
+ /* Shared fileset handle must be allocated in the persistent context */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+ ent->subxact_fileset = palloc(sizeof(SharedFileSet));
SharedFileSetInit(ent->subxact_fileset, NULL);
+ MemoryContextSwitchTo(oldctx);
fd = BufFileCreateShared(ent->subxact_fileset, path);

Why is this change required for this patch, and why do we only cover
SharedFileSetInit in the Apply context and not BufFileCreateShared?
The comment is also not very clear on this point.

Because only the sharedfileset, and the filesetlist which is
allocated under SharedFileSetInit, are required in the permanent
context. BufFileCreateShared only creates the BufFile and VFD, which
are required only within the current stream, so the transaction
context is enough.

Okay, then add some more comments to explain it, or if you have
explained it elsewhere, add a reference to that explanation.

2.
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+ bool found = false;
+ ListCell *l;
+
+ Assert(filesetlist != NULL);
+
+ /* Loop over all the pending shared fileset entries */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+ /* remove the entry from the list and delete the underlying files */
+ if (input_fileset->number == fileset->number)
+ {
+ SharedFileSetDeleteAll(fileset);
+ filesetlist = list_delete_cell(filesetlist, l);

Why are we calling SharedFileSetDeleteAll here when in the caller we
have already deleted the fileset, as per the code below?
BufFileDeleteShared(ent->stream_fileset, path);
+ SharedFileSetUnregister(ent->stream_fileset);

I think it will be good if somehow we can remove the fileset from
filesetlist during BufFileDeleteShared. If that is possible, then we
don't need a separate API for SharedFileSetUnregister.

But the filesetlist is maintained at the sharedfileset level, so even
if we do the deletion in BufFileDeleteShared, we still need to call an
API from the sharedfileset layer to unregister the fileset.

Sure, but isn't it better if we can call such an API from BufFileDeleteShared?
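
I mean something like this rough sketch, keeping the unregister logic
in the sharedfileset layer but invoking it from the buffile layer:

	/* in BufFileDeleteShared, once the segment files are gone */
	/* Unregister the fileset from the proc-exit cleanup list */
	SharedFileSetUnregister(fileset);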

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#389Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#385)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Here is the POC patch to discuss the idea of a cleanup of the shared
fileset on proc exit. As discussed offlist, here I am maintaining the
list of shared filesets. The first time, when the list is NULL, I
register the cleanup function with the on_proc_exit routine. After
that, for subsequent filesets, I just append them to filesetlist.
There is also an interface to unregister the shared fileset from the
cleanup list, and that is done by the caller whenever we delete the
shared fileset manually. While explaining it here, I think there
could be one issue: if we delete all the elements, the list will
become NULL, and on the next SharedFileSetInit we will register the
function again. Maybe that is not a problem, but we can avoid
registering multiple times by using some flag in the file

I don't understand what you mean by "using some flag in the file".

Review comments on various patches.

poc_shared_fileset_cleanup_on_procexit
=================================
1.
- ent->subxact_fileset =
- MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+ MemoryContext oldctx;
+ /* Shared fileset handle must be allocated in the persistent context */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+ ent->subxact_fileset = palloc(sizeof(SharedFileSet));
SharedFileSetInit(ent->subxact_fileset, NULL);
+ MemoryContextSwitchTo(oldctx);
fd = BufFileCreateShared(ent->subxact_fileset, path);

Why is this change required for this patch, and why do we only cover
SharedFileSetInit in the Apply context and not BufFileCreateShared?
The comment is also not very clear on this point.

Added the comments for the same.

2.
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+ bool found = false;
+ ListCell *l;
+
+ Assert(filesetlist != NULL);
+
+ /* Loop over all the pending shared fileset entries */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+ /* remove the entry from the list and delete the underlying files */
+ if (input_fileset->number == fileset->number)
+ {
+ SharedFileSetDeleteAll(fileset);
+ filesetlist = list_delete_cell(filesetlist, l);

Why are we calling SharedFileSetDeleteAll here when in the caller we
have already deleted the fileset, as per the code below?
BufFileDeleteShared(ent->stream_fileset, path);
+ SharedFileSetUnregister(ent->stream_fileset);

That's wrong; I have removed this.

I think it will be good if somehow we can remove the fileset from
filesetlist during BufFileDeleteShared. If that is possible, then we
don't need a separate API for SharedFileSetUnregister.

I have done this as discussed in the later replies; basically,
SharedFileSetUnregister is now called from BufFileDeleteShared.

3.
+static List * filesetlist = NULL;
+
static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetOnProcExit(int status, Datum arg);
static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
tablespace);
static void SharedFilePath(char *path, SharedFileSet *fileset, const
char *name);
static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
/* Register our cleanup callback. */
if (seg)
on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ else
+ {
+ if (filesetlist == NULL)
+ on_proc_exit(SharedFileSetOnProcExit, 0);

We use NIL for list initialization and comparison. See lock_files usage.

Done

4.
+SharedFileSetOnProcExit(int status, Datum arg)
+{
+ ListCell *l;
+
+ /* Loop over all the pending shared fileset entries */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+ SharedFileSetDeleteAll(fileset);
+ }

We can initialize filesetlist as NIL after the for loop as it will
make the code look cleaner.

Right.

Comments on other patches:
=========================
5.

3. On concurrent abort we are truncating all the changes, including
some incomplete changes, so later when we get the complete changes we
don't have the previous ones. E.g., if we had a specinsert in the
last stream and we delete those changes due to concurrent abort
detection, later we will get a spec_confirm without the spec insert.
We could have simply avoided deleting all the changes, but I think
the better fix is: once we detect the concurrent abort for any
transaction, why do we need to collect the changes for it at all? We
can simply avoid that. So I have put in that fix. (0006)

On similar lines, I think we need to skip processing the message; see
the else part of the code in ReorderBufferQueueMessage.

Basically, ReorderBufferQueueMessage also calls
ReorderBufferQueueChange internally for transactional changes. But,
having said that, I realize the idea of skipping the changes in
ReorderBufferQueueChange is not good, because by then we have already
allocated the memory for the change and the tuple, and it's not
correct to ReturnChanges because that will update the memory
accounting. So I think we can do it at a more centralized place,
before we process the change: maybe in LogicalDecodingProcessRecord,
before going to the switch, we can call a function from the
reorderbuffer.c layer to see whether this transaction has been
detected as aborted or not. But I have to think more along this line,
about whether we can skip all the processing of that record or not.
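
As a rough sketch of that idea (ReorderBufferXidIsAborted is a
hypothetical helper name, nothing that exists yet):

	/* in LogicalDecodingProcessRecord, before the rmgr switch */
	TransactionId	xid = XLogRecGetXid(record);

	if (TransactionIdIsValid(xid) &&
		ReorderBufferXidIsAborted(ctx->reorder, xid))
		return;		/* concurrent abort already detected, skip the record */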

Your other comments look fine to me, so I will send the next patch
set and reply to them individually.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

poc_shared_fileset_cleanup_on_procexit_v1.patch (application/octet-stream)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 14e057cfff..99cb43b1e4 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1945,49 +1945,6 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
-/*
- * Cleanup function.
- *
- * Called on logical replication worker exit.
- */
-static void
-worker_onexit(int code, Datum arg)
-{
-	HASH_SEQ_STATUS status;
-	StreamXidHash *ent;
-	char		path[MAXPGPATH];
-
-	/* nothing to clean */
-	if (xidhash == NULL)
-		return;
-
-	/*
-	 * Scan complete hash and delete the underlying files for the xids.
-	 * Also release the memory for the shared file sets.
-	 */
-	hash_seq_init(&status, xidhash);
-	while ((ent = (StreamXidHash *) hash_seq_search(&status)) != NULL)
-	{
-		changes_filename(path, MyLogicalRepWorker->subid, ent->xid);
-		BufFileDeleteShared(ent->stream_fileset, path);
-		pfree(ent->stream_fileset);
-
-		/*
-		 * We might not have created the subxact fileset if there is no sub
-		 * transaction.
-		 */
-		if (ent->subxact_fileset)
-		{
-			subxact_filename(path, MyLogicalRepWorker->subid, ent->xid);
-			BufFileDeleteShared(ent->subxact_fileset, path);
-			pfree(ent->subxact_fileset);
-		}
-	}
-
-	/* Remove the xid hash */
-	hash_destroy(xidhash);
-}
-
 /*
  * Apply main loop.
  */
@@ -2012,9 +1969,6 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 													"LogicalStreamingContext",
 													ALLOCSET_DEFAULT_SIZES);
 
-	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
-	before_shmem_exit(worker_onexit, (Datum) 0);
-
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -2515,10 +2469,18 @@ subxact_info_write(Oid subid, TransactionId xid)
 	 */
 	if (ent->subxact_fileset == NULL)
 	{
-		ent->subxact_fileset =
-			MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+		MemoryContext oldctx;
 
+		/*
+		 * Shared fileset handle must be allocated in the persistent context.
+		 * Also, SharedFileSetInit allocates the memory for the sharedfileset
+		 * list, so we need to allocate that in the long-term memory context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
 		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
 		fd = BufFileCreateShared(ent->subxact_fileset, path);
 	}
 	else
@@ -2784,13 +2746,15 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 	 */
 	if (first_segment)
 	{
-		/*
-		 * Shared fileset handle must be allocated in the persistent context.
-		 */
-		SharedFileSet *fileset =
-		MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+		MemoryContext oldctx;
+		SharedFileSet *fileset;
 
+		/* Shared fileset handle must be allocated in the persistent context */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
 		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
 		stream_fd = BufFileCreateShared(fileset, path);
 
 		/* Remember the fileset for the next stream of the same transaction */
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index bde6fa1ef3..502875a09c 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -364,6 +364,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 		CHECK_FOR_INTERRUPTS();
 	}
 
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
+
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
 }
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index c81d298fc3..3361a8274f 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -25,10 +25,14 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	/* Register our cleanup callback. */
 	if (seg)
 		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		if (filesetlist == NIL)
+			on_proc_exit(SharedFileSetOnProcExit, 0);
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -213,6 +224,57 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 		SharedFileSetDeleteAll(fileset);
 }
 
+/*
+ * Callback function that will be invoked on process exit.  This will
+ * process the list of all the sharedfilesets registered and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry, registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is following the dsm based cleanup then we don't
+	 * maintain the filesetlist so return.
+	 */
+	if (filesetlist == NULL)
+		return;
+
+	/* Loop over all the shared fileset entries to find the input fileset */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+	Assert(found);
+}
+
 /*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index b2f4ba4bd8..d5edb600af 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -42,5 +42,6 @@ extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
#390Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#389)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jun 25, 2020 at 7:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Here is the POC patch to discuss the idea of a cleanup of the shared
fileset on proc exit. As discussed offlist, here I am maintaining the
list of shared filesets. The first time, when the list is NULL, I
register the cleanup function with the on_proc_exit routine. After
that, for subsequent filesets, I just append them to filesetlist.
There is also an interface to unregister the shared fileset from the
cleanup list, and that is done by the caller whenever we delete the
shared fileset manually. While explaining it here, I think there
could be one issue: if we delete all the elements, the list will
become NULL, and on the next SharedFileSetInit we will register the
function again. Maybe that is not a problem, but we can avoid
registering multiple times by using some flag in the file

I don't understand what you mean by "using some flag in the file".

Review comments on various patches.

poc_shared_fileset_cleanup_on_procexit
=================================
1.
- ent->subxact_fileset =
- MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+ MemoryContext oldctx;
+ /* Shared fileset handle must be allocated in the persistent context */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+ ent->subxact_fileset = palloc(sizeof(SharedFileSet));
SharedFileSetInit(ent->subxact_fileset, NULL);
+ MemoryContextSwitchTo(oldctx);
fd = BufFileCreateShared(ent->subxact_fileset, path);

Why is this change required for this patch, and why do we only cover
SharedFileSetInit in the Apply context and not BufFileCreateShared?
The comment is also not very clear on this point.

Added the comments for the same.

2.
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+ bool found = false;
+ ListCell *l;
+
+ Assert(filesetlist != NULL);
+
+ /* Loop over all the pending shared fileset entries */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+ /* remove the entry from the list and delete the underlying files */
+ if (input_fileset->number == fileset->number)
+ {
+ SharedFileSetDeleteAll(fileset);
+ filesetlist = list_delete_cell(filesetlist, l);

Why are we calling SharedFileSetDeleteAll here when in the caller we
have already deleted the fileset, as per the code below?
BufFileDeleteShared(ent->stream_fileset, path);
+ SharedFileSetUnregister(ent->stream_fileset);

That's wrong; I have removed this.

I think it will be good if somehow we can remove the fileset from
filesetlist during BufFileDeleteShared. If that is possible, then we
don't need a separate API for SharedFileSetUnregister.

I have done this as discussed in the later replies; basically,
SharedFileSetUnregister is now called from BufFileDeleteShared.

3.
+static List * filesetlist = NULL;
+
static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetOnProcExit(int status, Datum arg);
static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
tablespace);
static void SharedFilePath(char *path, SharedFileSet *fileset, const
char *name);
static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
/* Register our cleanup callback. */
if (seg)
on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ else
+ {
+ if (filesetlist == NULL)
+ on_proc_exit(SharedFileSetOnProcExit, 0);

We use NIL for list initialization and comparison. See lock_files usage.

Done

4.
+SharedFileSetOnProcExit(int status, Datum arg)
+{
+ ListCell *l;
+
+ /* Loop over all the pending shared fileset entries */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+ SharedFileSetDeleteAll(fileset);
+ }

We can initialize filesetlist as NIL after the for loop as it will
make the code look cleaner.

Right.

Comments on other patches:
=========================
5.

3. On concurrent abort we are truncating all the changes, including
some incomplete changes, so later when we get the complete changes we
don't have the previous ones. E.g., if we had a specinsert in the
last stream and we delete those changes due to concurrent abort
detection, later we will get a spec_confirm without the spec insert.
We could have simply avoided deleting all the changes, but I think
the better fix is: once we detect the concurrent abort for any
transaction, why do we need to collect the changes for it at all? We
can simply avoid that. So I have put in that fix. (0006)

On similar lines, I think we need to skip processing the message; see
the else part of the code in ReorderBufferQueueMessage.

Basically, ReorderBufferQueueMessage also calls
ReorderBufferQueueChange internally for transactional changes. But,
having said that, I realize the idea of skipping the changes in
ReorderBufferQueueChange is not good, because by then we have already
allocated the memory for the change and the tuple, and it's not
correct to ReturnChanges because that will update the memory
accounting. So I think we can do it at a more centralized place,
before we process the change: maybe in LogicalDecodingProcessRecord,
before going to the switch, we can call a function from the
reorderbuffer.c layer to see whether this transaction has been
detected as aborted or not. But I have to think more along this line,
about whether we can skip all the processing of that record or not.

Your other comments look fine to me, so I will send the next patch
set and reply to them individually.

I think we cannot put this check in the higher-level functions like
LogicalDecodingProcessRecord or DecodeXXXOp, because we need to
process that xid at least for abort. So I think it is good to keep
the check inside ReorderBufferQueueChange only, and we can free the
memory of the change if the abort is detected. Also, if we just skip
those changes in ReorderBufferQueueChange, then the effect will be
localized to that particular transaction, which is already aborted.
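
A minimal sketch of that approach, assuming a concurrent_abort flag on
ReorderBufferTXN as per the 0006 fix (the exact memory-accounting
handling needs care, as discussed above):

	/* in ReorderBufferQueueChange, after looking up the txn */
	if (txn->concurrent_abort)
	{
		/* known-aborted transaction, free the change instead of queueing it */
		ReorderBufferReturnChange(rb, change);
		return;
	}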

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#391Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#390)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jun 26, 2020 at 10:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jun 25, 2020 at 7:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Comments on other patches:
=========================
5.

3. On concurrent abort we are truncating all the changes, including
some incomplete changes, so later when we get the complete changes we
don't have the previous ones. E.g., if we had a specinsert in the
last stream and we delete those changes due to concurrent abort
detection, later we will get a spec_confirm without the spec insert.
We could have simply avoided deleting all the changes, but I think
the better fix is: once we detect the concurrent abort for any
transaction, why do we need to collect the changes for it at all? We
can simply avoid that. So I have put in that fix. (0006)

On similar lines, I think we need to skip processing the message; see
the else part of the code in ReorderBufferQueueMessage.

Basically, ReorderBufferQueueMessage also calls
ReorderBufferQueueChange internally for transactional changes.

Yes, that is correct, but I was thinking about the non-transactional
part, due to the code below.

else
{
ReorderBufferTXN *txn = NULL;
volatile Snapshot snapshot_now = snapshot;

if (xid != InvalidTransactionId)
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

Even though we are using txn here, I think we don't need to skip it
for aborted xacts, because even without the patch such messages get
decoded irrespective of transaction status. What do you think?

But, having said that, I realize the idea of skipping the changes in
ReorderBufferQueueChange is not good, because by then we have already
allocated the memory for the change and the tuple, and it's not
correct to ReturnChanges because that will update the memory
accounting. So I think we can do it at a more centralized place,
before we process the change: maybe in LogicalDecodingProcessRecord,
before going to the switch, we can call a function from the
reorderbuffer.c layer to see whether this transaction has been
detected as aborted or not. But I have to think more along this line,
about whether we can skip all the processing of that record or not.

Your other comments look fine to me, so I will send the next patch
set and reply to them individually.

I think we cannot put this check in the higher-level functions like
LogicalDecodingProcessRecord or DecodeXXXOp, because we need to
process that xid at least for abort. So I think it is good to keep
the check inside ReorderBufferQueueChange only, and we can free the
memory of the change if the abort is detected. Also, if we just skip
those changes in ReorderBufferQueueChange, then the effect will be
localized to that particular transaction, which is already aborted.

Fair enough, and for cases like the non-transactional part of
ReorderBufferQueueMessage, I think we anyway need to process the
message irrespective of transaction status.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#392Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#389)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jun 25, 2020 at 7:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Review comments on various patches.

poc_shared_fileset_cleanup_on_procexit
=================================
1.
- ent->subxact_fileset =
- MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+ MemoryContext oldctx;
+ /* Shared fileset handle must be allocated in the persistent context */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+ ent->subxact_fileset = palloc(sizeof(SharedFileSet));
SharedFileSetInit(ent->subxact_fileset, NULL);
+ MemoryContextSwitchTo(oldctx);
fd = BufFileCreateShared(ent->subxact_fileset, path);

Why is this change required for this patch, and why do we only cover
SharedFileSetInit in the Apply context and not BufFileCreateShared?
The comment is also not very clear on this point.

Added the comments for the same.

1.
+ /*
+ * Shared fileset handle must be allocated in the persistent context.
+ * Also, SharedFileSetInit allocates the memory for the sharedfileset
+ * list, so we need to allocate that in the long-term memory context.
+ */

How about "We need to maintain shared fileset across multiple stream
open/close calls. So, we allocate it in a persistent context."

2.
+ /*
+ * If the caller is following the dsm based cleanup then we don't
+ * maintain the filesetlist so return.
+ */
+ if (filesetlist == NULL)
+ return;

The check here should use 'NIL' instead of 'NULL'

Other than that, the changes in this particular patch look good to me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#393Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#385)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Comments on other patches:
=========================

Replying to the pending comments.

6.
In v29-0002-Issue-individual-invalidations-with-wal_level-lo,
xact_desc_invalidations seems to be a subset of
standby_desc_invalidations; can we have common code for them?

Done

7.
I think we can avoid sending v29-0007-Track-statistics-for-streaming
each time. We can do this after the main patch is complete.
Also, we might need to change how and where these stats will be
tracked. See the related discussion [1].

Removed

8. In v29-0005-Implement-streaming-mode-in-ReorderBuffer,
* Return oldest transaction in reorderbuffer
@@ -863,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb,
TransactionId xid,
/* set the reference to top-level transaction */
subtxn->toptxn = txn;

+ /* set the reference to toplevel transaction */
+ subtxn->toptxn = txn;
+

There is a double initialization of subtxn->toptxn. You need to
remove this line from the 0005 patch, as we have now added it in an
earlier patch.

Done

9. I think you forgot to update the patch to execute invalidations in
the abort case, or I might be missing something. I don't see any
changes in ReorderBufferAbort. You agreed in one of the emails above
[2] about handling this.

Done, check 0005

10. In v29-0008-Add-support-for-streaming-to-built-in-replicatio,
apply_handle_stream_commit(StringInfo s)
{
..
+ /*
+ * send feedback to upstream
+ *
+ * XXX Probably should send a valid LSN. But which one?
+ */
+ send_feedback(InvalidXLogRecPtr, false, false);
..
}

I have given a comment on this code that we don't need this feedback,
and you mentioned on June 02 [3] that you would think about it and
let me know your opinion, but I don't see a response from you yet.
Can you get back to me regarding this point?

Yeah, I have analyzed this, and it seems we don't need it. The
feedback-sending mechanism here should be the same as in the
non-streaming mode. I don't see any reason for sending extra feedback
on commit.

11. Add some comments as to why we have used the Shared BufFile
interface instead of the Temp BufFile interface.

Done

12. In v29-0013-Change-buffile-interface-required-for-streaming,
+ * Initialize a space for temporary files that can be opened other backends.

/opened other backends/opened for access by other backends

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v30.tar (application/x-tar)
v30/v30-0002-WAL-Log-invalidations-at-command-end-with-wal_le.patch:

From 04375139573c36184bd837b016dc2f586d953773 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v30 02/14] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end use a new xlog record type
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in top-transaction, and then
executed during replay.  This obviates the need to decode the
invalidations as part of a commit record.

LogStandbyInvalidations accumulated all the invalidations in memory
and wrote them only once, at commit time, which may reduce the
performance impact by amortizing the overhead and deduplicating the
invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c        | 10 ++++
 src/backend/access/transam/xact.c             |  7 +++
 src/backend/replication/logical/decode.c      | 58 +++++++++++--------
 .../replication/logical/reorderbuffer.c       | 52 ++++++++++++++---
 src/backend/utils/cache/inval.c               | 56 ++++++++++++++++++
 src/include/access/xact.h                     | 13 ++++-
 src/include/replication/reorderbuffer.h       |  3 +
 7 files changed, 166 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce75565f..68aa994c9e 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a93fb8a4f0..d93b40f2f8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6022,6 +6022,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371739..7153ebaa96 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions,
+				 * otherwise, accumulate them so that they can be processed at
+				 * the commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 642a1c767f..4b277fe6f9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -860,6 +860,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2205,7 +2208,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases where we skip processing the
+ * transaction (see ReorderBufferForget), we need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2216,17 +2223,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2254,6 +2279,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark top-level transaction as having catalog changes too if one of its
+	 * children has so that the ReorderBufferBuildTupleCidHash can
+	 * conveniently check just top-level transaction and decide whether to
+	 * build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..7d4fd9fd72 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *      CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +214,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1094,6 +1100,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1512,48 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+static void
+LogLogicalInvalidations()
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555367..ac3f5e3b60 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..74ffe7852f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
-- 
2.23.0

v30/v30-0009-Add-TAP-test-for-streaming-vs.-DDL.patch:

From 22e7c22930a5c359bf9c1bb401256a2504471664 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v30 09/14] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v30/v30-0007-Add-support-for-streaming-to-built-in-replicatio.patch:

From b6685198b72923e0f797a77fcad3c8b6b124223c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 15:34:29 +0530
Subject: [PATCH v30 07/14] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, to identify in-progress
transactions and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We don't
need to replicate the changes accumulated during this phase, and
moreover we don't have a replication connection open, so we have
nowhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml      |    4 +-
 doc/src/sgml/ref/create_subscription.sgml     |   11 +
 src/backend/catalog/pg_subscription.c         |    1 +
 src/backend/commands/subscriptioncmds.c       |   45 +-
 src/backend/postmaster/pgstat.c               |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |    3 +
 src/backend/replication/logical/proto.c       |  140 ++-
 src/backend/replication/logical/worker.c      | 1012 +++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c   |  318 +++++-
 src/backend/replication/slotfuncs.c           |    6 +
 src/backend/replication/walsender.c           |    6 +
 src/include/catalog/pg_subscription.h         |    3 +
 src/include/pgstat.h                          |    6 +-
 src/include/replication/logicalproto.h        |   42 +-
 src/include/replication/walreceiver.h         |    1 +
 src/test/subscription/t/009_stream_simple.pl  |   86 ++
 src/test/subscription/t/010_stream_subxact.pl |  102 ++
 src/test/subscription/t/011_stream_ddl.pl     |   95 ++
 .../t/012_stream_subxact_abort.pl             |   82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |   84 ++
 20 files changed, 2019 insertions(+), 40 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index c24ace14d1..d8de56c928 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 5bbc165f70..c25b7c5962 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026187..9065a1be1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c022597bc0..a55ccc0c03 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4138,6 +4138,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
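+		/*
+		 * Ask the output plugin to stream large in-progress transactions;
+		 * the resulting command may then look e.g. like
+		 *
+		 *	START_REPLICATION ... (proto_version '2', streaming 'on',
+		 *						   publication_names '"mypub"')
+		 */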
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..83d0642cf3 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
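+/*
+ * Write STREAM START to the output stream.
+ *
+ * A large in-progress transaction is sent as one or more blocks of changes
+ * bracketed by STREAM START and STREAM STOP messages, followed by a final
+ * STREAM COMMIT or STREAM ABORT.
+ */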
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
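+/*
+ * Read STREAM START from the stream, returning the toplevel transaction
+ * XID and setting *first_segment if this is the first block of changes
+ * for that transaction.
+ */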
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
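+/*
+ * Write STREAM STOP to the output stream.
+ */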
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
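+/*
+ * Write STREAM COMMIT to the output stream, including the commit LSN,
+ * end LSN and commit time of the transaction.
+ */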
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
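+/*
+ * Read STREAM COMMIT from the stream, filling in commit_data and
+ * returning the XID of the committed transaction.
+ */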
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
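+/*
+ * Write STREAM ABORT to the output stream. Note that xid and subxid are
+ * the same when the whole toplevel transaction is aborted.
+ */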
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
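+/*
+ * Read STREAM ABORT from the stream, returning the XIDs of the toplevel
+ * transaction and the aborted subtransaction (which are the same for a
+ * toplevel abort).
+ */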
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..d2d9469999 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied all at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, streamed transactions require us
+ * to handle aborts of both the toplevel transaction and subtransactions.
+ * This is achieved by tracking offsets of subtransactions, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory, and the filenames
+ * include the PID of the worker, the OID of the subscription and the XID
+ * of the toplevel transaction. This is necessary so that different workers
+ * processing a remote transaction with the same XID don't interfere.
+ *
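+ * For example (using purely illustrative values), a worker with PID 12345
+ * applying subscription OID 16399 would serialize the toplevel transaction
+ * with XID 512 into (assuming the default tablespace)
+ *
+ *	base/pgsql_tmp/pgsql_tmp12345-16399-512.changes
+ *	base/pgsql_tmp/pgsql_tmp12345-16399-512.subxacts
+ *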
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -100,6 +124,7 @@ typedef struct SlotErrCallbackArg
 } SlotErrCallbackArg;
 
 static MemoryContext ApplyMessageContext = NULL;
+static MemoryContext LogicalStreamingContext = NULL;
 MemoryContext ApplyContext = NULL;
 
 WalReceiverConn *wrconn = NULL;
@@ -110,12 +135,58 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * XIDs of toplevel transactions for which we have serialized changes to
+ * files (so the files can be cleaned up at worker exit).
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because apply_handle_stream_commit() calls it */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +258,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the change to the file for the proper toplevel
+ * transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -552,6 +659,326 @@ apply_handle_origin(StringInfo s)
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify the apply handlers that we're processing a streamed transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, read the serialized subxact info,
+	 * so that we can add to it.
+	 *
+	 * XXX Note that cleanup of stale files is performed by stream_open_file.
+	 */
+	if (!first_segment)
+	{
+		MemoryContext oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+
+		/* Read the subxacts info in per-stream context. */
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+		MemoryContextSwitchTo(oldctx);
+	}
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+	{
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're
+		 * likely aborting one of the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		int			fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
+		if (fd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+		}
+
+		/* OK, truncate the file at the right offset. */
+		if (ftruncate(fd, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+		CloseTransientFile(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them to apply_dispatch, just
+	 * like changes arriving over the network.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m", path)));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure the message is applied in our per-message memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +992,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1010,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1049,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1167,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1312,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1685,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1826,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1477,6 +1938,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
+/*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleaning up files for %d streamed transactions", nxids);
+
+	for (i = nxids - 1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
 /*
  * Apply main loop.
  */
@@ -1493,6 +1970,17 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode
+	 * is enabled.  The context is reset at each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2429,529 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
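+ *
+ * The file layout is simply a uint32 count followed by that many
+ * SubXactInfo entries, e.g. (with illustrative values)
+ *
+ *	nsubxacts = 2
+ *	{ xid = 513, offset = 0 }
+ *	{ xid = 514, offset = 8192 }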
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Free the memory allocated for the subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * The array is allocated in the caller's current memory context.
+	 * During stream start that is the LogicalStreamingContext, which is
+	 * reset at stream stop; during stream abort the memory is needed only
+	 * briefly, so it ends up in ApplyMessageContext.
+	 */
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd >= 0);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're adding a change for the same subxact as the last
+	 * call, in which case its first-change offset is already recorded, so
+	 * just ignore it.
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Remove the XID from the array - find its index and then fill the
+	 * hole by moving the last element into its place. The array is bound
+	 * to be fairly small (maximum number of in-progress xacts, so
+	 * max_connections + max_prepared_transactions), so simply loop
+	 * through the array to find the index of the XID.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry into the place of the removed one. We don't
+	 * keep the streamed transactions sorted or anything - we only expect
+	 * a few of them in progress (max_connections + max_prepared_xacts)
+	 * so linear search is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by the given xid. If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		/* Need to allocate this in permanent context */
+		oldcxt = MemoryContextSwitchTo(ApplyContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (which does not
+ * count the length field itself), an action code (identifying the message
+ * type) and the message contents (without the subxact TransactionId value).
+ *
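+ * For example, an INSERT change ends up on disk as
+ *
+ *	int32	len		size of the action char + data
+ *	char	action	'I'
+ *	...		data	the original message, minus the leading XID
+ *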
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
+
+/*
+ * Free the memory for the subxacts array and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3117,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3118..1509f9b826 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;	/* true while streaming a chunk of a transaction */
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order the transactions are sent in.  Also, the (sub) transactions
+ * might get aborted, so we need to send the schema for each (sub)
+ * transaction so that we don't lose the schema information on abort.  To
+ * handle this, we maintain a list of xids (streamed_txns) for which we
+ * have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version,
+		 * and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +373,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't
+	 * care whether it's a top-level transaction or not (we have already
+	 * sent that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may only be applied later (and the regular
+	 * transactions won't see their effects until then), and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +423,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +465,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +486,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +518,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +562,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +582,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +607,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +639,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +719,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
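+/*
+ * Notify downstream that a block of changes from a streamed in-progress
+ * transaction follows.
+ */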
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
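+/*
+ * Notify downstream that we're done streaming a block of changes for the
+ * in-progress transaction.
+ */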
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +840,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * Check whether the schema of the relation was already sent in the given
+ * streamed transaction. We expect a relatively small number of streamed
+ * transactions, so a simple linear search of the list is fine.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record (in the rel sync entry) that we have already sent the schema of
+ * the relation in the given streamed transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -753,11 +1002,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
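
A note for reviewers on how the two schema-tracking helpers above are
meant to be used.  This is an illustration only, not part of the patch;
the hypothetical maybe_send_schema() stands in for the actual call sites
in pgoutput_change() and pgoutput_truncate():

/*
 * In streaming mode, the "schema sent" state must be tracked per
 * toplevel XID, because a concurrent abort may discard whatever the
 * subscriber has already received for that transaction.
 */
static void
maybe_send_schema(LogicalDecodingContext *ctx, TransactionId topxid,
				  Relation relation, RelationSyncEntry *relentry)
{
	bool		schema_sent;

	if (in_streaming)
		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
	else
		schema_sent = relentry->schema_sent;

	if (schema_sent)
		return;

	/* ... send the schema via logicalrep_write_typ/logicalrep_write_rel ... */

	if (in_streaming)
		set_schema_sent_in_streamed_txn(relentry, topxid);
	else
		relentry->schema_sent = true;
}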
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 06e4955de7..5f74ca1eed 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -157,6 +157,12 @@ create_logical_replication_slot(char *name, char *plugin,
 											   .segment_close = wal_segment_close),
 									NULL, NULL, NULL);
 
+	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
 	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e2477c47e0..1abf243356 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
+		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
 		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..6352ff945a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -981,7 +981,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
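
To make the new protocol messages easier to review, here is a sketch of
the subscriber-side dispatch they imply.  The message letters are an
assumption inferred from the write/read pairs above; the authoritative
values are in the patch itself:

	char		action = pq_getmsgbyte(s);

	switch (action)
	{
		case 'S':				/* logicalrep_write_stream_start() */
			apply_handle_stream_start(s);
			break;
		case 'E':				/* logicalrep_write_stream_stop() */
			apply_handle_stream_stop(s);
			break;
		case 'c':				/* logicalrep_write_stream_commit() */
			apply_handle_stream_commit(s);
			break;
		case 'A':				/* logicalrep_write_stream_abort() */
			apply_handle_stream_abort(s);
			break;
	}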
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ac1acbb27e..95132062c6 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
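
For illustration, with the new field the apply worker is expected to
request streaming roughly as follows when it starts replication (a
sketch; MySubscription->publications is assumed to hold the
subscription's publication list):

	WalRcvStreamOptions options;

	options.logical.proto_version = LOGICALREP_PROTO_STREAM_VERSION_NUM;
	options.logical.publication_names = MySubscription->publications;
	options.logical.streaming = MySubscription->stream;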
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check data consistency after aborted subtransactions with DDL');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v30/v30-0013-Worker-tempfile-use-the-shared-buffile-infrastru.patch
From 5a797521ccba4912dae9877fc25f2b95a2ab05f7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 16:42:07 +0530
Subject: [PATCH v30 13/14] Worker tempfile use the shared buffile
 infrastructure

To be merged with 0008; kept separate to make the review easier.
---
 src/backend/replication/logical/worker.c | 630 ++++++++++-------------
 1 file changed, 281 insertions(+), 349 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index d2d9469999..a543ee973b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -32,9 +32,12 @@
  * to truncate the file with serialized changes.
  *
  * The files are placed in tmp file directory by default, and the filenames
- * include both the XID of the toplevel transaction and OID of the subscription.
- * This is necessary so that different workers processing a remote transaction
- * with the same XID don't interfere.
+ * include both the XID of the toplevel transaction and OID of the
+ * subscription.  This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use buffiles instead of plain temporary files because the buffile
+ * infrastructure supports temporary files that exceed the OS file size limit.
  *
  *-------------------------------------------------------------------------
  */
@@ -56,6 +59,7 @@
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -85,6 +89,7 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -123,10 +128,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see an xid for the first time, we
+ * create this entry in the xidhash, create the streaming file and store
+ * the fileset handle, so that on subsequent streams for the same xid we
+ * can look up the entry and get the fileset handle.  The subxact file is
+ * created only if there is any subxact info under this xid.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
-static MemoryContext LogicalStreamingContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
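
To summarize the lifecycle described above: the first stream chunk of a
transaction creates the hash entry and its fileset; every later chunk
only looks them up.  A sketch of the lookup side, mirroring the
hash_search() calls used throughout this patch:

	char		path[MAXPGPATH];
	bool		found;
	StreamXidHash *ent;

	ent = (StreamXidHash *) hash_search(xidhash, (void *) &xid,
										HASH_FIND, &found);
	Assert(found);				/* the entry was created on the first chunk */

	changes_filename(path, MyLogicalRepWorker->subid, xid);
	stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);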
@@ -136,15 +157,26 @@ bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
 /* fields valid only when processing streamed transaction */
-bool	in_streamed_transaction = false;
+bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
-static int	stream_fd = -1;
+
+/*
+ * Hash table storing the streaming xid information along with the shared
+ * filesets for the streaming and subxact files.  On every stream start we
+ * need to open the xid's files, and for that we need the shared fileset
+ * handle, so keeping it in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
 
 typedef struct SubXactInfo
 {
-	TransactionId xid;						/* XID of the subxact */
-	off_t           offset;					/* offset in the file */
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
 } SubXactInfo;
 
 static uint32 nsubxacts = 0;
@@ -171,13 +203,6 @@ static void stream_open_file(Oid subid, TransactionId xid, bool first);
 static void stream_write_change(char action, StringInfo s);
 static void stream_close_file(void);
 
-/*
- * Array of serialized XIDs.
- */
-static int	nxids = 0;
-static int	maxnxids = 0;
-static TransactionId	*xids = NULL;
-
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
@@ -275,7 +300,7 @@ handle_streamed_transaction(const char action, StringInfo s)
 	if (!in_streamed_transaction)
 		return false;
 
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 	Assert(TransactionIdIsValid(stream_xid));
 
 	/*
@@ -666,31 +691,39 @@ static void
 apply_handle_stream_start(StringInfo s)
 {
 	bool		first_segment;
+	HASHCTL		hash_ctl;
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * Start a transaction on stream start; it will be committed on stream
+	 * stop.  We need a transaction for handling the buffile, used for
+	 * serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
 	/* notify handle methods we're processing a remote transaction */
 	in_streamed_transaction = true;
 
 	/* extract XID of the top-level transaction */
 	stream_xid = logicalrep_read_stream_start(s, &first_segment);
 
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
 	/* open the spool file for this transaction */
 	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
 
-	/*
-	 * if this is not the first segment, open existing file
-	 *
-	 * XXX Note that the cleanup is performed by stream_open_file.
-	 */
+	/* if this is not the first segment, read the subxact info */
 	if (!first_segment)
-	{
-		MemoryContext oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
-
-		/* Read the subxacts info in per-stream context. */
 		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
-		MemoryContextSwitchTo(oldctx);
-	}
 
 	pgstat_report_activity(STATE_RUNNING, NULL);
 }
@@ -710,6 +743,12 @@ apply_handle_stream_stop(StringInfo s)
 	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
 	stream_close_file();
 
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
 	in_streamed_transaction = false;
 
 	/* Reset per-stream context */
@@ -736,10 +775,7 @@ apply_handle_stream_abort(StringInfo s)
 	 * just delete the files with serialized info.
 	 */
 	if (xid == subxid)
-	{
 		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
-		return;
-	}
 	else
 	{
 		/*
@@ -761,11 +797,13 @@ apply_handle_stream_abort(StringInfo s)
 
 		int64		i;
 		int64		subidx;
-		int			fd;
+		BufFile    *fd;
 		bool		found = false;
 		char		path[MAXPGPATH];
+		StreamXidHash *ent;
 
 		subidx = -1;
+		ensure_transaction();
 		subxact_info_read(MyLogicalRepWorker->subid, xid);
 
 		/* XXX optimize the search by bsearch on sorted data */
@@ -787,33 +825,32 @@ apply_handle_stream_abort(StringInfo s)
 		{
 			/* Cleanup the subxact info */
 			cleanup_subxact_info();
+			CommitTransactionCommand();
 			return;
 		}
 
 		Assert((subidx >= 0) && (subidx < nsubxacts));
 
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
 		changes_filename(path, MyLogicalRepWorker->subid, xid);
-		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
-		if (fd < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not open file \"%s\": %m",
-							path)));
-		}
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
 
-		/* OK, truncate the file at the right offset. */
-		if (ftruncate(fd, subxacts[subidx].offset))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not truncate file \"%s\": %m", path)));
-		CloseTransientFile(fd);
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
 
 		/* discard the subxacts added later */
 		nsubxacts = subidx;
 
 		/* write the updated subxact list */
 		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
 	}
 }
 
@@ -823,16 +860,16 @@ apply_handle_stream_abort(StringInfo s)
 static void
 apply_handle_stream_commit(StringInfo s)
 {
-	int			fd;
 	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
-
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
+	bool		found;
 	LogicalRepCommitData commit_data;
-
+	StreamXidHash *ent;
 	MemoryContext oldcxt;
+	BufFile    *fd;
 
 	Assert(!in_streamed_transaction);
 
@@ -840,25 +877,20 @@ apply_handle_stream_commit(StringInfo s)
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
 
-	/* open the spool file for the committed transaction */
-	changes_filename(path, MyLogicalRepWorker->subid, xid);
-
-	elog(DEBUG1, "replaying changes from file '%s'", path);
-
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-	}
-
 	ensure_transaction();
-
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	buffer = palloc(8192);
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
 	initStringInfo(&s2);
 
 	MemoryContextSwitchTo(oldcxt);
@@ -881,9 +913,7 @@ apply_handle_stream_commit(StringInfo s)
 		int			len;
 
 		/* read length of the on-disk record */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		nbytes = read(fd, &len, sizeof(len));
-		pgstat_report_wait_end();
+		nbytes = BufFileRead(fd, &len, sizeof(len));
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -891,16 +921,9 @@ apply_handle_stream_commit(StringInfo s)
 
 		/* do we have a correct length? */
 		if (nbytes != sizeof(len))
-		{
-			int			save_errno = errno;
-
-			CloseTransientFile(fd);
-			errno = save_errno;
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not read file: %m")));
-			return;
-		}
+					 errmsg("could not read from streaming transaction's changes file: %m")));
 
 		Assert(len > 0);
 
@@ -908,19 +931,10 @@ apply_handle_stream_commit(StringInfo s)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		if (read(fd, buffer, len) != len)
-		{
-			int			save_errno = errno;
-
-			CloseTransientFile(fd);
-			errno = save_errno;
+		if (BufFileRead(fd, buffer, len) != len)
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not read file: %m")));
-			return;
-		}
-		pgstat_report_wait_end();
+					 errmsg("could not read from streaming transaction's changes file: %m")));
 
 		/* copy the buffer to the stringinfo and call apply_dispatch */
 		resetStringInfo(&s2);
@@ -948,15 +962,11 @@ apply_handle_stream_commit(StringInfo s)
 		 */
 		send_feedback(InvalidXLogRecPtr, false, false);
 	}
-
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	BufFileClose(fd);
 
 	/*
-	 * Update origin state so we can restart streaming from correct
-	 * position in case of crash.
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
 	 */
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
@@ -1946,12 +1956,39 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 static void
 worker_onexit(int code, Datum arg)
 {
-	int	i;
+	HASH_SEQ_STATUS status;
+	StreamXidHash *ent;
+	char		path[MAXPGPATH];
+
+	/* nothing to clean */
+	if (xidhash == NULL)
+		return;
+
+	/*
+	 * Scan the complete hash and delete the underlying files for the xids;
+	 * also release the memory for the shared filesets.
+	 */
+	hash_seq_init(&status, xidhash);
+	while ((ent = (StreamXidHash *) hash_seq_search(&status)) != NULL)
+	{
+		changes_filename(path, MyLogicalRepWorker->subid, ent->xid);
+		BufFileDeleteShared(ent->stream_fileset, path);
+		pfree(ent->stream_fileset);
 
-	elog(LOG, "cleanup files for %d transactions", nxids);
+		/*
+		 * We might not have created the subxact fileset if there are no
+		 * subtransactions.
+		 */
+		if (ent->subxact_fileset)
+		{
+			subxact_filename(path, MyLogicalRepWorker->subid, ent->xid);
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+		}
+	}
 
-	for (i = nxids-1; i >= 0; i--)
-		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+	/* Remove the xid hash */
+	hash_destroy(xidhash);
 }
 
 /*
@@ -1972,11 +2009,11 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 
 	/*
 	 * This memory context used for per stream data when streaming mode is
-	 * enabled.  This context is reeset on each stream stop.
+	 * enabled.  This context is reset on each stream stop.
 	 */
 	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
 													"LogicalStreamingContext",
-													 ALLOCSET_DEFAULT_SIZES);
+													ALLOCSET_DEFAULT_SIZES);
 
 	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
 	before_shmem_exit(worker_onexit, (Datum) 0);
@@ -2085,7 +2122,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -2441,64 +2478,62 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 static void
 subxact_info_write(Oid subid, TransactionId xid)
 {
-	int			fd;
 	char		path[MAXPGPATH];
+	bool		found;
 	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
 
 	Assert(TransactionIdIsValid(xid));
 
 	subxact_filename(path, subid, xid);
 
-	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	len = sizeof(SubXactInfo) * nsubxacts;
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for the top-level transaction by now */
+	Assert(found);
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
-
-	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	/*
+	 * If there are no subtransactions, there is nothing to do; but if a
+	 * subxact file already exists, delete it.
+	 */
+	if (nsubxacts == 0)
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
 		return;
 	}
 
-	if ((len > 0) && (write(fd, subxacts, len) != len))
+	/*
+	 * Create the subxact file if it does not exist yet, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
 	{
-		int			save_errno = errno;
+		ent->subxact_fileset =
+			MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
 
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
-		return;
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
 	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
 
-	pgstat_report_wait_end();
+	len = sizeof(SubXactInfo) * nsubxacts;
 
-	/*
-	 * We don't need to fsync or anything, as we'll recreate the files after a
-	 * crash from scratch. So just close the file.
-	 */
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
 
 	/*
 	 * But we free the memory allocated for subxact info. There might be one
@@ -2513,50 +2548,45 @@ subxact_info_write(Oid subid, TransactionId xid)
  *	  Restore information about subxacts of a streamed transaction.
  *
  * Read information about subxacts into the global variables.
- *
- * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
  */
 static void
 subxact_info_read(Oid subid, TransactionId xid)
 {
-	int			fd;
 	char		path[MAXPGPATH];
+	bool		found;
 	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
 
 	Assert(TransactionIdIsValid(xid));
 	Assert(!subxacts);
 	Assert(nsubxacts == 0);
 	Assert(nsubxacts_max == 0);
 
-	subxact_filename(path, subid, xid);
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
 
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
+	/*
+	 * If subxact_fileset is not set, we don't have any subxact info for
+	 * this transaction.
+	 */
+	if (ent->subxact_fileset == NULL)
 		return;
-	}
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+	subxact_filename(path, subid, xid);
 
-	/* read number of subxact items */
-	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
-	{
-		int			save_errno = errno;
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
 
-		CloseTransientFile(fd);
-		errno = save_errno;
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
 						path)));
-		return;
-	}
-
-	pgstat_report_wait_end();
 
 	len = sizeof(SubXactInfo) * nsubxacts;
 
@@ -2564,35 +2594,23 @@ subxact_info_read(Oid subid, TransactionId xid)
 	nsubxacts_max = 1 << my_log2(nsubxacts);
 
 	/*
-	 * Let the caller decide which memory context it will be allocated.
-	 * Ideally, during stream start it will be allocated in the
-	 * LogicalStreamingContext which will be reset on stream stop, and
-	 * during the stream abort we need this memory only for short term so
-	 * it will be allocated in ApplyMessageContext.
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need it for the whole duration of the stream, so that new
+	 * subtransactions can be appended to it.  On stream stop the
+	 * information is flushed to the subxact file and the logical streaming
+	 * context is reset.
 	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
 	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
-
-	if ((len > 0) && ((read(fd, subxacts, len)) != len))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
 						path)));
-		return;
-	}
-
-	pgstat_report_wait_end();
 
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	BufFileClose(fd);
 }
 
 /*
@@ -2606,7 +2624,7 @@ subxact_info_add(TransactionId xid)
 
 	/* We must have a valid top level stream xid and a stream fd. */
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd >= 0);
+	Assert(stream_fd != NULL);
 
 	/*
 	 * If the XID matches the toplevel transaction, we don't want to add it.
@@ -2658,7 +2676,13 @@ subxact_info_add(TransactionId xid)
 	}
 
 	subxacts[nsubxacts].xid = xid;
-	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	/*
+	 * Get the current position in the stream file and store it as the
+	 * starting offset of this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
 
 	nsubxacts++;
 }
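
The (fileno, offset) pair recorded here is the other half of the abort
protocol: when a subxact rolls back, apply_handle_stream_abort() feeds
the same pair back to BufFileTruncateShared() to discard that subxact's
changes, as in the hunk earlier in this patch:

	/* discard everything the aborted subxact wrote to the changes file */
	BufFileTruncateShared(fd, subxacts[subidx].fileno,
						  subxacts[subidx].offset);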
@@ -2667,44 +2691,14 @@ subxact_info_add(TransactionId xid)
 static void
 subxact_filename(char *path, Oid subid, TransactionId xid)
 {
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 */
-	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m",
-						tempdirpath)));
-
-	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
-			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
 }
 
 /* format filename for file containing serialized changes */
-static void
+static inline void
 changes_filename(char *path, Oid subid, TransactionId xid)
 {
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 */
-	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m",
-						tempdirpath)));
-
-	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
-			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
 }
 
 /*
@@ -2721,60 +2715,29 @@ changes_filename(char *path, Oid subid, TransactionId xid)
 static void
 stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
 {
-	int			i;
 	char		path[MAXPGPATH];
-	bool		found = false;
+	StreamXidHash *ent;
 
-	subxact_filename(path, subid, xid);
-
-	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
 
+	/* Delete the change file and release the stream fileset memory */
 	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
 
-	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
-
-	/*
-	 * Cleanup the XID from the array - find the XID in the array and
-	 * remove it by shifting all the remaining elements. The array is
-	 * bound to be fairly small (maximum number of in-progress xacts,
-	 * so max_connections + max_prepared_transactions) so simply loop
-	 * through the array and find index of the XID. Then move the rest
-	 * of the array by one element to the left.
-	 *
-	 * Notice we also call this from stream_open_file for first segment
-	 * of each transaction, to deal with possible left-overs after a
-	 * crash, so it's entirely possible not to find the XID in the
-	 * array here. In that case we don't remove anything.
-	 *
-	 * XXX Perhaps it'd be better to handle this automatically after a
-	 * restart, instead of doing it over and over for each transaction.
-	 */
-	for (i = 0; i < nxids; i++)
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
 	{
-		if (xids[i] == xid)
-		{
-			found = true;
-			break;
-		}
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
 	}
-
-	if (!found)
-		return;
-
-	/*
-	 * Move the last entry from the array to the place. We don't keep
-	 * the streamed transactions sorted or anything - we only expect
-	 * a few of them in progress (max_connections + max_prepared_xacts)
-	 * so linear search is just fine.
-	 */
-	xids[i] = xids[nxids-1];
-	nxids--;
 }
 
 /*
@@ -2783,8 +2746,8 @@ stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
  *
  * Open a file for streamed changes from a toplevel transaction identified
  * by stream_xid (global variable). If it's the first chunk of streamed
- * changes for this transaction, perform cleanup by removing existing
- * files after a possible previous crash.
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
  *
  * This can only be called at the beginning of a "streaming" block, i.e.
  * between stream_start/stream_stop messages from the upstream.
@@ -2793,79 +2756,61 @@ static void
 stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 {
 	char		path[MAXPGPATH];
-	int			flags;
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
 
 	Assert(in_streamed_transaction);
 	Assert(OidIsValid(subid));
 	Assert(TransactionIdIsValid(xid));
-	Assert(stream_fd == -1);
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER | HASH_FIND,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
 
 	/*
-	 * If this is the first segment for this transaction, try removing
-	 * existing files (if there are any, possibly after a crash).
+	 * Create/open the buffiles under the logical streaming context, so that
+	 * the file handles survive until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
 	 */
 	if (first_segment)
 	{
-		MemoryContext	oldcxt;
-
-		/* XXX make sure there are no previous files for this transaction */
-		stream_cleanup_files(subid, xid, true);
-
-		/* Need to allocate this in permanent context */
-		oldcxt = MemoryContextSwitchTo(ApplyContext);
-
 		/*
-		 * We need to remember the XIDs we spilled to files, so that we can
-		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
-		 *
-		 * The number of XIDs we may need to track is fairly small, because
-		 * we can only stream toplevel xacts (so limited by max_connections
-		 * and max_prepared_transactions), and we only stream the large ones.
-		 * So we simply keep the XIDs in an unsorted array. If the number of
-		 * xacts gets large for some reason (e.g. very high max_connections),
-		 * a more elaborate approach might be better - e.g. sorted array, to
-		 * speed-up the lookups.
+		 * The shared fileset handle must be allocated in a persistent context.
 		 */
-		if (nxids == maxnxids)	/* array of XIDs is full */
-		{
-			if (!xids)
-			{
-				maxnxids = 64;
-				xids = palloc(maxnxids * sizeof(TransactionId));
-			}
-			else
-			{
-				maxnxids = 2 * maxnxids;
-				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
-			}
-		}
+		SharedFileSet *fileset =
+		MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
 
-		xids[nxids++] = xid;
+		SharedFileSetInit(fileset, NULL);
+		stream_fd = BufFileCreateShared(fileset, path);
 
-		MemoryContextSwitchTo(oldcxt);
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
 	}
-
-	changes_filename(path, subid, xid);
-
-	elog(DEBUG1, "opening file '%s' for streamed changes", path);
-
-	/*
-	 * If this is the first streamed segment, the file must not exist, so
-	 * make sure we're the ones creating it. Otherwise just open the file
-	 * for writing, in append mode.
-	 */
-	if (first_segment)
-		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
 	else
-		flags = (O_WRONLY | O_APPEND | PG_BINARY);
-
-	stream_fd = OpenTransientFile(path, flags);
-
-	if (stream_fd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
+	{
+		/*
+		 * Open the file and seek to the end of the file because we always
+		 * append the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+	MemoryContextSwitchTo(oldcxt);
 }
 
 /*
@@ -2880,12 +2825,12 @@ stream_close_file(void)
 {
 	Assert(in_streamed_transaction);
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 
-	CloseTransientFile(stream_fd);
+	BufFileClose(stream_fd);
 
 	stream_xid = InvalidTransactionId;
-	stream_fd = -1;
+	stream_fd = NULL;
 }
 
 /*
@@ -2907,34 +2852,21 @@ stream_write_change(char action, StringInfo s)
 
 	Assert(in_streamed_transaction);
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
-
 	/* first write the size */
-	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
+	BufFileWrite(stream_fd, &len, sizeof(len));
 
 	/* then the action */
-	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
+	BufFileWrite(stream_fd, &action, sizeof(action));
 
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
-	if (write(stream_fd, &s->data[s->cursor], len) != len)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
-
-	pgstat_report_wait_end();
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
 }
 
 /*
-- 
2.23.0

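A note on the on-disk format this gives us: stream_write_change frames each
change as a length word (counting the action byte plus the payload), then the
one-byte action, then the payload itself. Below is a small stand-alone sketch
of that framing in plain C, using stdio instead of the backend's BufFile API;
the function names and error handling here are illustrative only, not part of
the patch.

#include <stdio.h>
#include <stdlib.h>

/* Write one framed change: total length (action + payload), action, payload. */
static int
write_framed_change(FILE *f, char action, const char *payload, size_t payload_len)
{
	size_t		len = payload_len + sizeof(char);	/* includes the action byte */

	if (fwrite(&len, sizeof(len), 1, f) != 1 ||
		fwrite(&action, sizeof(action), 1, f) != 1 ||
		fwrite(payload, 1, payload_len, f) != payload_len)
		return -1;
	return 0;
}

/* Read one framed change back; returns a malloc'd payload, NULL on EOF/error. */
static char *
read_framed_change(FILE *f, char *action, size_t *payload_len)
{
	size_t		len;
	char	   *payload;

	if (fread(&len, sizeof(len), 1, f) != 1)
		return NULL;			/* clean EOF, or a read error */
	if (fread(action, sizeof(*action), 1, f) != 1)
		return NULL;
	*payload_len = len - sizeof(char);
	payload = malloc(*payload_len);
	if (payload == NULL || fread(payload, 1, *payload_len, f) != *payload_len)
	{
		free(payload);
		return NULL;
	}
	return payload;
}
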
v30/v30-0014-POC-On_procexit_cleanup.patch

From a56acb4277cb4979464abdc25f00b5cfb6dcd66c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Fri, 26 Jun 2020 11:30:13 +0530
Subject: [PATCH v30 14/14] POC On_procexit_cleanup

---
 src/backend/replication/logical/worker.c | 70 ++++++------------------
 src/backend/storage/file/buffile.c       |  3 +
 src/backend/storage/file/sharedfileset.c | 62 +++++++++++++++++++++
 src/include/storage/sharedfileset.h      |  1 +
 4 files changed, 84 insertions(+), 52 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a543ee973b..a6d52a8793 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1948,49 +1948,6 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 	}
 }
 
-/*
- * Cleanup function.
- *
- * Called on logical replication worker exit.
- */
-static void
-worker_onexit(int code, Datum arg)
-{
-	HASH_SEQ_STATUS status;
-	StreamXidHash *ent;
-	char		path[MAXPGPATH];
-
-	/* nothing to clean */
-	if (xidhash == NULL)
-		return;
-
-	/*
-	 * Scan complete hash and delete the underlying files for the xids.
-	 * Also release the memory for the shared file sets.
-	 */
-	hash_seq_init(&status, xidhash);
-	while ((ent = (StreamXidHash *) hash_seq_search(&status)) != NULL)
-	{
-		changes_filename(path, MyLogicalRepWorker->subid, ent->xid);
-		BufFileDeleteShared(ent->stream_fileset, path);
-		pfree(ent->stream_fileset);
-
-		/*
-		 * We might not have created the subxact fileset if there is no sub
-		 * transaction.
-		 */
-		if (ent->subxact_fileset)
-		{
-			subxact_filename(path, MyLogicalRepWorker->subid, ent->xid);
-			BufFileDeleteShared(ent->subxact_fileset, path);
-			pfree(ent->subxact_fileset);
-		}
-	}
-
-	/* Remove the xid hash */
-	hash_destroy(xidhash);
-}
-
 /*
  * Apply main loop.
  */
@@ -2015,9 +1972,6 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 													"LogicalStreamingContext",
 													ALLOCSET_DEFAULT_SIZES);
 
-	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
-	before_shmem_exit(worker_onexit, (Datum) 0);
-
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -2518,10 +2472,17 @@ subxact_info_write(Oid subid, TransactionId xid)
 	 */
 	if (ent->subxact_fileset == NULL)
 	{
-		ent->subxact_fileset =
-			MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+		MemoryContext oldctx;
 
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so we allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
 		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
 		fd = BufFileCreateShared(ent->subxact_fileset, path);
 	}
 	else
@@ -2787,13 +2748,18 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 	 */
 	if (first_segment)
 	{
+		MemoryContext oldctx;
+		SharedFileSet *fileset;
+
 		/*
-		 * Shared fileset handle must be allocated in the persistent context.
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so we allocate it in a persistent context.
 		 */
-		SharedFileSet *fileset =
-		MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
-
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
 		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
 		stream_fd = BufFileCreateShared(fileset, path);
 
 		/* Remember the fileset for the next stream of the same transaction */
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index bde6fa1ef3..502875a09c 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -364,6 +364,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 		CHECK_FOR_INTERRUPTS();
 	}
 
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
+
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
 }
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 0907f796e3..c9ccb84c0a 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -25,10 +25,14 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	/* Register our cleanup callback. */
 	if (seg)
 		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		if (filesetlist == NIL)
+			on_proc_exit(SharedFileSetOnProcExit, 0);
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -213,6 +224,57 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 		SharedFileSetDeleteAll(fileset);
 }
 
+/*
+ * Callback function invoked on process exit.  This walks the list of
+ * all registered shared filesets and deletes the underlying files for
+ * each of them.
+ */
+static void
+SharedFileSetOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister a shared fileset entry that was registered for cleanup on
+ * proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is using DSM-based cleanup, we don't maintain the
+	 * filesetlist, so there is nothing to do.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	/* Loop over all the shared fileset entries to find the input fileset */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+	Assert(found);
+}
+
 /*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index b2f4ba4bd8..d5edb600af 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -42,5 +42,6 @@ extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
2.23.0

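The core idea of this POC is a general cleanup pattern: register a single
process-exit callback the first time a resource is created, keep the live
resources in a backend-local list, and unregister entries that are cleaned up
explicitly so the exit callback only handles what is left. A minimal
stand-alone sketch of the same pattern, using plain C atexit() in place of
on_proc_exit and a hand-rolled list in place of a PostgreSQL List (all names
here are illustrative):

#include <stdio.h>
#include <stdlib.h>

typedef struct Resource
{
	const char *name;
	struct Resource *next;
} Resource;

static Resource *live_resources = NULL;

/* Exit callback: clean up everything still registered. */
static void
cleanup_on_exit(void)
{
	for (Resource *r = live_resources; r != NULL; r = r->next)
		printf("cleaning up %s\n", r->name);
	live_resources = NULL;
}

static void
register_resource(Resource *res)
{
	/* Install the exit callback on first registration, as the patch does. */
	if (live_resources == NULL)
		atexit(cleanup_on_exit);

	res->next = live_resources;
	live_resources = res;
}

/* Explicit cleanup removes the entry so the exit callback skips it. */
static void
unregister_resource(Resource *res)
{
	for (Resource **p = &live_resources; *p != NULL; p = &(*p)->next)
	{
		if (*p == res)
		{
			*p = res->next;
			return;
		}
	}
}
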
v30/v30-0006-Bugfix-handling-of-incomplete-toast-spec-insert.patch

From 61cf89a5151678b58b51b28a76a1eab15a0bf45b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Wed, 17 Jun 2020 18:22:35 +0530
Subject: [PATCH v30 06/14] Bugfix handling of incomplete toast/spec insert

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 449 +++++++++++++-----
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  50 +-
 5 files changed, 398 insertions(+), 122 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 287a185d9c..95dec05047 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7153ebaa96..2010d5a786 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3349d26447..0a591273fc 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -237,7 +252,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool partial_truncate);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -436,62 +452,71 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 /*
  * Free an ReorderBufferChange.
  */
-void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+static void
+ReorderBufferFreeChange(ReorderBuffer *rb, ReorderBufferChange *change)
 {
-	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
-
 	/* free contained data */
 	switch (change->action)
 	{
-		case REORDER_BUFFER_CHANGE_INSERT:
-		case REORDER_BUFFER_CHANGE_UPDATE:
-		case REORDER_BUFFER_CHANGE_DELETE:
-		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
-			if (change->data.tp.newtuple)
-			{
-				ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
-				change->data.tp.newtuple = NULL;
-			}
+	case REORDER_BUFFER_CHANGE_INSERT:
+	case REORDER_BUFFER_CHANGE_UPDATE:
+	case REORDER_BUFFER_CHANGE_DELETE:
+	case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+		if (change->data.tp.newtuple)
+		{
+			ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
+			change->data.tp.newtuple = NULL;
+		}
 
-			if (change->data.tp.oldtuple)
-			{
-				ReorderBufferReturnTupleBuf(rb, change->data.tp.oldtuple);
-				change->data.tp.oldtuple = NULL;
-			}
-			break;
-		case REORDER_BUFFER_CHANGE_MESSAGE:
-			if (change->data.msg.prefix != NULL)
-				pfree(change->data.msg.prefix);
-			change->data.msg.prefix = NULL;
-			if (change->data.msg.message != NULL)
-				pfree(change->data.msg.message);
-			change->data.msg.message = NULL;
-			break;
-		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
-			if (change->data.snapshot)
-			{
-				ReorderBufferFreeSnap(rb, change->data.snapshot);
-				change->data.snapshot = NULL;
-			}
-			break;
-			/* no data in addition to the struct itself */
-		case REORDER_BUFFER_CHANGE_TRUNCATE:
-			if (change->data.truncate.relids != NULL)
-			{
-				ReorderBufferReturnRelids(rb, change->data.truncate.relids);
-				change->data.truncate.relids = NULL;
-			}
-			break;
-		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
-		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
-		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-			break;
+		if (change->data.tp.oldtuple)
+		{
+			ReorderBufferReturnTupleBuf(rb, change->data.tp.oldtuple);
+			change->data.tp.oldtuple = NULL;
+		}
+		break;
+	case REORDER_BUFFER_CHANGE_MESSAGE:
+		if (change->data.msg.prefix != NULL)
+			pfree(change->data.msg.prefix);
+		change->data.msg.prefix = NULL;
+		if (change->data.msg.message != NULL)
+			pfree(change->data.msg.message);
+		change->data.msg.message = NULL;
+		break;
+	case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+		if (change->data.snapshot)
+		{
+			ReorderBufferFreeSnap(rb, change->data.snapshot);
+			change->data.snapshot = NULL;
+		}
+		break;
+		/* no data in addition to the struct itself */
+	case REORDER_BUFFER_CHANGE_TRUNCATE:
+		if (change->data.truncate.relids != NULL)
+		{
+			ReorderBufferReturnRelids(rb, change->data.truncate.relids);
+			change->data.truncate.relids = NULL;
+		}
+		break;
+	case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+	case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+	case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		break;
 	}
 
 	pfree(change);
 }
+
+/*
+ * Free a ReorderBufferChange and update the memory accounting.
+ */
+void
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+{
+	/* update memory accounting info */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
+	/* free contained data */
+	ReorderBufferFreeChange(rb, change);
+}
 
 /*
  * Get a fresh ReorderBufferTupleBuf fitting at least a tuple of size
@@ -641,17 +666,105 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 	return txn;
 }
 
+/*
+ * Handle incomplete tuples during streaming.  If streaming is enabled, we
+ * may need to stream an in-progress transaction, but sometimes we receive
+ * incomplete changes that cannot be streamed until the matching complete
+ * change arrives, e.g. a toast table insert without the main table insert.
+ * So this function remembers the LSN of the last complete change, and the
+ * size of the changes complete up to that LSN, so that when streaming we
+ * stream only up to the last complete LSN.
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert, Size total_size)
+{
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * If this is the first incomplete change, remember the size of the
+	 * changes that were complete before it.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		(toast_insert || IsSpecInsert(change->action)))
+		toptxn->complete_size = total_size;
+
+	/*
+	 * If this is a toast insert, set the corresponding bit.  Both inserts
+	 * and updates may write into the toast table, and as explained in the
+	 * function header we cannot stream toast changes on their own.  So we
+	 * set the flag on a toast insert and clear it on the next insert or
+	 * update of the main table.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec-insert bit on a speculative insert, to indicate a partial
+	 * tuple, and clear it again on the speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If there is no incomplete change left after this one, record this LSN
+	 * as the last complete LSN.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)))
+	{
+		toptxn->last_complete_lsn = change->lsn;
+
+		/*
+		 * If the transaction is serialized and the changes in the top-level
+		 * transaction are now complete, stream it immediately.  We don't
+		 * wait for the memory limit again: the transaction being serialized
+		 * means we already reached the limit once, but at that time we could
+		 * not stream it because of the incomplete tuple, so stream it as
+		 * soon as the tuple is complete.  Also, if we don't stream the
+		 * serialized changes now and more incomplete changes arrive in this
+		 * transaction, we have no way to partially truncate the serialized
+		 * changes.
+		 */
+		if (rbtxn_is_serialized(txn))
+			ReorderBufferStreamTXN(rb, toptxn);
+	}
+}
+
 /*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
+	Size	total_size = 0;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes we detected that the transaction
+	 * was aborted, so there is no point in collecting further changes for
+	 * it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		ReorderBufferFreeChange(rb, change);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -660,9 +773,28 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * Get the total size of the top-level transaction before accounting for
+	 * the current change, so that if this change is incomplete we know the
+	 * size prior to it.  That size is used to update the complete-changes
+	 * size of the top-level transaction for streaming.
+	 */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			total_size = txn->toptxn->total_size;
+		else
+			total_size = txn->total_size;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* Handle the incomplete tuple, if streaming is enabled */
+	if (ReorderBufferCanStream(rb))
+		ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert,
+										   total_size);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -692,7 +824,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1402,11 +1534,45 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 /*
  * Discard changes from a transaction (and subtransactions), after streaming
  * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ * If partial_truncate is false, we truncate the transaction completely;
+ * otherwise we truncate only up to last_complete_lsn.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 bool partial_truncate)
 {
 	dlist_mutable_iter iter;
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * A serialized transaction should never be partially truncated, because
+	 * once serialized we stream it as soon as its changes become complete.
+	 */
+	Assert(!(rbtxn_is_serialized(txn) && partial_truncate));
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1423,7 +1589,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, partial_truncate);
 	}
 
 	/* cleanup changes in the toplevel txn */
@@ -1433,30 +1599,19 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		change = dlist_container(ReorderBufferChange, node, iter.cur);
 
+		/* We have truncated up to the last complete LSN, so stop. */
+		if (partial_truncate && (change->lsn > toptxn->last_complete_lsn))
+		{
+			/* The transaction must have incomplete changes. */
+			Assert(rbtxn_has_incomplete_tuple(toptxn));
+			break;
+		}
+
 		/* remove the change from its containing list */
 		dlist_delete(&change->node);
-
 		ReorderBufferReturnChange(rb, change);
 	}
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The toplevel transaction, identified by (toptxn==NULL), is marked
-	 * as streamed always, even if it does not contain any changes (that
-	 * is, when all the changes are in subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts
-	 * for XIDs the downstream is not aware of. And of course, it always
-	 * knows about the toplevel xact (we send the XID in all messages),
-	 * but we never stream XIDs of empty subxacts.
-	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
 	 * any memory. We could also keep the hash table and update it with
@@ -1468,9 +1623,39 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
-	/* also reset the number of entries in the transaction */
-	txn->nentries_mem = 0;
-	txn->nentries = 0;
+	/*
+	 * Adjust nentries/nentries_mem based on the changes processed.  See
+	 * comments where nprocessed is declared.
+	 */
+	if (partial_truncate)
+	{
+		txn->nentries -= txn->nprocessed;
+		txn->nentries_mem -= txn->nprocessed;
+	}
+	else
+	{
+		txn->nentries = 0;
+		txn->nentries_mem = 0;
+	}
+	txn->nprocessed = 0;
+
+	/*
+	 * If this is a top-level transaction, we can reset last_complete_lsn
+	 * and complete_size, because by now we will have streamed all the
+	 * changes up to last_complete_lsn.
+	 */
+	if (partial_truncate && (txn->toptxn == NULL))
+	{
+		toptxn->last_complete_lsn = InvalidXLogRecPtr;
+		toptxn->complete_size = 0;
+	}
+
+	/* If this txn is serialized, clean up its disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
 }
 
 /*
@@ -1757,7 +1942,7 @@ ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
 								   ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed. */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Stop the stream. */
 	rb->stream_stop(rb, txn, last_lsn);
@@ -1789,6 +1974,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
 	ReorderBufferChange *volatile specinsert = NULL;
 	volatile bool	stream_started = false;
+	volatile bool	partial_truncate = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1847,7 +2034,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			/* Set the xid for concurrent abort check. */
 			if (streaming)
-				SetupCheckXidLive(change->txn->xid);
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
 
 			switch (change->action)
 			{
@@ -2104,6 +2294,27 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
 			}
+
+			if (streaming)
+			{
+				/*
+				 * Increment the nprocessed count.  See the detailed comment
+				 * about its usage in the ReorderBufferTXN structure.
+				 */
+				curtxn->nprocessed++;
+
+				/*
+				 * If the transaction contains an incomplete tuple and this is
+				 * the last complete change, stop further processing of the
+				 * transaction and set the partial-truncate flag.
+				 */
+				if (rbtxn_has_incomplete_tuple(txn) &&
+					prev_lsn == txn->last_complete_lsn)
+				{
+					partial_truncate = true;
+					break;
+				}
+			}
 		}
 
 		/*
@@ -2123,7 +2334,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * Done with current changes, call stream_stop callback for streaming
-		 * transaction, commit callback otherwise.  If we have sent
+		 * transaction, commit callback otherwise.  Only if we have sent
 		 * start/begin.
 		 */
 		if (stream_started)
@@ -2174,7 +2385,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, partial_truncate);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2234,6 +2445,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
+			curtxn->concurrent_abort = true;
 
 			/* Handle the concurrent abort. */
 			ReorderBufferHandleConcurrentAbort(rb, txn, snapshot_now,
@@ -2521,7 +2733,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2570,7 +2782,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2593,6 +2805,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2607,8 +2820,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
-	/* if subxact, and streaming supported, use the toplevel instead */
+	/* if streaming is supported, track the toplevel txn for size accounting */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2616,12 +2834,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2863,18 +3089,28 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+	Size	largest_size = 0;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
+		Size	size = 0;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/*
+		 * If this transaction have some incomplete changes then only consider
+		 * the size upto last complete lsn.
+		 */
+		if (rbtxn_has_incomplete_tuple(txn))
+			size = txn->complete_size;
+		else
+			size = txn->total_size;
+
+		/* If the current transaction is larger, remember it. */
+		if ((largest == NULL || size > largest_size) && size > 0)
+		{
 			largest = txn;
+			largest_size = size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2912,27 +3148,22 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		 * Pick the largest transaction (or subtransaction) and evict it from
 		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		if (ReorderBufferCanStream(rb))
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
 		{
-			/*
-			* Pick the largest toplevel transaction and evict it from memory by
-			* streaming the already decoded part.
-			*/
-			txn = ReorderBufferLargestTopTXN(rb);
-
 			/* we know there has to be one, because the size is not zero */
 			Assert(txn && !txn->toptxn);
-			Assert(txn->size > 0);
-			Assert(rb->size >= txn->size);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
 		{
 			/*
-			* Pick the largest transaction (or subtransaction) and evict it from
-			* memory by serializing it to disk.
-			*/
+			 * Pick the largest transaction (or subtransaction) and evict it from
+			 * memory by serializing it to disk.
+			 */
 			txn = ReorderBufferLargestTXN(rb);
 
 			/* we know there has to be one, because the size is not zero */
@@ -2941,14 +3172,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(rb->size >= txn->size);
 
 			ReorderBufferSerializeTXN(rb, txn);
-		}
 
-		/*
-		 * After eviction, the transaction should have no entries in memory,
-		 * and should use 0 bytes for changes.
-		 */
-		Assert(txn->size == 0);
-		Assert(txn->nentries_mem == 0);
+			/*
+			 * After eviction, the transaction should have no entries in memory, and
+			 * should use 0 bytes for changes.
+			 */
+			Assert(txn->size == 0);
+			Assert(txn->nentries_mem == 0);
+		}
 	}
 
 	/* We must be under the memory limit now. */
@@ -3330,10 +3561,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
-
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
 }
 
 /*
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 07df0cb0f6..e0116cd2d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -163,6 +163,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -182,6 +184,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -190,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -339,6 +357,26 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* Size of the complete changes. */
+	Size		complete_size;
+
+	/* LSN of the last complete change. */
+	XLogRecPtr	last_complete_lsn;
+
+	/*
+	 * Number of changes processed.  This is used to keep track of changes
+	 * that remain to be streamed.  As of now, this can happen either due to
+	 * toast tuples or speculative insertions, where we must wait for
+	 * multiple changes before we can send them.
+	 */
+	uint64		nprocessed;
+
+	/* If we have detected a concurrent abort, ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -526,7 +564,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool incomplete_data);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0

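The incomplete-change tracking above reduces to two sticky flag bits on the
top-level transaction: a toast insert sets one bit and the next insert or
update of the main table clears it; a speculative insert sets the other bit
and the speculative confirm clears it; the transaction counts as incomplete
while either bit is set. A stand-alone sketch of just those transitions,
with types and names simplified from the patch:

#include <stdbool.h>
#include <stdint.h>

#define TXN_HAS_TOAST_INSERT	0x01
#define TXN_HAS_SPEC_INSERT		0x02

typedef enum ChangeAction
{
	CHANGE_INSERT,
	CHANGE_UPDATE,
	CHANGE_DELETE,
	CHANGE_SPEC_INSERT,
	CHANGE_SPEC_CONFIRM
} ChangeAction;

/* A transaction has an incomplete tuple while either flag is set. */
static bool
txn_is_incomplete(uint32_t flags)
{
	return (flags & (TXN_HAS_TOAST_INSERT | TXN_HAS_SPEC_INSERT)) != 0;
}

/* Apply the flag transitions for one queued change. */
static uint32_t
track_change(uint32_t flags, ChangeAction action, bool toast_insert)
{
	/* A toast insert sets the bit; the next main-table change clears it. */
	if (toast_insert)
		flags |= TXN_HAS_TOAST_INSERT;
	else if ((flags & TXN_HAS_TOAST_INSERT) &&
			 (action == CHANGE_INSERT || action == CHANGE_UPDATE ||
			  action == CHANGE_SPEC_INSERT))
		flags &= ~TXN_HAS_TOAST_INSERT;

	/* A speculative insert sets the bit; the confirm clears it. */
	if (action == CHANGE_SPEC_INSERT)
		flags |= TXN_HAS_SPEC_INSERT;
	else if (action == CHANGE_SPEC_CONFIRM)
		flags &= ~TXN_HAS_SPEC_INSERT;

	return flags;
}
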
v30/v30-0001-Immediately-WAL-log-subtransaction-and-top-level.patch

From 582810d3f9b0f5ad345e8ca8faff02c2295292db Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v30 01/14] Immediately WAL-log subtransaction and top-level
 XID association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part of
the next WAL record (to minimize overhead), but only when wal_level=logical.
We cannot remove the existing XLOG_XACT_ASSIGNMENT record, as it is still
required to avoid overflow of the snapshot on a hot standby.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 ++++++++++-
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 44 +++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 905dc7d8d3..a93fb8a4f0 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5120,6 +5122,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6022,3 +6025,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL-log the top-level XID for
+ * an operation in a subtransaction.  We require that for logical decoding; see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09e..c526bb1928 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4f46..a757baccfc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1197,6 +1197,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1235,6 +1236,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..0c0c371739 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index db191879b9..aef8555367 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 347a38f57c..a5468c1037 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b0f2a6ed43..b976882229 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -304,6 +306,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0

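To make the record layout concrete: when an assignment is pending,
XLogRecordAssemble appends one extra header block, a single ID byte
(XLR_BLOCK_ID_TOPLEVEL_XID) followed by the four-byte top-level XID, and
DecodeXLogRecord reads it back out. A simplified sketch of writing and
parsing just that fragment (buffer management omitted; this is not the
backend's actual assembly code):

#include <stdint.h>
#include <string.h>

#define XLR_BLOCK_ID_TOPLEVEL_XID	252

typedef uint32_t TransactionId;	/* simplified stand-in for the real typedef */

/* Append the toplevel-XID header block; returns the advanced write pointer. */
static char *
append_toplevel_xid(char *scratch, TransactionId top_xid)
{
	*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
	memcpy(scratch, &top_xid, sizeof(TransactionId));
	return scratch + sizeof(TransactionId);
}

/* Parse one header block ID; fills *top_xid when the block carries it. */
static const char *
parse_header_block(const char *ptr, TransactionId *top_xid)
{
	uint8_t		block_id = (uint8_t) *(ptr++);

	if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
	{
		memcpy(top_xid, ptr, sizeof(TransactionId));
		ptr += sizeof(TransactionId);
	}
	return ptr;
}
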
v30/v30-0008-Enable-streaming-for-all-subscription-TAP-tests.patch

From 009a7b1b44106dd2846a3c1eb6733847d896a8be Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v30 08/14] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318fc7c..6f7bedc130 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f871a..4ba80869b9 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

v30/v30-0003-Extend-the-output-plugin-API-with-stream-methods.patch
From be2c57b588be7cf85bf924d95a4a4c5973187d4c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 17:26:31 +0200
Subject: [PATCH v30 03/14] Extend the output plugin API with stream methods

This adds seven methods to the output plugin API, adding support
for streaming changes of large in-progress transactions.

* stream_message
* stream_change
* stream_truncate
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
---
 contrib/test_decoding/test_decoding.c     | 100 ++++++
 doc/src/sgml/logicaldecoding.sgml         | 213 +++++++++++++
 src/backend/replication/logical/logical.c | 365 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 811 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..64f651fa72 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr apply_lsn);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
 }
 
 
@@ -540,3 +569,74 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+					 transactional, prefix, sz);
+	appendBinaryStringInfo(ctx->out, message, sz);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr apply_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93cf6b..50cfd6fa47 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,112 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +869,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    Blocks of changes for several streamed transactions may be interleaved,
+    and some of the transactions may end up aborted instead of committed.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting.  At that point, the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may exceed the memory limit before a
+    complete tuple has been decoded (e.g. we have decoded the TOAST table
+    insert, but not yet the corresponding main table insert).
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be3b0..26d461effb 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn);
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, the stream change/commit/abort/start/stop
+	 * callbacks are required, while message and truncate are optional,
+	 * similar to regular output plugins.  We enable streaming when at
+	 * least one stream method is set, to easily detect missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * The stream_message and stream_truncate callbacks are optional,
+	 * so we do not fail with ERROR when they are missing; the wrappers
+	 * simply do nothing.  We must still set the ReorderBuffer callbacks
+	 * to something, otherwise the calls from there would crash (we
+	 * don't want to move the checks there).
+	 */
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,321 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_change_cb callback.")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_abort_cb callback.")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = apply_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = apply_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_commit_cb callback.")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, apply_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_start_cb callback.")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errmsg("Output plugin supports streaming, but has not registered "
+						"stream_stop_cb callback.")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..0d0a94a648 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 int nrelations,
+										 Relation relations[],
+										 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 74ffe7852f..f80e05edb2 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  int nrelations,
+											  Relation relations[],
+											  ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
 struct ReorderBuffer
 {
 	/*
@@ -386,6 +434,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamTruncateCB stream_truncate;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0

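A minimal example may help make the new callback API concrete. The
sketch below wires up only the five required stream callbacks, modeled
on the test_decoding changes above. It is not part of the patch series:
the my_* names are invented for illustration, and the regular
begin/change/commit callbacks are omitted for brevity (a real plugin
must provide those as well).

#include "postgres.h"

#include "fmgr.h"
#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

/* emit one output row when a block of streamed changes is opened */
static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "open block for xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* ... and another one when the block is closed again */
static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "close block for xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* a change inside a streamed block; the transaction may yet abort */
static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "streamed change for xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* discard whatever was streamed for this (sub)transaction */
static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "abort streamed xact %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* make the previously streamed changes final */
static void
my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 XLogRecPtr commit_lsn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "commit streamed xact %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* the regular begin_cb/change_cb/commit_cb setup is omitted here */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_change_cb = my_stream_change;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	/* stream_message_cb and stream_truncate_cb are optional; left NULL */
}
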
v30/v30-0011-Add-streaming-option-in-pg_dump.patch
From f4245560c1f039df119a0e21d9ccd93dc534e98a Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v30 11/14] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index a41a3db876..d0fb24e5f8 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBufferStr(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb3a9..af64270c55 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0

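One caveat in getSubscriptions() above: the query selects s.substream
unconditionally, and that column only exists on servers that already
have the subscription streaming patch applied. Below is a hedged sketch
of how the column could be gated on server version; the 140000 cutoff
is an assumption for illustration, not something this series
establishes.

/*
 * Sketch only: select substream where the server can have it, and
 * substitute false otherwise, so older servers keep dumping cleanly.
 */
appendPQExpBuffer(query,
				  "SELECT s.tableoid, s.oid, s.subname,"
				  "(%s s.subowner) AS rolname, "
				  " %s, s.subconninfo, s.subslotname, "
				  " s.subsynccommit, s.subpublications "
				  "FROM pg_subscription s "
				  "WHERE s.subdbid = (SELECT oid FROM pg_database"
				  "                   WHERE datname = current_database())",
				  username_subquery,
				  (fout->remoteVersion >= 140000) ?
				  "s.substream" : "false AS substream");
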
v30/v30-0012-Change-buffile-interface-required-for-streaming-.patch
From 1a1b32465d806951b8fde55ee769703b20824e1f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 16:40:25 +0530
Subject: [PATCH v30 12/14] Change buffile interface required for streaming
 transaction

Implement BufFileTruncate and SEEK_END support in BufFileSeek.  Also add
an option to provide a mode while opening shared BufFiles, instead of
always opening them in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 81 ++++++++++++++++++++---
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 21 ++++--
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  3 +-
 10 files changed, 103 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a55ccc0c03..a9fbe41f8e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349b69..bde6fa1ef3 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,12 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.  The
+ * BufFile infrastructure can also be used by a single backend when the files
+ * need to survive across transactions and must be opened and closed multiple
+ * times.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +279,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +303,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +323,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY);
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -666,11 +668,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the file size of the last file to get the last offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +851,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the given BufFile up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over the files, from the last one down to the fileno to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files beyond the fileno can simply be deleted.  The fileno file
+		 * itself can also be deleted if the offset is 0, unless it is the
+		 * very first file.
+		 */
+		if ((i != fileno || offset == 0) && fileno != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 7dc6dd2f15..060811ca78 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1741,18 +1741,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index f7206c9175..0907f796e3 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -34,16 +34,22 @@ static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name)
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
  *
  * Under the covers the set is one or more directories which will eventually
  * be deleted when there are no backends attached.
+ *
+ * This interface can also be used when the temporary files are used by only
+ * one backend, but need to be opened and closed multiple times and the
+ * underlying files must survive across transactions.  In such cases, pass
+ * NULL for the dsm segment, so that the files are instead deleted on
+ * process exit.
  */
 void
 SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
@@ -68,7 +74,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
 }
 
 /*
@@ -131,13 +138,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59c50..788815cdab 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a4303b..b83fb50dac 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6352ff945a..0dfbac46b4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752bab0d..fc34c49522 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d7df..e209f047e8 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf077e5..b2f4ba4bd8 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,7 +37,8 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
-- 
2.23.0
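
To make the intended use of the reworked fd/buffile API above easier to
review, here is a minimal sketch (not part of the patch) of how a single
backend could keep a fileset alive across transactions, reopen a file in
read-write mode, and truncate it.  The fileset variable, file name, and
call sequence are illustrative assumptions based only on the signatures
in the diff.

#include <fcntl.h>

#include "storage/buffile.h"
#include "storage/sharedfileset.h"

static SharedFileSet fileset;

static void
buffile_reuse_sketch(void)
{
	BufFile    *file;

	/* NULL dsm segment: files survive transactions, cleaned up at proc exit */
	SharedFileSetInit(&fileset, NULL);

	/* create the file once... */
	file = BufFileCreateShared(&fileset, "xid-1234-changes");
	BufFileClose(file);

	/* ...reopen it later, possibly in another transaction, for read-write */
	file = BufFileOpenShared(&fileset, "xid-1234-changes", O_RDWR);

	/* discard everything from the start of the first physical file */
	BufFileTruncateShared(file, 0, 0);
	BufFileClose(file);
}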

v30/v30-0010-Provide-new-api-to-get-the-streaming-changes.patch

From 31161b0c1f109910ebf2a14f25b2679b5c7167f2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v30 10/14] Provide new api to get the streaming changes

---
 .gitignore                                    |  1 +
 doc/src/sgml/test-decoding.sgml               | 22 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  8 +++++++
 .../replication/logical/logicalfuncs.c        | 23 +++++++++++++++----
 src/include/catalog/pg_proc.dat               |  9 ++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..eed6e9d134 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5314e9348f..98d3ad0458 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e848..70c28ffa91 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes then disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 61f2c2f5b4..1b77a0f83d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10115,6 +10115,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
2.23.0
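
For review, a short sketch (not from the patch) of how the new streaming
flag is meant to interact with the output plugin's capability.  The plugin
advertises stream support by setting ctx->streaming in its startup
callback; pg_logical_slot_get_changes_guts() then ANDs in whether the SQL
caller asked for streaming, so only pg_logical_slot_get_streaming_changes()
can actually trigger the stream_* callbacks.  The my_startup_cb name is
hypothetical.

#include "replication/logical.h"
#include "replication/output_plugin.h"

/* Hypothetical output plugin startup callback advertising stream support. */
static void
my_startup_cb(struct LogicalDecodingContext *ctx,
			  OutputPluginOptions *opt, bool is_init)
{
	/* the plugin implements the stream_start/stream_change/... callbacks */
	ctx->streaming = true;
}

/*
 * Later, in pg_logical_slot_get_changes_guts(..., streaming), the caller's
 * request is ANDed into the context, so streaming happens only when both
 * the plugin and the SQL function agree:
 *
 *		ctx->streaming &= streaming;
 */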

v30/v30-0005-Implement-streaming-mode-in-ReorderBuffer.patch

From a6b6837b9c40d562ad60b1a41e1a599439c005c3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Wed, 17 Jun 2020 18:20:30 +0530
Subject: [PATCH v30 05/14] Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we
have in memory and invoke the new stream API methods. This happens in
ReorderBufferStreamTXN() using roughly the same logic as in
ReorderBufferCommit().  However, if we have an incomplete toast tuple
or a speculative insert, we spill to disk because we cannot generate
the complete tuple to stream.  As soon as we get the complete tuple,
we stream the transaction including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

It also adds ReorderBufferTXN pointer to two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets discarded).
---
 src/backend/access/heap/heapam_visibility.c   |  38 +-
 .../replication/logical/reorderbuffer.c       | 763 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  26 +
 3 files changed, 751 insertions(+), 76 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..160b167adb 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the
+		 * tuple yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1657,23 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means
+		 * we have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions.
+		 * In regular logical decoding we only execute this code at commit
+		 * time, at which point we should have seen all relevant combocids.
+		 * So we should error out in this case.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4b277fe6f9..3349d26447 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +382,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -767,6 +781,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the toplevel transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -1022,6 +1068,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1036,6 +1085,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1313,6 +1365,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1338,6 +1399,80 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+	 * any memory. We could also keep the hash table and update it with
+	 * new ctid values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1489,57 +1624,171 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has catalog updates, we might decode tuples using the
+ * wrong catalog version.  So, to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction the current change
+ * belongs to.  During catalog scans we can then check the status of that xid,
+ * and if it has aborted we report a specific error so that we can stop
+ * streaming the current transaction and discard the changes streamed so far.
+ * We might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine: when we decode the abort, we stream an
+ * abort message to truncate the changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid has aborted; that will happen during catalog access.
+	 * Also, reset the bsysscan flag.
+	 */
+	if (!TransactionIdDidCommit(xid))
+	{
+		CheckXidAlive = xid;
+		bsysscan = false;
 	}
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
-	snapshot_now = txn->base_snapshot;
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse them while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.
+ */
+static void
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   Snapshot snapshot_now,
+								   CommandId command_id,
+								   XLogRecPtr last_lsn,
+								   ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+	ReorderBufferToastReset(rb, txn);
+	if (specinsert != NULL)
+		ReorderBufferReturnChange(rb, specinsert);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin.  If streaming is true then data will be sent using the
+ * stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr	prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool	stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1562,21 +1811,44 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
-
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * Start stream or begin transaction for the first change in the
+			 * current stream.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+					rb->stream_start(rb, txn, change->lsn);
+				else
+					rb->begin(rb, txn);
+				stream_started = true;
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1653,7 +1925,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1693,7 +1966,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1751,7 +2024,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1760,10 +2036,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1794,7 +2067,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1849,14 +2121,34 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes.  If we have sent a start/begin,
+		 * call the stream_stop callback for a streaming transaction, or the
+		 * commit callback otherwise.
+		 */
+		if (stream_started)
+		{
+			if (streaming)
+				rb->stream_stop(rb, txn, prev_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+			stream_started = false;
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot if the transaction is being
+		 * streamed; otherwise free the snapshot if we have copied it.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1874,14 +2166,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as streamed
+		 * (if they contained changes). Otherwise, remove all the changes and
+		 * deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1900,17 +2205,122 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/* Reset the CheckXidAlive */
+		if (streaming)
+			CheckXidAlive = InvalidTransactionId;
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * If the error code is ERRCODE_TRANSACTION_ROLLBACK, that means we
+		 * have detected a concurrent abort of the (sub)transaction we are
+		 * streaming.  So just do the cleanup and return gracefully.
+		 * Otherwise, re-throw the error.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * We can get this error only in streaming mode, because only in
+			 * streaming mode do we send in-progress transactions.
+			 */
+			Assert(streaming);
 
-		PG_RE_THROW();
+			/*
+			 * In the TRY block we only stop the stream after we have sent
+			 * all the changes.  So if we have detected a concurrent abort,
+			 * the stream should not have been stopped yet.
+			 */
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+
+			/* Handle the concurrent abort. */
+			ReorderBufferHandleConcurrentAbort(rb, txn, snapshot_now,
+											   command_id, prev_lsn,
+											   specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
 	}
 	PG_END_TRY();
 }
 
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.  We iterate over the top and
+ * subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot snapshot_now;
+	CommandId command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
+	}
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -1935,6 +2345,24 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * If this is a streaming transaction then we might have decoded some
+		 * changes for it already.  So execute all the invalidation messages
+		 * to clear any cache pollution.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2004,6 +2432,13 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/*
+	 * If the (sub)transaction was streamed, notify the remote node about the
+	 * abort.
+	 */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2139,8 +2574,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2148,6 +2592,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2159,19 +2604,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2200,6 +2654,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2391,6 +2846,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because with streaming we don't
+ * update the memory accounting for subtransactions, so it's always 0).
+ * But we can simply
+ * iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction  at-a-time to evict and spill its changes to
@@ -2423,11 +2910,38 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStream(rb))
+		{
+			/*
+			 * Pick the largest toplevel transaction and evict it from memory
+			 * by streaming the already decoded part.
+			 */
+			txn = ReorderBufferLargestTopTXN(rb);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2725,6 +3239,103 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ *
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which attempts to
+ * stream it again before the commit)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot snapshot_now;
+	CommandId command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * XXX Not sure if we can make any assumptions about base snapshot here,
+	 * similarly to what ReorderBufferCommit() does. That relies on
+	 * base_snapshot getting transferred from subxact in
+	 * ReorderBufferCommitChild(), but that was not yet called as the
+	 * transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because after the last
+		 * streaming run we might have gotten some new subtransactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/*
+	 * Call the main routine to decode the changes and send them to the
+	 * output plugin.
+	 */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3824,6 +4435,16 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases we assume the CID is from the future command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index f80e05edb2..07df0cb0f6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +182,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -248,6 +267,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0
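
To summarize the eviction logic this patch adds, here is a condensed
restatement of what ReorderBufferCheckMemoryLimit() now does (the loop
guard is paraphrased, as it is not shown in the hunk above): once
logical_decoding_work_mem is exceeded, the largest toplevel transaction
is streamed if the plugin supports it, otherwise the largest
(sub)transaction is serialized to disk.

/* Condensed sketch of the eviction decision, guard condition paraphrased. */
while (rb->size >= logical_decoding_work_mem * 1024L)
{
	ReorderBufferTXN *txn;

	if (ReorderBufferCanStream(rb))
	{
		/* stream the largest toplevel xact via the stream_* callbacks */
		txn = ReorderBufferLargestTopTXN(rb);
		ReorderBufferStreamTXN(rb, txn);
	}
	else
	{
		/* no stream support: spill the largest (sub)xact to disk */
		txn = ReorderBufferLargestTXN(rb);
		ReorderBufferSerializeTXN(rb, txn);
	}
}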

v30/v30-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch

From 7c5425d49e0db8f13d0357736314d7fea685e428 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 9 Apr 2020 10:55:19 +0530
Subject: [PATCH v30 04/14] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions, this may cause failures when the
output plugin consults catalogs (both system and user-defined).

We handle such failures by returning the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from the system table scan APIs to the backend decoding a
specific uncommitted transaction. On receipt of such an sqlerrcode, the
decoding logic aborts the ongoing decoding and returns gracefully.
---
 doc/src/sgml/logicaldecoding.sgml  |  9 +++--
 src/backend/access/heap/heapam.c   | 10 ++++++
 src/backend/access/index/genam.c   | 53 ++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c |  8 +++++
 src/backend/utils/time/snapmgr.c   | 13 +++++++
 src/include/access/tableam.h       | 55 ++++++++++++++++++++++++++++++
 src/include/utils/snapmgr.h        |  2 ++
 7 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 50cfd6fa47..ab689f8d19 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 537913d1bb..287a185d9c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments at snapmgr.c
+	 * where these variables are declared.  Normally we have such a check at
+	 * where these variables are declared.  Normally we have such a check at
+	 * the tableam API level, but this is called from many places, so we need to
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..446b8cbc86 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that a system table
+	 * scan is in progress.  See detailed comments at snapmgr.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted.  We can't directly use
+ * TransactionIdDidAbort, as after a crash such a transaction might not have
+ * been marked as aborted.  See detailed comments at snapmgr.c where the
+ * variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments at snapmgr.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..2f52b407c6 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -230,6 +230,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..9f1ecd123f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,6 +153,19 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such a transaction gets aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure this,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool bsysscan = false;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index eb18739c36..2b7d3df617 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/snapshot.h"
 
 
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments at
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1015,6 +1025,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1054,6 +1071,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1710,6 +1735,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1727,6 +1760,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1745,6 +1786,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1761,6 +1809,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * snapmgr.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..5af6df698b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -145,6 +145,8 @@ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
+extern bool	bsysscan;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
 extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
-- 
2.23.0

#394Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#392)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jun 26, 2020 at 11:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 25, 2020 at 7:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Review comments on various patches.

poc_shared_fileset_cleanup_on_procexit
=================================
1.
-	ent->subxact_fileset =
-		MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+	MemoryContext oldctx;
+
+	/* Shared fileset handle must be allocated in the persistent context */
+	oldctx = MemoryContextSwitchTo(ApplyContext);
+	ent->subxact_fileset = palloc(sizeof(SharedFileSet));
 	SharedFileSetInit(ent->subxact_fileset, NULL);
+	MemoryContextSwitchTo(oldctx);
 	fd = BufFileCreateShared(ent->subxact_fileset, path);

Why is this change required for this patch, and why do we only cover
SharedFileSetInit in the Apply context and not BufFileCreateShared?
The comment is also not very clear on this point.

Added the comments for the same.

1.
+	/*
+	 * Shared fileset handle must be allocated in the persistent context.
+	 * Also, SharedFileSetInit allocates the memory for the shared fileset
+	 * list so we need to allocate that in the long-term memory context.
+	 */

How about "We need to maintain shared fileset across multiple stream
open/close calls. So, we allocate it in a persistent context."

Done

2.
+ /*
+ * If the caller is following the dsm based cleanup then we don't
+ * maintain the filesetlist so return.
+ */
+ if (filesetlist == NULL)
+ return;

The check here should use 'NIL' instead of 'NULL'

Done

Other than that, the changes in this particular patch look good to me.

Added as the last patch in the series; in the next version I will merge
this into 0012 and 0013.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#395Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#378)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yes, I have made the changes. Basically, now I am only using
XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
So whenever we get a new set of XLOG_XACT_INVALIDATIONS, we directly
append them to txn->invalidations. I have tested the XLOG_INVALIDATIONS
part, but while sending this mail I realized that we could write some
automated tests for the same.

Can you share how you have tested it?

I will work on
that soon.

Cool, I think having a regression test for this will be a good idea.

Other than the above tests, can we somehow verify that the invalidations
generated at commit time are the same as what we do with this patch?
We have verified with individual commands, but it would be great if we
could verify it for the regression tests.

I have verified this using a few random test cases. For verifying this,
I have made some temporary code changes with an assert, as shown below.
Basically, in DecodeCommit we call the ReorderBufferAddInvalidations
function only for assert checking.

 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 							  XLogRecPtr lsn, Size nmsgs,
-							  SharedInvalidationMessage *msgs)
+							  SharedInvalidationMessage *msgs, bool commit)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	if (commit)
+	{
+		Assert(txn->ninvalidations == nmsgs);
+		return;
+	}

The result is that for a normal local test it works fine. But with the
regression suite, it hits the assert in many places, because if a
rollback of a subtransaction is involved, the invalidation messages from
it are not logged at commit time, whereas with command-time invalidation
they are logged.

As of now, I have only put an assert on the count. If we need to verify
the exact messages, then we might need to somehow categorize the
invalidation messages, because the ordering of the messages will not be
the same. For testing this we will have to arrange them by category,
i.e. relcache, catcache, and then we can compare them.
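
To sketch that category idea (not part of the patch): SharedInvalidationMessage
is a union whose members all begin with an int8 id, so the messages could be
bucketed by id and the per-bucket counts compared, which sidesteps the
ordering problem:

/*
 * Sketch only: compare two invalidation-message arrays ignoring order,
 * by bucketing on the message id (catcache ids are >= 0; relcache,
 * snapshot, etc. use negative ids).  This checks per-category counts
 * only, not the full message payloads.
 */
static bool
InvalMessagesMatchUnordered(const SharedInvalidationMessage *a, int na,
							const SharedInvalidationMessage *b, int nb)
{
	int			counts[256] = {0};

	if (na != nb)
		return false;

	for (int i = 0; i < na; i++)
		counts[(uint8) a[i].id]++;
	for (int i = 0; i < nb; i++)
		counts[(uint8) b[i].id]--;

	for (int i = 0; i < 256; i++)
	{
		if (counts[i] != 0)
			return false;
	}

	return true;
}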

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#396Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#395)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Other than the above tests, can we somehow verify that the invalidations
generated at commit time are the same as what we do with this patch?
We have verified with individual commands, but it would be great if we
could verify it for the regression tests.

I have verified this using a few random test cases. For verifying this,
I have made some temporary code changes with an assert, as shown below.
Basically, in DecodeCommit we call the ReorderBufferAddInvalidations
function only for assert checking.

 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 							  XLogRecPtr lsn, Size nmsgs,
-							  SharedInvalidationMessage *msgs)
+							  SharedInvalidationMessage *msgs, bool commit)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	if (commit)
+	{
+		Assert(txn->ninvalidations == nmsgs);
+		return;
+	}

The result is that for a normal local test it works fine. But with the
regression suite, it hits the assert in many places, because if a
rollback of a subtransaction is involved, the invalidation messages from
it are not logged at commit time, whereas with command-time invalidation
they are logged.

Yeah, somehow we need to ignore the rollback-to-savepoint tests and
verify the others.

As of now, I have only put an assert on the count. If we need to verify
the exact messages, then we might need to somehow categorize the
invalidation messages, because the ordering of the messages will not be
the same. For testing this we will have to arrange them by category,
i.e. relcache, catcache, and then we can compare them.

Can't we do this by verifying that each message at commit time exists
in the list of invalidation messages we have collected via processing
XLOG_XACT_INVALIDATIONS?
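
For example, an order-independent membership check could look something
like this (a hypothetical assert-only helper, not in the patch; it
compares the messages bytewise, which ignores any padding in the union):

/*
 * Hypothetical assert-only helper: verify that every message from the
 * commit record is present among the invalidations accumulated from
 * the XLOG_XACT_INVALIDATIONS records.
 */
static void
AssertCommitInvalsSubset(ReorderBufferTXN *txn,
						 int nmsgs, SharedInvalidationMessage *msgs)
{
	for (int i = 0; i < nmsgs; i++)
	{
		bool		found = false;

		for (int j = 0; j < txn->ninvalidations; j++)
		{
			if (memcmp(&msgs[i], &txn->invalidations[j],
					   sizeof(SharedInvalidationMessage)) == 0)
			{
				found = true;
				break;
			}
		}

		Assert(found);
	}
}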

One additional question on patch
v30-0003-Extend-the-output-plugin-API-with-stream-methods:
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
 {
 ..
 ..
+	state.report_location = apply_lsn;
 ..
 ..
+	ctx->write_location = apply_lsn;
 ..
 }

Can't we name the last parameter as 'commit_lsn' as that is how
documentation in the patch spells it and it sounds more appropriate?
Also, is there a reason for assigning report_location and
write_location differently than what we do in commit_cb_wrapper?
Basically, assign those as txn->final_lsn and txn->end_lsn
respectively.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#397Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#396)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Other than the above tests, can we somehow verify that the invalidations
generated at commit time are the same as what we do with this patch?
We have verified with individual commands, but it would be great if we
could verify it for the regression tests.

I have verified this using a few random test cases. For verifying this,
I have made some temporary code changes with an assert, as shown below.
Basically, in DecodeCommit we call the ReorderBufferAddInvalidations
function only for assert checking.

 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 							  XLogRecPtr lsn, Size nmsgs,
-							  SharedInvalidationMessage *msgs)
+							  SharedInvalidationMessage *msgs, bool commit)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	if (commit)
+	{
+		Assert(txn->ninvalidations == nmsgs);
+		return;
+	}

The result is that for a normal local test it works fine. But with the
regression suite, it hits the assert in many places, because if a
rollback of a subtransaction is involved, the invalidation messages from
it are not logged at commit time, whereas with command-time invalidation
they are logged.

Yeah, somehow we need to ignore the rollback-to-savepoint tests and
verify the others.

Yeah, I have run the regression suite and I can see a lot of failures;
maybe we can somehow see the diff and confirm that all the failures are
due to rollback to savepoint only. I will work on this.

As of now, I have only put an assert on the count. If we need to verify
the exact messages, then we might need to somehow categorize the
invalidation messages, because the ordering of the messages will not be
the same. For testing this we will have to arrange them by category,
i.e. relcache, catcache, and then we can compare them.

Can't we do this by verifying that each message at commit time exists
in the list of invalidation messages we have collected via processing
XLOG_XACT_INVALIDATIONS?

Let me figure out the easiest way to test this.

One additional question on patch
v30-0003-Extend-the-output-plugin-API-with-stream-methods:
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr apply_lsn)
 {
 ..
 ..
+	state.report_location = apply_lsn;
 ..
 ..
+	ctx->write_location = apply_lsn;
 ..
 }

Can't we name the last parameter as 'commit_lsn' as that is how
documentation in the patch spells it and it sounds more appropriate?

You are right, commit_lsn seems more appropriate here.

Also, is there a reason for assigning report_location and
write_location differently than what we do in commit_cb_wrapper?
Basically, assign those as txn->final_lsn and txn->end_lsn
respectively.

Yes, I think it should be handled in the same way as commit_cb_wrapper,
because before calling ReorderBufferStreamCommit in ReorderBufferCommit,
we properly update the final_lsn as well as the end_lsn.
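
So presumably the wrapper will end up mirroring commit_cb_wrapper,
roughly like this (just a sketch, with the parameter renamed to
commit_lsn as discussed above):

	/* report the begin and end of the commit record, as in commit_cb_wrapper */
	state.report_location = txn->final_lsn;
	...
	ctx->write_location = txn->end_lsn;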

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#398Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#397)
4 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Can't we name the last parameter as 'commit_lsn' as that is how
documentation in the patch spells it and it sounds more appropriate?

You are right, commit_lsn seems more appropriate here.

Also, is there a reason for assigning report_location and
write_location differently than what we do in commit_cb_wrapper?
Basically, assign those as txn->final_lsn and txn->end_lsn
respectively.

Yes, I think it should be handled in the same way as commit_cb_wrapper,
because before calling ReorderBufferStreamCommit in ReorderBufferCommit,
we properly update the final_lsn as well as the end_lsn.

Okay, I have made these changes in the attached patch, and there are a
few more changes in
0003-Extend-the-output-plugin-API-with-stream-methods.
1. In pg_decode_stream_message, for transactional messages we were
displaying the message contents, which is inconsistent with the other
streaming APIs. I have changed it so that the streaming API doesn't
display message contents for transactional messages.
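
The resulting test_decoding callback looks roughly like this (sketched
from memory; see the attached patch for the exact output format):

static void
pg_decode_stream_message(LogicalDecodingContext *ctx,
						 ReorderBufferTXN *txn, XLogRecPtr lsn,
						 bool transactional, const char *prefix,
						 Size sz, const char *content)
{
	OutputPluginPrepareWrite(ctx, true);

	if (transactional)
	{
		/* for transactional messages, print only the metadata */
		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
						 transactional, prefix, sz);
	}
	else
	{
		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:%s",
						 transactional, prefix, sz, content);
	}

	OutputPluginWrite(ctx, true);
}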

2.
+ /* in streaming mode, stream_change_cb is required */
+ if (ctx->callbacks.stream_change_cb == NULL)
+ ereport(ERROR,
+ (errmsg("Output plugin supports streaming, but has not registered "
+ "stream_change_cb callback.")));

The error messages seem a bit weird: (a) they don't include an error
code, and (b) they are not in PG style. I have changed all the error
messages to fix these two issues, and reworded them as well.
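
For reference, the adjusted checks follow the usual style, something like
this (a sketch; see the patch for the exact errcode and wording):

	/* in streaming mode, stream_change_cb is required */
	if (ctx->callbacks.stream_change_cb == NULL)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("logical streaming requires a stream_change_cb callback")));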

3. Rearranged the stream_* functions so that the optional ones are at
the end, and also arranged the other functions in a way that looks more
logical to me.

4. Updated comments, commit message, and edited docs in the patch.

I have made a few changes in
0004-Gracefully-handle-concurrent-aborts-of-transacti as well.
1. The variable bsysscan was not being reset in case of error. I have
introduced a new function to reset both bsysscan and CheckXidAlive
during transaction abort. Also, snapmgr.c doesn't seem the right place
for these variables, so I moved them to xact.c. I think this will make
the initialization of CheckXidAlive in the CATCH block in
ReorderBufferProcessTXN redundant.

2. Updated comments and commit message.

Let me know what you think about the above changes.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v31-0001-Immediately-WAL-log-subtransaction-and-top-level.patch
From b6df6bb72e7e7daafb364f338a7044a2f22c79cc Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v31 1/4] Immediately WAL-log subtransaction and top-level XID
 association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we now also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead), but only when
wal_level=logical. We cannot remove the existing XLOG_XACT_ASSIGNMENT
WAL record, as that is required to avoid subxid overflow in hot standby
snapshots.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 44 ++++++++++++++--------------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 905dc7d..a93fb8a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5120,6 +5122,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6022,3 +6025,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL-log the top-level XID of
+ * an operation in a subtransaction.  We require that for logical decoding; see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have an XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f..c526bb1 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* if the XID was included in this record, mark the subxact as assigned */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4..a757bac 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1197,6 +1197,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1235,6 +1236,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..0c0c371 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index db19187..aef8555 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77ac4e7..8058eef 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b0f2a6e..b976882 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -304,6 +306,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

v31-0004-Gracefully-handle-concurrent-aborts-of-transacti.patch
From 42e2cc8a1cdc92d0bca9f83fabc511670bee6bad Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:49:40 +0530
Subject: [PATCH v31 4/4] Gracefully handle concurrent aborts of transactions
 being decoded.

When decoding committed transactions, concurrent aborts are not an
issue, and we never decode transactions that abort before the decoding
starts.

But for an upcoming patch that allows decoding of in-progress
transactions, this may cause failures when the output plugin consults
catalogs (both system and user-defined).

We handle such failures by returning the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from the system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. On receipt of such a
sqlerrcode, the decoding logic aborts the ongoing decoding and returns
gracefully.

Author: Dilip Kumar, Nikhil Sontakke, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/logicaldecoding.sgml         |  9 +++--
 src/backend/access/heap/heapam.c          | 10 ++++++
 src/backend/access/index/genam.c          | 53 +++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c        |  8 +++++
 src/backend/access/transam/xact.c         | 19 +++++++++++
 src/backend/replication/logical/logical.c | 10 ++++++
 src/include/access/tableam.h              | 55 +++++++++++++++++++++++++++++++
 src/include/access/xact.h                 |  4 +++
 src/include/replication/logical.h         |  1 +
 9 files changed, 166 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 18116c8..98b47b0 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 537913d..0022b31 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at the
+	 * tableam API level, but this is called from many places so we need to
+	 * ensure it here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that a system
+	 * table scan is in progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle a concurrent abort of CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has been aborted. We can't directly use
+ * TransactionIdDidAbort, because after a crash such a transaction might
+ * not have been marked as aborted.  See detailed comments in xact.c where
+ * the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29..a61e279 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -235,6 +235,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d93b40f..c7f1877 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To detect that,
+ * we check whether CheckXidAlive has been aborted after fetching each tuple
+ * from system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for concurrent
+ * aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2670,6 +2683,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4972,6 +4988,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6ee59bd..8deff89 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b3d2a6d..acb6c38 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1712,6 +1737,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1729,6 +1762,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1747,6 +1788,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1763,6 +1811,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index ac3f5e3..5f767eb 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
-- 
1.8.3.1

v31-0002-WAL-Log-invalidations-at-command-end-with-wal_le.patch
From bcfa4182f128a54db15117aa79483905e9b2f69a Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v31 2/4] WAL Log invalidations at command end with
 wal_level=logical.

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end use a new xlog record type,
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in the top-level
transaction, and then executed during replay.  This obviates the need to
decode the invalidations as part of the commit record.

LogStandbyInvalidations accumulated all the invalidations in memory and
wrote them only once, at commit time, which reduced the performance
impact by amortizing the overhead and deduplicating the invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 10 +++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 52 ++++++++++++++++++----
 src/backend/utils/cache/inval.c                 | 56 ++++++++++++++++++++++++
 src/include/access/xact.h                       | 13 +++++-
 src/include/replication/reorderbuffer.h         |  3 ++
 7 files changed, 166 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..68aa994 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a93fb8a..d93b40f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6022,6 +6022,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..7153eba 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions;
+				 * otherwise, accumulate them so that they can be processed at
+				 * commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 642a1c7..4b277fe 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -860,6 +860,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2205,7 +2208,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in cases where we skip processing the
+ * transaction (see ReorderBufferForget), we need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2216,17 +2223,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2254,6 +2279,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * children has, so that ReorderBufferBuildTupleCidHash can conveniently
+	 * check just the top-level transaction and decide whether to build the
+	 * hash table or not.
+	 * build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..7d4fd9f 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *	CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +214,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1094,6 +1100,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL-log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1512,48 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..ac3f5e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..74ffe78 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
-- 
1.8.3.1
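
As an aside, to make the consumer side of these records concrete, here
is a minimal sketch (not part of the patch; the function name is
invented) of handing an XLOG_XACT_INVALIDATIONS record to the reorder
buffer, relying on the xl_xact_invalidations layout above and the
existing ReorderBufferAddInvalidations entry point:

#include "postgres.h"

#include "access/xact.h"
#include "access/xlogreader.h"
#include "replication/reorderbuffer.h"

/*
 * Accumulate the invalidation messages from an XLOG_XACT_INVALIDATIONS
 * record in the transaction, so they can be executed while replaying
 * (or streaming) its changes.
 */
static void
DecodeInvalidationsSketch(ReorderBuffer *rb, XLogReaderState *r,
						  TransactionId xid, XLogRecPtr lsn)
{
	xl_xact_invalidations *inval = (xl_xact_invalidations *) XLogRecGetData(r);

	ReorderBufferAddInvalidations(rb, xid, lsn, inval->nmsgs, inval->msgs);
}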

v31-0003-Extend-the-logical-decoding-output-plugin-API-wi.patch
From 77b7c314eae0082c734d69f2288587bc33e5f773 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v31 3/4] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, adding support for
streaming changes for large transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk of
changes streamed for a particular toplevel transaction.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 123 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 +++++++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  59 +++++
 6 files changed, 825 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..4a71826 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,97 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93c..18116c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting.  At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may cross the memory limit before having
+    decoded a complete tuple, e.g. having decoded only the toast table insert
+    but not the main table insert yet.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be..6ee59bd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. However, we enable streaming when at least one
+	 * of the methods is defined, so that we can easily identify missing
+	 * methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475..deef318 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..2d9aa11 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if the transaction is streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when done streaming a block of changes from an in-progress
+ * transaction to a remote node (may be called repeatedly, if the
+ * transaction is streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 74ffe78..9d60ed8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -387,6 +435,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
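
To sanity-check the API shape described in the docs above, here is a
minimal sketch of an output plugin wiring up the five required stream
callbacks (the my_stream_* names are invented; the regular
begin/change/commit callbacks, which a real plugin must also provide,
are omitted for brevity):

#include "postgres.h"

#include "fmgr.h"
#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

void		_PG_output_plugin_init(OutputPluginCallbacks *cb);

static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "stream start of xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfoString(ctx->out, "stream stop");
	OutputPluginWrite(ctx, true);
}

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	/* the downstream should discard what it buffered for this (sub)xact */
}

static void
my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 XLogRecPtr commit_lsn)
{
	/* the downstream can now apply/persist the buffered changes */
}

static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)
{
	/* buffer the change; the transaction may still abort later */
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* setting all five required callbacks enables streaming */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;
}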

#399Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#398)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me know what you think about the above changes.

I went ahead and made a few changes in
0005-Implement-streaming-mode-in-ReorderBuffer, which are explained
below. I also have a few questions and suggestions for the patch,
covered in the points below.

1.
+ if (prev_lsn == InvalidXLogRecPtr)
+ {
+ if (streaming)
+ rb->stream_start(rb, txn, change->lsn);
+ else
+ rb->begin(rb, txn);
+ stream_started = true;
+ }

I don't think we want to move the begin callback here, as that would
change the existing semantics, so it is better to keep begin at its
original position. I have made the required changes in the attached patch.

2.
ReorderBufferTruncateTXN()
{
..
+ dlist_foreach_modify(iter, &txn->changes)
+ {
+ ReorderBufferChange *change;
+
+ change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+ /* remove the change from it's containing list */
+ dlist_delete(&change->node);
+
+ ReorderBufferReturnChange(rb, change);
+ }
..
}

I think here we can add an Assert that we're not mixing changes from
different transactions. See the changes in the patch.

3.
SetupCheckXidLive()
{
..
+ /*
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also, reset the
+ * bsysscan flag.
+ */
+ if (!TransactionIdDidCommit(xid))
+ {
+ CheckXidAlive = xid;
+ bsysscan = false;
..
}

What is the need to reset the bsysscan flag here if we are already
resetting it on error (as in the previous patch I sent)?

4.
ReorderBufferProcessTXN()
{
..
..
+ /* Reset the CheckXidAlive */
+ if (streaming)
+ CheckXidAlive = InvalidTransactionId;
..
}

Similar to the previous point, we don't need this either, because
AbortCurrentTransaction would have taken care of it.

5.
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which attempts to
+ * stream it again before the commit)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)

The above comment doesn't make much sense to me, so I have removed it.
Basically, if there are no changes before commit, we still need to
send the commit, and if there are no more changes,
ReorderBufferProcessTXN will not do anything anyway.

6.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
if (txn->snapshot_now == NULL)
+ {
+ dlist_iter subxact_i;
+
+ /* make sure this transaction is streamed for the first time */
+ Assert(!rbtxn_is_streamed(txn));
+
+ /* at the beginning we should have invalid command ID */
+ Assert(txn->command_id == InvalidCommandId);
+
+ dlist_foreach(subxact_i, &txn->subtxns)
+ {
+ ReorderBufferTXN *subtxn;
+
+ subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+ ReorderBufferTransferSnapToParent(txn, subtxn);
+ }
..
}

Here, it is possible that there is no base_snapshot for txn, so we
need a check for that, similar to the one in ReorderBufferCommit.

7. Apart from the above, I made a few changes in comments and ran pgindent.

8. We can't stream a transaction before we reach the
SNAPBUILD_CONSISTENT state, because some other output plugin could
apply those changes immediately, unlike the pgoutput plugin (which
writes them to a file). And I think applying transactions without
reaching a consistent state would be wrong anyway. So we should avoid
that, and if we do, we should have an Assert for streamed txns rather
than sending an abort for them in ReorderBufferForget.

9.
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
{
..
+ ReorderBufferToastReset(rb, txn);
+ if (specinsert != NULL)
+ ReorderBufferReturnChange(rb, specinsert);
..
}

Why do we need to do these here, given that they wouldn't be done for
any exception other than ERRCODE_TRANSACTION_ROLLBACK?

10. I got the below failure once. I have not investigated it in
detail as the patch is still in progress. See if you have any idea.
# Failed test 'check extra columns contain local defaults'
# at t/013_stream_subxact_ddl_abort.pl line 81.
# got: '2|0'
# expected: '1000|500'
# Looks like you failed 1 test of 2.
make[2]: *** [check] Error 1
make[1]: *** [check-subscription-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

11. Can we test by introducing a new GUC such that all the
transactions (at least in existing tests) start to stream? Basically,
it would allow us to disregard logical_decoding_work_mem and ensure
that all regression tests exercise the new code (a rough sketch of
such a knob follows below). Note, I am suggesting this just for
testing purposes, not for actual integration in the code.
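
For concreteness, a rough sketch of such a knob, assuming a
hypothetical boolean GUC named logical_decoding_force_stream (the GUC,
the helper name, and the rb->size accounting field from earlier in
this series are the assumptions here; the real memory-limit check
would simply consult the flag first):

#include "postgres.h"

#include "replication/reorderbuffer.h"

/* hypothetical developer-only GUC, registered via guc.c as usual */
bool		logical_decoding_force_stream = false;

/*
 * Decide whether to pick the largest transaction and stream (or spill)
 * it: either when forced by the testing GUC, or when the reorder buffer
 * exceeds logical_decoding_work_mem (which is in kilobytes).
 */
static bool
StreamingRequiredSketch(ReorderBuffer *rb)
{
	if (logical_decoding_force_stream)
		return true;

	return rb->size >= logical_decoding_work_mem * 1024L;
}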

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v31.tar
v31-0001-Immediately-WAL-log-subtransaction-and-top-level.patch

From 7d2f1d04749edda35d09f08c8f9fd41828688ab8 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v31 01/14] Immediately WAL-log subtransaction and top-level
 XID association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part of
the next WAL record (to minimize overhead), but only when
wal_level=logical. We cannot remove the existing XLOG_XACT_ASSIGNMENT
WAL as that is required for avoiding overflow in the hot standby snapshot.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 44 ++++++++++++++--------------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 905dc7d..a93fb8a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5120,6 +5122,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6022,3 +6025,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL-log the top-level XID for
+ * an operation in a subtransaction.  We require that for logical decoding, see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f..c526bb1 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4..a757bac 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1197,6 +1197,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1235,6 +1236,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..0c0c371 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index db19187..aef8555 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77ac4e7..8058eef 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b0f2a6e..b976882 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -304,6 +306,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1
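
To illustrate the reader-side effect of the above: the association is
carried only by the first WAL record of each subtransaction, so a
consumer could resolve a record's top-level transaction with something
like this sketch (the function name is invented; XLogRecGetTopXid and
ReorderBufferAssignChild are the patch's real entry points):

#include "postgres.h"

#include "access/transam.h"
#include "access/xlogreader.h"

/*
 * Return the top-level XID a record belongs to, if the record carries
 * one.  Later records of the same subxact return InvalidTransactionId,
 * so the caller must rely on the mapping established earlier (e.g. via
 * ReorderBufferAssignChild).
 */
static TransactionId
RecordTopXidSketch(XLogReaderState *record)
{
	TransactionId top = XLogRecGetTopXid(record);

	if (TransactionIdIsValid(top))
		return top;

	/* no assignment attached; fall back to the record's own xid */
	return XLogRecGetXid(record);
}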

v31-0002-WAL-Log-invalidations-at-command-end-with-wal_le.patch

From 3dcf8f8242c6a5517c587a205389c29388c4a61d Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v31 02/14] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end use a new xlog record type
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in the top-level transaction,
and then executed during replay.  This obviates the need to decode the
invalidations as part of a commit record.

The pre-existing LogStandbyInvalidations mechanism accumulates all the
invalidations in memory and writes them only once, at commit time, which may
reduce the performance impact by amortizing the overhead and deduplicating
the invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 10 +++++
 src/backend/access/transam/xact.c               |  7 +++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 52 ++++++++++++++++++----
 src/backend/utils/cache/inval.c                 | 56 ++++++++++++++++++++++++
 src/include/access/xact.h                       | 13 +++++-
 src/include/replication/reorderbuffer.h         |  3 ++
 7 files changed, 166 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..68aa994 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a93fb8a..d93b40f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -6022,6 +6022,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..7153eba 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions;
+				 * otherwise, accumulate them so that they can be processed
+				 * at commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 642a1c7..4b277fe 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -860,6 +860,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2205,7 +2208,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases, when we skip processing the
+ * transaction (see ReorderBufferForget), we still need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2216,17 +2223,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2254,6 +2279,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark the top-level transaction as having catalog changes too if one
+	 * of its children has them, so that ReorderBufferBuildTupleCidHash can
+	 * conveniently check just the top-level transaction and decide whether
+	 * to build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..7d4fd9f 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *	CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -210,6 +214,8 @@ static struct RELCACHECALLBACK
 
 static int	relcache_callback_count = 0;
 
+static void LogLogicalInvalidations(void);
+
 /* ----------------------------------------------------------------
  *				Invalidation list support functions
  *
@@ -1094,6 +1100,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1512,48 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+static void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..ac3f5e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..74ffe78 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
-- 
1.8.3.1

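To make the accumulation scheme in ReorderBufferAddInvalidations above easier
to follow in isolation, here is a self-contained sketch of the same
grow-an-array pattern, assuming PostgreSQL's memory-context API and
storage/sinval.h (InvalArray and inval_append are made-up names for
illustration, not part of the patch):

typedef struct InvalArray
{
	int			nmsgs;			/* number of accumulated messages */
	SharedInvalidationMessage *msgs;	/* palloc'd array, or NULL */
} InvalArray;

/* Append nmsgs messages to arr, allocating or growing the array in cxt. */
static void
inval_append(MemoryContext cxt, InvalArray *arr,
			 int nmsgs, SharedInvalidationMessage *msgs)
{
	if (arr->nmsgs == 0)
		arr->msgs = (SharedInvalidationMessage *)
			MemoryContextAlloc(cxt,
							   sizeof(SharedInvalidationMessage) * nmsgs);
	else
		arr->msgs = (SharedInvalidationMessage *)
			repalloc(arr->msgs,
					 sizeof(SharedInvalidationMessage) *
					 (arr->nmsgs + nmsgs));

	memcpy(arr->msgs + arr->nmsgs, msgs,
		   sizeof(SharedInvalidationMessage) * nmsgs);
	arr->nmsgs += nmsgs;
}

The patch applies this pattern to txn->invalidations, redirecting
subtransactions to their top-level ReorderBufferTXN first, so that
ReorderBufferForget can execute everything in one place.
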
v31-0003-Extend-the-logical-decoding-output-plugin-API-wi.patch

From 87c0c70a5db45c968461bb1486bf167017d89a6b Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v31 03/14] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, adding support for
streaming changes for large transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk of
changes streamed for a particular toplevel transaction.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 123 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 +++++++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  59 +++++
 6 files changed, 825 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..4a71826 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,97 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93c..18116c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting.  At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may exceed the memory limit before having
+    decoded a complete tuple (e.g. we have decoded a TOAST table insert, but
+    not yet the corresponding main table insert).
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be..6ee59bd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. We however enable streaming when at least one
+	 * of the methods is set, so that we can easily identify missing
+	 * methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475..deef318 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..2d9aa11 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from an in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 74ffe78..9d60ed8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -387,6 +435,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

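For output plugin authors, the practical consequence of the checks in
StartupDecodingContext and the wrappers above is: setting any one
stream_*_cb enables streaming, and the five required callbacks must then all
be provided, or the corresponding wrapper raises an error on first use. A
hypothetical plugin initialization might look like this (the my_* handlers
are placeholders, not part of the patch):

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* regular (commit-time) callbacks; my_* definitions elided */
	cb->begin_cb = my_begin;
	cb->change_cb = my_change;
	cb->commit_cb = my_commit;

	/* required once any streaming callback is set */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;

	/* optional; the wrappers silently skip them when NULL */
	cb->stream_message_cb = NULL;
	cb->stream_truncate_cb = NULL;
}
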
v31-0004-Gracefully-handle-concurrent-aborts-of-transacti.patch

From 12eea4983277665640843d9e0d41b213e3a6fa5a Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:49:40 +0530
Subject: [PATCH v31 04/14] Gracefully handle concurrent aborts of transactions
 being decoded.

Concurrent aborts are not an issue when decoding committed transactions,
and we never decode transactions that abort before the decoding starts.

But for an upcoming patch that allows decoding of in-progress
transactions, this may cause failures when the output plugin consults
catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. On receipt of such an
sqlerrcode, the decoding logic aborts the ongoing decoding and returns
gracefully.

Author: Dilip Kumar, Nikhil Sontakke, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/logicaldecoding.sgml         |  9 +++--
 src/backend/access/heap/heapam.c          | 10 ++++++
 src/backend/access/index/genam.c          | 53 +++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c        |  8 +++++
 src/backend/access/transam/xact.c         | 19 +++++++++++
 src/backend/replication/logical/logical.c | 10 ++++++
 src/include/access/tableam.h              | 55 +++++++++++++++++++++++++++++++
 src/include/access/xact.h                 |  4 +++
 src/include/replication/logical.h         |  1 +
 9 files changed, 166 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 18116c8..98b47b0 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 537913d..0022b31 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at the
+	 * tableam level API, but this is called from many places, so we need to
+	 * ensure it here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted.  We can't directly use
+ * TransactionIdDidAbort, because after a crash such a transaction might
+ * not have been marked as aborted.  See detailed comments in xact.c where
+ * the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29..a61e279 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -235,6 +235,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d93b40f..c7f1877 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such a transaction gets aborted while decoding is ongoing, in which
+ * case we skip decoding that particular transaction.  To ensure this, we
+ * check whether CheckXidAlive has aborted after fetching each tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for concurrent
+ * aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2670,6 +2683,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4972,6 +4988,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6ee59bd..8deff89 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b3d2a6d..acb6c38 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1712,6 +1737,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1729,6 +1762,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1747,6 +1788,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1763,6 +1811,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index ac3f5e3..5f767eb 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
-- 
1.8.3.1

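To illustrate how the concurrent abort detection above is meant to be
exercised, here is a minimal sketch (not part of the patch) of a catalog
lookup during decoding of an in-progress transaction, assuming the usual
access/genam.h and access/heapam.h includes. The relation and scan key are
hypothetical placeholders; only the systable_* calls correspond to the
patched APIs:

    /*
     * Minimal sketch: catalog scan while decoding an in-progress xact.
     * Assumes CheckXidAlive was already set by the decoding machinery.
     */
    static HeapTuple
    lookup_catalog_tuple(Relation catrel, ScanKeyData *skey)
    {
        SysScanDesc scan;
        HeapTuple   tup;

        /* systable_beginscan sets bsysscan when CheckXidAlive is valid */
        scan = systable_beginscan(catrel, InvalidOid, false, NULL, 1, skey);

        /*
         * After fetching a tuple, systable_getnext checks whether
         * CheckXidAlive has aborted concurrently; if so, it raises an
         * ERRCODE_TRANSACTION_ROLLBACK error, which the decoding code can
         * catch to stop processing the transaction gracefully.
         */
        tup = systable_getnext(scan);
        if (HeapTupleIsValid(tup))
            tup = heap_copytuple(tup);

        /* systable_endscan resets bsysscan */
        systable_endscan(scan);

        return tup;
    }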
v31-0005-Implement-streaming-mode-in-ReorderBuffer.patch

From 0e78ba340826cd673b642876e74c1d2bb9fe6bc7 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 4 Jul 2020 10:01:53 +0530
Subject: [PATCH v31 05/14] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast or speculative insert, we spill to disk because we cannot
generate the complete tuple and stream it.  As soon as we get the complete
tuple, we stream the transaction including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

We have ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/replication/logical/reorderbuffer.c | 765 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  26 +
 3 files changed, 756 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4b277fe..332a8ca 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +382,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -768,6 +782,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1022,6 +1068,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1036,6 +1085,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1314,6 +1366,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1339,6 +1400,84 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they were originally started inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked as
+	 * streamed always, even if it does not contain any changes (that is, when
+	 * all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1489,57 +1628,171 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that the
+ * (sub)transaction gets aborted concurrently.  In that case, if the
+ * (sub)transaction has made catalog updates, we might decode tuples using the
+ * wrong catalog version.  So, to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction the current change belongs
+ * to.  During catalog scans we then check the status of that xid, and if it
+ * has aborted we report a specific error so that we can stop streaming the
+ * current transaction and discard the already streamed changes.  We might
+ * have already streamed some of the changes for the aborted (sub)transaction,
+ * but that is fine because when we decode the abort we will send a stream
+ * abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
-	}
 
-	snapshot_now = txn->base_snapshot;
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid has aborted; that will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Store the command id and snapshot at the end of the current stream so that
+ * we can reuse them while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN such that it
+ * can be used to stream the remaining data of transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+	ReorderBufferToastReset(rb, txn);
+	if (specinsert != NULL)
+		ReorderBufferReturnChange(rb, specinsert);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1562,14 +1815,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1577,6 +1831,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream_start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1653,7 +1933,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1693,7 +1974,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1751,7 +2032,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1760,10 +2044,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1794,7 +2075,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1849,14 +2129,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, send the last message for this set of
+		 * changes depending upon streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1874,14 +2175,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1900,15 +2214,105 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
 
-		PG_RE_THROW();
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1935,6 +2339,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could load
+		 * the cache as per the current transaction's view (consider DDLs that
+		 * happened in this transaction). We don't want the decoding of future
+		 * transactions to use those cache entries, so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2004,6 +2424,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2139,8 +2563,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2148,6 +2581,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2159,19 +2593,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2200,6 +2643,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2392,6 +2836,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because with streaming we don't update
+ * the memory accounting for subtransactions, so it's always 0). But here we
+ * can simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction  at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2423,11 +2899,38 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStream(rb))
+		{
+			/*
+			 * Pick the largest toplevel transaction and evict it from memory
+			 * by streaming the already decoded part.
+			 */
+			txn = ReorderBufferLargestTopTXN(rb);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2725,6 +3228,113 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all subtransactions to the
+	 * snapshot's xip array via SnapBuildCommittedTxn, we can't do that here;
+	 * instead we add them to the subxip array via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in subtransactions decoded till
+	 * now to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run.  We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3824,6 +4434,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9d60ed8..b1d48c4 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +182,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +268,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1

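For reference, the net effect of the ReorderBufferCheckMemoryLimit changes in
the patch above can be condensed into the following simplified sketch (not
the verbatim patch code):

    /*
     * Simplified sketch of the eviction policy: once the reorder buffer
     * exceeds logical_decoding_work_mem, evict the largest transaction,
     * either by streaming it (if the output plugin supports streaming)
     * or by serializing it to disk.
     */
    while (rb->size >= logical_decoding_work_mem * 1024L)
    {
        ReorderBufferTXN *txn;

        if (ReorderBufferCanStream(rb))
        {
            /* streaming evicts toplevel transactions only */
            txn = ReorderBufferLargestTopTXN(rb);
            ReorderBufferStreamTXN(rb, txn);
        }
        else
        {
            /* any (sub)transaction may be spilled to disk */
            txn = ReorderBufferLargestTXN(rb);
            ReorderBufferSerializeTXN(rb, txn);
        }
    }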
v31-0006-Bugfix-handling-of-incomplete-toast-spec-insert.patch

From af88302f01fbfc381fa0a1c4cf55133b246011d3 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 4 Jul 2020 10:19:59 +0530
Subject: [PATCH v31 06/14] Bugfix handling of incomplete toast/spec insert.

---
 src/backend/access/heap/heapam.c                |   3 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/reorderbuffer.c | 441 ++++++++++++++++++------
 src/include/access/heapam_xlog.h                |   1 +
 src/include/replication/reorderbuffer.h         |  50 ++-
 5 files changed, 395 insertions(+), 117 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0022b31..b61139b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7153eba..2010d5a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 332a8ca..d6a4d26 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -237,7 +252,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool partial_truncate);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -436,62 +452,71 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 /*
  * Free an ReorderBufferChange.
  */
-void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+static void
+ReorderBufferFreeChange(ReorderBuffer *rb, ReorderBufferChange *change)
 {
-	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
-
 	/* free contained data */
 	switch (change->action)
 	{
-		case REORDER_BUFFER_CHANGE_INSERT:
-		case REORDER_BUFFER_CHANGE_UPDATE:
-		case REORDER_BUFFER_CHANGE_DELETE:
-		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
-			if (change->data.tp.newtuple)
-			{
-				ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
-				change->data.tp.newtuple = NULL;
-			}
+	case REORDER_BUFFER_CHANGE_INSERT:
+	case REORDER_BUFFER_CHANGE_UPDATE:
+	case REORDER_BUFFER_CHANGE_DELETE:
+	case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+		if (change->data.tp.newtuple)
+		{
+			ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
+			change->data.tp.newtuple = NULL;
+		}
 
-			if (change->data.tp.oldtuple)
-			{
-				ReorderBufferReturnTupleBuf(rb, change->data.tp.oldtuple);
-				change->data.tp.oldtuple = NULL;
-			}
-			break;
-		case REORDER_BUFFER_CHANGE_MESSAGE:
-			if (change->data.msg.prefix != NULL)
-				pfree(change->data.msg.prefix);
-			change->data.msg.prefix = NULL;
-			if (change->data.msg.message != NULL)
-				pfree(change->data.msg.message);
-			change->data.msg.message = NULL;
-			break;
-		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
-			if (change->data.snapshot)
-			{
-				ReorderBufferFreeSnap(rb, change->data.snapshot);
-				change->data.snapshot = NULL;
-			}
-			break;
-			/* no data in addition to the struct itself */
-		case REORDER_BUFFER_CHANGE_TRUNCATE:
-			if (change->data.truncate.relids != NULL)
-			{
-				ReorderBufferReturnRelids(rb, change->data.truncate.relids);
-				change->data.truncate.relids = NULL;
-			}
-			break;
-		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
-		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
-		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-			break;
+		if (change->data.tp.oldtuple)
+		{
+			ReorderBufferReturnTupleBuf(rb, change->data.tp.oldtuple);
+			change->data.tp.oldtuple = NULL;
+		}
+		break;
+	case REORDER_BUFFER_CHANGE_MESSAGE:
+		if (change->data.msg.prefix != NULL)
+			pfree(change->data.msg.prefix);
+		change->data.msg.prefix = NULL;
+		if (change->data.msg.message != NULL)
+			pfree(change->data.msg.message);
+		change->data.msg.message = NULL;
+		break;
+	case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+		if (change->data.snapshot)
+		{
+			ReorderBufferFreeSnap(rb, change->data.snapshot);
+			change->data.snapshot = NULL;
+		}
+		break;
+		/* no data in addition to the struct itself */
+	case REORDER_BUFFER_CHANGE_TRUNCATE:
+		if (change->data.truncate.relids != NULL)
+		{
+			ReorderBufferReturnRelids(rb, change->data.truncate.relids);
+			change->data.truncate.relids = NULL;
+		}
+		break;
+	case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+	case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+	case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		break;
 	}
 
 	pfree(change);
 }
+
+/*
+ * Free a ReorderBufferChange and update memory accounting.
+ */
+void
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+{
+	/* update memory accounting info */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
+	/* free contained data */
+	ReorderBufferFreeChange(rb, change);
+}
 
 /*
  * Get a fresh ReorderBufferTupleBuf fitting at least a tuple of size
@@ -642,16 +667,104 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
+ * Handle incomplete tuples while streaming.  If streaming is enabled, we may
+ * need to stream an in-progress transaction, but some changes cannot be
+ * streamed until the change that completes them arrives -- e.g. a toast
+ * table insert without the corresponding main table insert, or a
+ * speculative insert without its confirm record.  This function therefore
+ * remembers the LSN of the last complete change, and the size of the
+ * transaction up to that LSN, so that streaming can stop at the last
+ * complete change.
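+ *
+ * For example (an illustrative sequence): an insert of a row with a large
+ * TOASTed value arrives as a series of toast table inserts followed by the
+ * main table insert, and only the latter completes the tuple; until it
+ * arrives, streaming must stop at the last complete change before the
+ * toast inserts.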
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert, Size total_size)
+{
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * If this is the first incomplete change, remember the size of the
+	 * transaction up to this point, i.e. the size of the complete changes.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		(toast_insert || IsSpecInsert(change->action)))
+		toptxn->complete_size = total_size;
+
+	/*
+	 * If this is a toast insert, set the corresponding flag.  Both inserts
+	 * and updates of the main table first insert into the toast table, and
+	 * as explained in the function header we cannot stream toast-only
+	 * changes.  So we set the flag on a toast insert and clear it again on
+	 * the next insert or update of the main table.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Similarly, set the spec-insert flag on a speculative insert (the
+	 * tuple is incomplete until confirmed) and clear it on the speculative
+	 * confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If no incomplete change remains after this change, remember this LSN
+	 * as the last complete LSN.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)))
+	{
+		toptxn->last_complete_lsn = change->lsn;
+
+		/*
+		 * If the transaction is serialized and the changes in the top-level
+		 * transaction are now complete, stream the transaction immediately
+		 * rather than waiting for the memory limit.  In streaming mode, a
+		 * serialized transaction means we already reached the memory limit
+		 * but could not stream at that time due to an incomplete tuple, so
+		 * we stream as soon as the tuple is complete.  Moreover, if we kept
+		 * the serialized changes and then received more incomplete changes
+		 * in this transaction, we would have no way to partially truncate
+		 * the serialized changes.
+		 */
+		if (rbtxn_is_serialized(txn))
+			ReorderBufferStreamTXN(rb, toptxn);
+	}
+}
+
+/*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
+	Size	total_size = 0;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * If we detected that the transaction was concurrently aborted while
+	 * streaming its previous changes, there is no point in collecting
+	 * further changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		ReorderBufferFreeChange(rb, change);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -660,9 +773,28 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * Remember the total size of the top-level transaction before accounting
+	 * for the current change, so that if this change turns out to be
+	 * incomplete we know the size of the complete changes preceding it.
+	 * That is used to update the complete-changes size of the top-level
+	 * transaction for streaming.
+	 */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			total_size = txn->toptxn->total_size;
+		else
+			total_size = txn->total_size;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* Handle incomplete tuples, if streaming is enabled. */
+	if (ReorderBufferCanStream(rb))
+		ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert,
+										   total_size);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -692,7 +824,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1403,11 +1535,46 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * Discard changes from a transaction (and subtransactions), after streaming
  * them. Keep the remaining info - transactions, tuplecids, invalidations and
  * snapshots.
+ *
+ * If partial_truncate is false, we truncate the transaction completely;
+ * otherwise we truncate only up to last_complete_lsn.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 bool partial_truncate)
 {
 	dlist_mutable_iter iter;
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * A serialized transaction should never be partially truncated, because
+	 * we stream it as soon as its changes are complete.
+	 */
+	Assert(!(rbtxn_is_serialized(txn) && partial_truncate));
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1424,7 +1591,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, partial_truncate);
 	}
 
 	/* cleanup changes in the toplevel txn */
@@ -1437,6 +1604,14 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
+		/* We have truncated up to the last complete LSN, so stop. */
+		if (partial_truncate && (change->lsn > toptxn->last_complete_lsn))
+		{
+			/* The transaction must have incomplete changes. */
+			Assert(rbtxn_has_incomplete_tuple(toptxn));
+			break;
+		}
+
 		/* remove the change from it's containing list */
 		dlist_delete(&change->node);
 
@@ -1444,24 +1619,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The toplevel transaction, identified by (toptxn==NULL), is marked as
-	 * streamed always, even if it does not contain any changes (that is, when
-	 * all the changes are in subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
-	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
 	 * values, but this seems simpler and good enough for now.
@@ -1472,9 +1629,39 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
-	/* also reset the number of entries in the transaction */
-	txn->nentries_mem = 0;
-	txn->nentries = 0;
+	/*
+	 * Adjust nentries/nentries_mem based on the changes processed.  See
+	 * comments where nprocessed is declared.
+	 */
+	if (partial_truncate)
+	{
+		txn->nentries -= txn->nprocessed;
+		txn->nentries_mem -= txn->nprocessed;
+	}
+	else
+	{
+		txn->nentries = 0;
+		txn->nentries_mem = 0;
+	}
+	txn->nprocessed = 0;
+
+	/*
+	 * If this is a top-level transaction, we can reset last_complete_lsn and
+	 * complete_size, because by now we have streamed all the changes up to
+	 * last_complete_lsn.
+	 */
+	if (partial_truncate && (txn->toptxn == NULL))
+	{
+		toptxn->last_complete_lsn = InvalidXLogRecPtr;
+		toptxn->complete_size = 0;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
 }
 
 /*
@@ -1758,7 +1945,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed. */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Stop the stream. */
 	rb->stream_stop(rb, txn, last_lsn);
@@ -1793,6 +1980,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
 	ReorderBufferChange *volatile specinsert = NULL;
 	volatile bool stream_started = false;
+	volatile bool partial_truncate = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1855,7 +2044,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			/* Set the xid for concurrent abort check. */
 			if (streaming)
-				SetupCheckXidLive(change->txn->xid);
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
 
 			switch (change->action)
 			{
@@ -2112,6 +2304,27 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
 			}
+
+			if (streaming)
+			{
+				/*
+				 * Increment the nprocessed count.  See the detailed comment
+				 * on its usage in the ReorderBufferTXN structure.
+				 */
+				curtxn->nprocessed++;
+
+				/*
+				 * If the transaction contains an incomplete tuple and this
+				 * is the last complete change, stop further processing of
+				 * the transaction and set the partial-truncate flag.
+				 */
+				if (rbtxn_has_incomplete_tuple(txn) &&
+					prev_lsn == txn->last_complete_lsn)
+				{
+					partial_truncate = true;
+					break;
+				}
+			}
 		}
 
 		/*
@@ -2183,7 +2396,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, partial_truncate);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2232,6 +2445,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
+			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2510,7 +2724,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2559,7 +2773,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2582,6 +2796,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2596,8 +2811,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
-	/* if subxact, and streaming supported, use the toplevel instead */
+	/* if streaming is supported, track the toplevel for size accounting */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2605,12 +2825,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2852,18 +3080,28 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+	Size		largest_size = 0;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
+		Size		size;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/*
+		 * If this transaction has some incomplete changes then only consider
+		 * the size up to the last complete LSN.
+		 */
+		if (rbtxn_has_incomplete_tuple(txn))
+			size = txn->complete_size;
+		else
+			size = txn->total_size;
+
+		/* If the current transaction is larger, remember it. */
+		if (size > largest_size)
+		{
 			largest = txn;
+			largest_size = size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2901,18 +3139,13 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		 * Pick the largest transaction (or subtransaction) and evict it from
 		 * memory by streaming, if supported. Otherwise, spill to disk.
 		 */
-		if (ReorderBufferCanStream(rb))
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
 		{
-			/*
-			 * Pick the largest toplevel transaction and evict it from memory
-			 * by streaming the already decoded part.
-			 */
-			txn = ReorderBufferLargestTopTXN(rb);
-
 			/* we know there has to be one, because the size is not zero */
 			Assert(txn && !txn->toptxn);
-			Assert(txn->size > 0);
-			Assert(rb->size >= txn->size);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
 			ReorderBufferStreamTXN(rb, txn);
 		}
@@ -2930,14 +3163,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(rb->size >= txn->size);
 
 			ReorderBufferSerializeTXN(rb, txn);
-		}
 
-		/*
-		 * After eviction, the transaction should have no entries in memory,
-		 * and should use 0 bytes for changes.
-		 */
-		Assert(txn->size == 0);
-		Assert(txn->nentries_mem == 0);
+			/*
+			 * After eviction, the transaction should have no entries in
+			 * memory, and should use 0 bytes for changes.
+			 */
+			Assert(txn->size == 0);
+			Assert(txn->nentries_mem == 0);
+		}
 	}
 
 	/* We must be under the memory limit now. */
@@ -3329,10 +3562,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/* Process and send the changes to output plugin. */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
-
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
 }
 
 /*
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b1d48c4..c4c0903 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -163,6 +163,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -182,6 +184,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * This transaction's changes include a toast insert without the
+ * corresponding main table insert.
+ */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes include a speculative insert without the
+ * corresponding speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -190,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -339,6 +357,26 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Total size of the top-level transaction, including subtransactions. */
+	Size		total_size;
+
+	/* Size of the complete changes, i.e. up to last_complete_lsn. */
+	Size		complete_size;
+
+	/* LSN of the last complete change. */
+	XLogRecPtr	last_complete_lsn;
+
+	/*
+	 * Number of changes processed.  This is used to keep track of changes
+	 * that remain to be streamed.  As of now, changes can remain either due
+	 * to toast tuples or speculative insertions, where we need to wait for
+	 * multiple changes before we can send them.
+	 */
+	uint64		nprocessed;
+
+	/* If we have detected a concurrent abort, ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -526,7 +564,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

v31-0007-Add-support-for-streaming-to-built-in-replicatio.patch

From d2de3466a4db7096418035278a309698309f2732 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 4 Jul 2020 10:40:15 +0530
Subject: [PATCH v31 07/14] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information
(e.g. the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions, by spilling the data to disk and then
replaying it on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |    4 +-
 doc/src/sgml/ref/create_subscription.sgml          |   11 +
 src/backend/catalog/pg_subscription.c              |    1 +
 src/backend/commands/subscriptioncmds.c            |   45 +-
 src/backend/postmaster/pgstat.c                    |   12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    3 +
 src/backend/replication/logical/proto.c            |  140 ++-
 src/backend/replication/logical/worker.c           | 1012 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c        |  318 +++++-
 src/backend/replication/slotfuncs.c                |    6 +
 src/backend/replication/walsender.c                |    6 +
 src/include/catalog/pg_subscription.h              |    3 +
 src/include/pgstat.h                               |    6 +-
 src/include/replication/logicalproto.h             |   42 +-
 src/include/replication/walreceiver.h              |    1 +
 src/test/subscription/t/009_stream_simple.pl       |   86 ++
 src/test/subscription/t/010_stream_subxact.pl      |  102 ++
 src/test/subscription/t/011_stream_ddl.pl          |   95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |   82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |   84 ++
 20 files changed, 2019 insertions(+), 40 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index c24ace1..d8de56c 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 5bbc165..c25b7c5 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
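+         <para>
+          For example, streaming can be enabled when the subscription is
+          created (the names below are illustrative):
+<programlisting>
+CREATE SUBSCRIPTION mysub
+    CONNECTION 'host=publisher.example.com dbname=mydb'
+    PUBLICATION mypub
+    WITH (streaming = on);
+</programlisting>
+         </para>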
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731..f28482f 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026..9065a1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c022597..a55ccc0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4138,6 +4138,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..83d0642 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
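+/*
+ * Write STREAM START to the output stream.
+ *
+ * The message layout is: message byte 'S', the toplevel XID (int32), and a
+ * byte that is 1 when this is the first streamed segment for this XID.
+ */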
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
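+/*
+ * Read STREAM START from the input stream; returns the toplevel transaction
+ * ID and sets first_segment.
+ */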
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
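+/*
+ * Write STREAM STOP to the output stream.  The message carries no data
+ * besides the action byte.
+ */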
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
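+/*
+ * Write STREAM COMMIT to the output stream.
+ *
+ * The message layout is: message byte 'c', the toplevel XID (int32), a
+ * flags byte (currently unused), then commit_lsn, end_lsn and commit_time
+ * as int64 fields.
+ */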
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
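+/*
+ * Read STREAM COMMIT from the input stream, filling in commit_data;
+ * returns the toplevel transaction ID.
+ */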
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
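+/*
+ * Write STREAM ABORT to the output stream.  When aborting a whole toplevel
+ * transaction, xid and subxid are the same.
+ */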
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
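+/*
+ * Read STREAM ABORT from the input stream, returning the toplevel and
+ * subtransaction XIDs.
+ */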
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a12..d2d9469 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,32 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * also has to deal with aborts of both the toplevel transaction and of
+ * individual subtransactions. This is achieved by tracking the file offset
+ * of each subtransaction's first change, which is then used to truncate
+ * the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID
+ * of the subscription. This is necessary so that different workers
+ * processing a remote transaction with the same XID don't interfere.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,6 +54,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -64,6 +86,7 @@
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +94,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -100,6 +124,7 @@ typedef struct SlotErrCallbackArg
 } SlotErrCallbackArg;
 
 static MemoryContext ApplyMessageContext = NULL;
+static MemoryContext LogicalStreamingContext = NULL;
 MemoryContext ApplyContext = NULL;
 
 WalReceiverConn *wrconn = NULL;
@@ -110,12 +135,58 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool	in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+static int	stream_fd = -1;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;						/* XID of the subxact */
+	off_t		offset;					/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
+/*
+ * XIDs of toplevel transactions whose changes have been serialized to
+ * files, so the files can be cleaned up on worker exit.
+ */
+static int	nxids = 0;
+static int	maxnxids = 0;
+static TransactionId	*xids = NULL;
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +258,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the message to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != -1);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -553,6 +660,326 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+
+	Assert(!in_streamed_transaction);
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/*
+	 * If this is not the first segment, restore the subxact info serialized
+	 * at the previous stream stop.
+	 *
+	 * XXX Note that the file cleanup is performed by stream_open_file.
+	 */
+	if (!first_segment)
+	{
+		MemoryContext oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+
+		/* Read the subxacts info in per-stream context. */
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+		MemoryContextSwitchTo(oldctx);
+	}
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+	{
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+		return;
+	}
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		int			fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+
+		subidx = -1;
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
+		if (fd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+		}
+
+		/* OK, truncate the file at the right offset. */
+		if (ftruncate(fd, subxacts[subidx].offset))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not truncate file \"%s\": %m", path)));
+		CloseTransientFile(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	int			fd;
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	LogicalRepCommitData commit_data;
+
+	MemoryContext oldcxt;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	}
+
+	ensure_transaction();
+
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	buffer = palloc(8192);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		nbytes = read(fd, &len, sizeof(len));
+		pgstat_report_wait_end();
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
+		if (read(fd, buffer, len) != len)
+		{
+			int			save_errno = errno;
+
+			CloseTransientFile(fd);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file: %m")));
+			return;
+		}
+		pgstat_report_wait_end();
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+
+		/*
+		 * send feedback to upstream
+		 *
+		 * XXX Probably should send a valid LSN. But which one?
+		 */
+		send_feedback(InvalidXLogRecPtr, false, false);
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -565,6 +992,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1010,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1049,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1167,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1312,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1685,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1826,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
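+
+			/*
+			 * Streamed transactions arrive as one or more blocks of changes
+			 * delimited by STREAM START ('S') and STREAM STOP ('E')
+			 * messages, eventually terminated by a STREAM COMMIT ('c') or
+			 * by a STREAM ABORT ('A') of the toplevel transaction.
+			 */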
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1478,6 +1939,22 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
+ * Cleanup function.
+ *
+ * Called on logical replication worker exit.
+ */
+static void
+worker_onexit(int code, Datum arg)
+{
+	int	i;
+
+	elog(LOG, "cleanup files for %d transactions", nxids);
+
+	for (i = nxids - 1; i >= 0; i--)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+}
+
+/*
  * Apply main loop.
  */
 static void
@@ -1493,6 +1970,17 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode
+	 * is enabled.  The context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													 ALLOCSET_DEFAULT_SIZES);
+
+	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
+	before_shmem_exit(worker_onexit, (Datum) 0);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1941,6 +2429,529 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
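+ *
+ * The on-disk format is a uint32 count of subxacts, followed by that many
+ * fixed-size SubXactInfo entries (subxact XID and file offset).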
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
+
+	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	if ((len > 0) && (write(fd, subxacts, len) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_end();
+
+	/*
+	 * We don't need to fsync or anything, as we'll recreate the files after a
+	 * crash from scratch. So just close the file.
+	 */
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	/*
+	 * But we free the memory allocated for subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ *
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	Size		len;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	subxact_filename(path, subid, xid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+		return;
+	}
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	/* read number of subxact items */
+	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+	}
+
+	pgstat_report_wait_end();
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * The array is allocated in whatever memory context the caller has made
+	 * current. Ideally, at stream start that is LogicalStreamingContext,
+	 * which is reset on stream stop; during stream abort the memory is only
+	 * needed briefly, so ApplyMessageContext is used instead.
+	 */
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+
+	if ((len > 0) && ((read(fd, subxacts, len)) != len))
+	{
+		int			save_errno = errno;
+
+		CloseTransientFile(fd);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m",
+						path)));
+	}
+
+	pgstat_report_wait_end();
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd >= 0);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as in the previous call,
+	 * so we can simply ignore it (its first change was recorded already).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the subxacts array. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	nsubxacts++;
+}
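
/*
 * A hypothetical sketch of how these offsets get consumed when a streamed
 * subxact is rolled back (names reused from above; the real handler lives
 * in the stream-abort path of this patch): locate the subxact and truncate
 * the changes file back to its first change.
 *
 *		subxact_info_read(subid, xid);
 *		for (i = nsubxacts; i > 0; i--)
 *		{
 *			if (subxacts[i - 1].xid == subxid)
 *			{
 *				ftruncate(fd, subxacts[i - 1].offset);
 *				nsubxacts = i - 1;
 *				break;
 *			}
 *		}
 */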
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	char		tempdirpath[MAXPGPATH];
+
+	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+	/*
+	 * We might need to create the tablespace's tempfile directory, if no
+	 * one has yet done so.
+	 */
+	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						tempdirpath)));
+
+	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
+			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+}
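
/*
 * For example, with the default tablespace the format above yields a path
 * like "base/pgsql_tmp/pgsql_tmp12345-16394-733.changes" - PID, then
 * subscription OID, then toplevel XID (the values here are made up).
 */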
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	int			i;
+	char		path[MAXPGPATH];
+	bool		found = false;
+
+	subxact_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	changes_filename(path, subid, xid);
+
+	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not remove file \"%s\": %m", path)));
+
+	/*
+	 * Cleanup the XID from the array - find the XID in the array and
+	 * remove it by shifting all the remaining elements. The array is
+	 * bound to be fairly small (maximum number of in-progress xacts,
+	 * so max_connections + max_prepared_transactions) so simply loop
+	 * through the array and find index of the XID. Then move the rest
+	 * of the array by one element to the left.
+	 *
+	 * Notice we also call this from stream_open_file for first segment
+	 * of each transaction, to deal with possible left-overs after a
+	 * crash, so it's entirely possible not to find the XID in the
+	 * array here. In that case we don't remove anything.
+	 *
+	 * XXX Perhaps it'd be better to handle this automatically after a
+	 * restart, instead of doing it over and over for each transaction.
+	 */
+	for (i = 0; i < nxids; i++)
+	{
+		if (xids[i] == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * Move the last entry in the array to the now-free slot. We don't keep
+	 * the streamed transactions sorted or anything - we only expect a few
+	 * of them in progress (max_connections + max_prepared_xacts) so the
+	 * linear search above is just fine.
+	 */
+	xids[i] = xids[nxids - 1];
+	nxids--;
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, perform cleanup by removing existing
+ * files after a possible previous crash.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	int			flags;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == -1);
+
+	/*
+	 * If this is the first segment for this transaction, try removing
+	 * existing files (if there are any, possibly after a crash).
+	 */
+	if (first_segment)
+	{
+		MemoryContext	oldcxt;
+
+		/* XXX make sure there are no previous files for this transaction */
+		stream_cleanup_files(subid, xid, true);
+
+		/* Need to allocate this in permanent context */
+		oldcxt = MemoryContextSwitchTo(ApplyContext);
+
+		/*
+		 * We need to remember the XIDs we spilled to files, so that we can
+		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
+		 *
+		 * The number of XIDs we may need to track is fairly small, because
+		 * we can only stream toplevel xacts (so limited by max_connections
+		 * and max_prepared_transactions), and we only stream the large ones.
+		 * So we simply keep the XIDs in an unsorted array. If the number of
+		 * xacts gets large for some reason (e.g. very high max_connections),
+		 * a more elaborate approach might be better - e.g. sorted array, to
+		 * speed-up the lookups.
+		 */
+		if (nxids == maxnxids)	/* array of XIDs is full */
+		{
+			if (!xids)
+			{
+				maxnxids = 64;
+				xids = palloc(maxnxids * sizeof(TransactionId));
+			}
+			else
+			{
+				maxnxids = 2 * maxnxids;
+				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
+			}
+		}
+
+		xids[nxids++] = xid;
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	changes_filename(path, subid, xid);
+
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so
+	 * make sure we're the ones creating it. Otherwise just open the file
+	 * for writing, in append mode.
+	 */
+	if (first_segment)
+		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	else
+		flags = (O_WRONLY | O_APPEND | PG_BINARY);
+
+	stream_fd = OpenTransientFile(path, flags);
+
+	if (stream_fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	CloseTransientFile(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = -1;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: the length (not including
+ * the length field itself), the action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so would not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != -1);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
+
+	/* first write the size */
+	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* then the action */
+	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	if (write(stream_fd, &s->data[s->cursor], len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not serialize streamed change to file: %m")));
+
+	pgstat_report_wait_end();
+}
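
/*
 * Illustration of a single record as serialized above (a sketch; "len"
 * counts the action byte plus the data, but not the length field itself):
 *
 *		int		len;			sizeof(action) + size of data
 *		char	action;			logical replication message type
 *		char	data[len - 1];	message body, minus the subxact XID (which
 *								is implied by the file the record is in)
 */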
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3117,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3..1509f9b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The downstream schema cache is, however, updated only at commit time,
+ * and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent. Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort. To handle this,
+ * we maintain a list of xids (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
+
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
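
/*
 * As a usage sketch (slot and publication names made up), a client opts in
 * to streaming through the option list passed to pgoutput, e.g.:
 *
 *		START_REPLICATION SLOT sub1 LOGICAL 0/0
 *			(proto_version '2', publication_names '"mypub"', streaming 'on')
 */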
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version,
+		 * and only when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +373,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because their changes are applied only later (at commit)
+	 * and in an order we don't know at this point, and the regular
+	 * transactions won't see their effects until then.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +423,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +465,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +486,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +518,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +562,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +582,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +607,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +639,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -588,6 +720,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
+ * Send the start of a streamed block of changes for the given toplevel
+ * transaction.
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * Send the end of the current streamed block of changes.
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
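
/*
 * Taken together, the callbacks above produce the following sequence for
 * one large transaction (sketch):
 *
 *		stream_start -> changes -> stream_stop		(repeated each time the
 *													 memory limit is hit)
 *		...
 *		stream_commit (or stream_abort)
 *
 * with in_streaming ensuring the blocks never nest.
 */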
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -624,6 +841,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect relatively small number of streamed transactions.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record that the relation schema was sent within the given streamed
+ * transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -753,12 +1002,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 88033a7..a2dc66b 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -158,6 +158,12 @@ create_logical_replication_slot(char *name, char *plugin,
 									NULL, NULL, NULL);
 
 	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
+	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e2477c4..1abf243 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1017,6 +1017,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..6352ff9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -981,7 +981,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561..89158ed 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
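
/*
 * A sketch of the new v2 stream messages declared above. The one-byte type
 * codes are assumptions here (they are defined by the write functions
 * elsewhere in this series):
 *
 *		'S'	stream start:	xid, first_segment flag
 *		'E'	stream stop:	no payload
 *		'c'	stream commit:	xid plus the usual commit data
 *		'A'	stream abort:	xid, subxid
 */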
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ac1acbb..9513206 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
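# (2 preexisting + 4998 inserted = 5000 rows; the DELETE removes the 1666
# multiples of 3, leaving 3334.)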
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
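# (Each 500-row batch is followed by a DELETE of the multiples of 3 among
# the rows present so far; the survivors add up to 1667.)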
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
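# (Rows 4..2002 carry c, rows 1001..2002 carry d and only row 2002 carries
# e, matching when each column was added: 1999, 1002 and 1 non-NULL values.)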
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
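# (2 preexisting + 498 rows inserted before SAVEPOINT s1 + 500 rows inserted
# after the final ROLLBACK TO s1 = 1000; everything in between was rolled back.)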
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction containing subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
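# (2 + 498 + 500 = 1000 rows; only the 500 rows inserted after ROLLBACK TO
# s1 have a non-NULL c, since the rows before it predate ADD COLUMN c.)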
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v31-0008-Enable-streaming-for-all-subscription-TAP-tests.patch

From 9836f9e53b7645a418eec1d0b65cc60cee96be09 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v31 08/14] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f8..4ba8086 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v31-0009-Add-TAP-test-for-streaming-vs.-DDL.patch

From 78d8638dda7eaf27e72c271051695b1142c9bd72 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v31 09/14] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of a large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v31-0010-Provide-new-api-to-get-the-streaming-changes.patch

From 1fd195d326068ba19b041f30d5def93e0845ba7c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v31 10/14] Provide new api to get the streaming changes

---
 .gitignore                                     |  1 +
 doc/src/sgml/test-decoding.sgml                | 22 ++++++++++++++++++++++
 src/backend/catalog/system_views.sql           |  8 ++++++++
 src/backend/replication/logical/logicalfuncs.c | 23 ++++++++++++++++++-----
 src/include/catalog/pg_proc.dat                |  9 +++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b..6083744 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..eed6e9d 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5314e93..98d3ad0 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e..70c28ff 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes, disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 38295ac..06a0656 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10127,6 +10127,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
1.8.3.1

v31-0011-Add-streaming-option-in-pg_dump.patch

From ceb6ba7aa2159e86beb97cdfbe052dd5d7386275 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v31 11/14] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index a41a3db..d0fb24e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBufferStr(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb..af64270 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
1.8.3.1

v31-0012-Change-buffile-interface-required-for-streaming-.patch

From cdfd66a61d71f8cc1dedc716dd3a00be6e6fd6e0 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 16:40:25 +0530
Subject: [PATCH v31 12/14] Change buffile interface required for streaming
 transaction

Implement BufFileTruncateShared and SEEK_END support.  Also add an option
to provide a mode when opening shared buffiles, instead of always opening
them in read-only mode.
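
A minimal sketch (not part of the patch) of how the extended interface fits
together, using only the signatures changed below; the SharedFileSet setup
and the "mark" bookkeeping (remembering a fileno/offset pair to rewind to)
are assumed to be done by the caller:

#include "postgres.h"

#include <fcntl.h>

#include "storage/buffile.h"
#include "storage/sharedfileset.h"

/*
 * Reopen an existing shared BufFile read-write (via the new "mode"
 * argument), check that SEEK_END works, and truncate the file back to a
 * previously remembered position.
 */
static void
truncate_back_to_mark(SharedFileSet *fileset, const char *name,
					  int mark_fileno, off_t mark_offset)
{
	BufFile    *file;

	/* Open read-write; passing O_RDONLY keeps the old read-only behavior. */
	file = BufFileOpenShared(fileset, name, O_RDWR);

	/* SEEK_END is now supported, e.g. to position at the end of the data. */
	if (BufFileSeek(file, 0, 0, SEEK_END) != 0)
		elog(ERROR, "could not seek to end of temporary file");

	/* Discard everything written after the remembered mark. */
	BufFileTruncateShared(file, mark_fileno, mark_offset);

	BufFileClose(file);
}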
---
 src/backend/postmaster/pgstat.c           |  3 ++
 src/backend/storage/file/buffile.c        | 81 +++++++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++--
 src/backend/storage/file/sharedfileset.c  | 21 +++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  3 +-
 10 files changed, 103 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a55ccc0..a9fbe41 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349..bde6fa1 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,12 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.  The
+ * BufFile infrastructure can also be used by a single backend when the files
+ * need to survive across transactions and need to be opened and closed
+ * multiple times.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +279,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +303,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +323,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -666,11 +668,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the size of the last file so we can find the end offset
+			 * of that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +851,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files, from the last one down to the given fileno. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files beyond the fileno can be deleted directly.  The fileno file
+		 * itself can also be deleted if the offset is 0, unless it is the
+		 * first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 7dc6dd2..060811c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1741,18 +1741,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index f7206c9..0907f79 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -34,16 +34,22 @@ static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name)
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
  *
  * Under the covers the set is one or more directories which will eventually
  * be deleted when there are no backends attached.
+ *
+ * This interface can also be used when the temporary files are used by only
+ * one backend but need to be opened and closed multiple times, and the
+ * underlying files need to survive across transactions.  In such cases, pass
+ * NULL for the dsm segment, so that the files are deleted on proc exit.
  */
 void
 SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
@@ -68,7 +74,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
 }
 
 /*
@@ -131,13 +138,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6352ff9..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..b2f4ba4 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,7 +37,8 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
-- 
1.8.3.1

v31-0013-Worker-tempfile-use-the-shared-buffile-infrastru.patch

From 38f63f620062d9e40fb70d69780ca2b5e0abf93e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 11 Jun 2020 16:42:07 +0530
Subject: [PATCH v31 13/14] Worker tempfile use the shared buffile
 infrastructure

To be merged with 0008; kept separate to make the review easier.
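
As a quick illustration of the pattern introduced here (not part of the
patch itself): every streamed transaction gets an entry in xidhash, and
reopening its spool file is a hash lookup plus BufFileOpenShared.  The
helper below is only a sketch, using the StreamXidHash type and the
changes_filename() helper from the diff plus the extended buffile API
from 0012:

/*
 * Look up the per-transaction entry in xidhash and open its changes file
 * for reading, mirroring what apply_handle_stream_commit() does below.
 */
static BufFile *
open_changes_file_for_xid(HTAB *xidhash, Oid subid, TransactionId xid)
{
	char		path[MAXPGPATH];
	bool		found;
	StreamXidHash *ent;

	ent = (StreamXidHash *) hash_search(xidhash, (void *) &xid,
										HASH_FIND, &found);
	Assert(found);				/* stream start must have created the entry */

	changes_filename(path, subid, xid);
	return BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
}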
---
 src/backend/replication/logical/worker.c | 630 ++++++++++++++-----------------
 1 file changed, 281 insertions(+), 349 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index d2d9469..a543ee9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -32,9 +32,12 @@
  * to truncate the file with serialized changes.
  *
  * The files are placed in tmp file directory by default, and the filenames
- * include both the XID of the toplevel transaction and OID of the subscription.
- * This is necessary so that different workers processing a remote transaction
- * with the same XID don't interfere.
+ * include both the XID of the toplevel transaction and OID of the
+ * subscription.  This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use buffiles instead of normal temporary files because the buffile
+ * infrastructure supports temporary files that exceed the OS file size limit.
  *
  *-------------------------------------------------------------------------
  */
@@ -56,6 +59,7 @@
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -85,6 +89,7 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -123,10 +128,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see an xid, we create this entry in
+ * xidhash, and we also create the streaming file and store the fileset
+ * handle, so that on a subsequent stream for the same xid we can look up the
+ * entry in the hash and get the fileset handle.  The subxact file is created
+ * only if there is any subxact info under this xid.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
-static MemoryContext LogicalStreamingContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -136,15 +157,26 @@ bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
 /* fields valid only when processing streamed transaction */
-bool	in_streamed_transaction = false;
+bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
-static int	stream_fd = -1;
+
+/*
+ * Hash table for storing the streaming xid information along with shared file
+ * set for streaming and subxact files.  On every stream start we need to
+ * open the xid's files, and for that we need the shared fileset handle, so
+ * storing it in the xid hash makes the lookup faster.
+ */
+static HTAB *xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
 
 typedef struct SubXactInfo
 {
-	TransactionId xid;						/* XID of the subxact */
-	off_t           offset;					/* offset in the file */
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
 } SubXactInfo;
 
 static uint32 nsubxacts = 0;
@@ -171,13 +203,6 @@ static void stream_open_file(Oid subid, TransactionId xid, bool first);
 static void stream_write_change(char action, StringInfo s);
 static void stream_close_file(void);
 
-/*
- * Array of serialized XIDs.
- */
-static int	nxids = 0;
-static int	maxnxids = 0;
-static TransactionId	*xids = NULL;
-
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
@@ -275,7 +300,7 @@ handle_streamed_transaction(const char action, StringInfo s)
 	if (!in_streamed_transaction)
 		return false;
 
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 	Assert(TransactionIdIsValid(stream_xid));
 
 	/*
@@ -666,31 +691,39 @@ static void
 apply_handle_stream_start(StringInfo s)
 {
 	bool		first_segment;
+	HASHCTL		hash_ctl;
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the buffile,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
 	/* notify handle methods we're processing a remote transaction */
 	in_streamed_transaction = true;
 
 	/* extract XID of the top-level transaction */
 	stream_xid = logicalrep_read_stream_start(s, &first_segment);
 
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
 	/* open the spool file for this transaction */
 	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
 
-	/*
-	 * if this is not the first segment, open existing file
-	 *
-	 * XXX Note that the cleanup is performed by stream_open_file.
-	 */
+	/* if this is not the first segment, open existing file */
 	if (!first_segment)
-	{
-		MemoryContext oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
-
-		/* Read the subxacts info in per-stream context. */
 		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
-		MemoryContextSwitchTo(oldctx);
-	}
 
 	pgstat_report_activity(STATE_RUNNING, NULL);
 }
@@ -710,6 +743,12 @@ apply_handle_stream_stop(StringInfo s)
 	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
 	stream_close_file();
 
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
 	in_streamed_transaction = false;
 
 	/* Reset per-stream context */
@@ -736,10 +775,7 @@ apply_handle_stream_abort(StringInfo s)
 	 * just delete the files with serialized info.
 	 */
 	if (xid == subxid)
-	{
 		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
-		return;
-	}
 	else
 	{
 		/*
@@ -761,11 +797,13 @@ apply_handle_stream_abort(StringInfo s)
 
 		int64		i;
 		int64		subidx;
-		int			fd;
+		BufFile    *fd;
 		bool		found = false;
 		char		path[MAXPGPATH];
+		StreamXidHash *ent;
 
 		subidx = -1;
+		ensure_transaction();
 		subxact_info_read(MyLogicalRepWorker->subid, xid);
 
 		/* XXX optimize the search by bsearch on sorted data */
@@ -787,33 +825,32 @@ apply_handle_stream_abort(StringInfo s)
 		{
 			/* Cleanup the subxact info */
 			cleanup_subxact_info();
+			CommitTransactionCommand();
 			return;
 		}
 
 		Assert((subidx >= 0) && (subidx < nsubxacts));
 
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
 		changes_filename(path, MyLogicalRepWorker->subid, xid);
-		fd = OpenTransientFile(path, O_WRONLY | PG_BINARY);
-		if (fd < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not open file \"%s\": %m",
-							path)));
-		}
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
 
-		/* OK, truncate the file at the right offset. */
-		if (ftruncate(fd, subxacts[subidx].offset))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not truncate file \"%s\": %m", path)));
-		CloseTransientFile(fd);
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
 
 		/* discard the subxacts added later */
 		nsubxacts = subidx;
 
 		/* write the updated subxact list */
 		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
 	}
 }
 
@@ -823,16 +860,16 @@ apply_handle_stream_abort(StringInfo s)
 static void
 apply_handle_stream_commit(StringInfo s)
 {
-	int			fd;
 	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
-
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
+	bool		found;
 	LogicalRepCommitData commit_data;
-
+	StreamXidHash *ent;
 	MemoryContext oldcxt;
+	BufFile    *fd;
 
 	Assert(!in_streamed_transaction);
 
@@ -840,25 +877,20 @@ apply_handle_stream_commit(StringInfo s)
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
 
-	/* open the spool file for the committed transaction */
-	changes_filename(path, MyLogicalRepWorker->subid, xid);
-
-	elog(DEBUG1, "replaying changes from file '%s'", path);
-
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
-	}
-
 	ensure_transaction();
-
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	buffer = palloc(8192);
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
 	initStringInfo(&s2);
 
 	MemoryContextSwitchTo(oldcxt);
@@ -881,9 +913,7 @@ apply_handle_stream_commit(StringInfo s)
 		int			len;
 
 		/* read length of the on-disk record */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		nbytes = read(fd, &len, sizeof(len));
-		pgstat_report_wait_end();
+		nbytes = BufFileRead(fd, &len, sizeof(len));
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -891,16 +921,9 @@ apply_handle_stream_commit(StringInfo s)
 
 		/* do we have a correct length? */
 		if (nbytes != sizeof(len))
-		{
-			int			save_errno = errno;
-
-			CloseTransientFile(fd);
-			errno = save_errno;
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not read file: %m")));
-			return;
-		}
+					 errmsg("could not read from streaming transaction's changes file: %m")));
 
 		Assert(len > 0);
 
@@ -908,19 +931,10 @@ apply_handle_stream_commit(StringInfo s)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
-		pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_READ);
-		if (read(fd, buffer, len) != len)
-		{
-			int			save_errno = errno;
-
-			CloseTransientFile(fd);
-			errno = save_errno;
+		if (BufFileRead(fd, buffer, len) != len)
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not read file: %m")));
-			return;
-		}
-		pgstat_report_wait_end();
+					 errmsg("could not read from streaming transaction's changes file: %m")));
 
 		/* copy the buffer to the stringinfo and call apply_dispatch */
 		resetStringInfo(&s2);
@@ -948,15 +962,11 @@ apply_handle_stream_commit(StringInfo s)
 		 */
 		send_feedback(InvalidXLogRecPtr, false, false);
 	}
-
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	BufFileClose(fd);
 
 	/*
-	 * Update origin state so we can restart streaming from correct
-	 * position in case of crash.
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
 	 */
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
@@ -1946,12 +1956,39 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 static void
 worker_onexit(int code, Datum arg)
 {
-	int	i;
+	HASH_SEQ_STATUS status;
+	StreamXidHash *ent;
+	char		path[MAXPGPATH];
+
+	/* nothing to clean */
+	if (xidhash == NULL)
+		return;
+
+	/*
+	 * Scan the complete hash and delete the underlying files for the xids.
+	 * Also release the memory for the shared file sets.
+	 */
+	hash_seq_init(&status, xidhash);
+	while ((ent = (StreamXidHash *) hash_seq_search(&status)) != NULL)
+	{
+		changes_filename(path, MyLogicalRepWorker->subid, ent->xid);
+		BufFileDeleteShared(ent->stream_fileset, path);
+		pfree(ent->stream_fileset);
 
-	elog(LOG, "cleanup files for %d transactions", nxids);
+		/*
+		 * We might not have created the subxact fileset if there is no
+		 * subtransaction.
+		 */
+		if (ent->subxact_fileset)
+		{
+			subxact_filename(path, MyLogicalRepWorker->subid, ent->xid);
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+		}
+	}
 
-	for (i = nxids-1; i >= 0; i--)
-		stream_cleanup_files(MyLogicalRepWorker->subid, xids[i], true);
+	/* Remove the xid hash */
+	hash_destroy(xidhash);
 }
 
 /*
@@ -1972,11 +2009,11 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 
 	/*
 	 * This memory context used for per stream data when streaming mode is
-	 * enabled.  This context is reeset on each stream stop.
+	 * enabled.  This context is reset on each stream stop.
 	 */
 	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
 													"LogicalStreamingContext",
-													 ALLOCSET_DEFAULT_SIZES);
+													ALLOCSET_DEFAULT_SIZES);
 
 	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
 	before_shmem_exit(worker_onexit, (Datum) 0);
@@ -2085,7 +2122,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -2441,64 +2478,62 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 static void
 subxact_info_write(Oid subid, TransactionId xid)
 {
-	int			fd;
 	char		path[MAXPGPATH];
+	bool		found;
 	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
 
 	Assert(TransactionIdIsValid(xid));
 
 	subxact_filename(path, subid, xid);
 
-	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m",
-						path)));
-		return;
-	}
-
-	len = sizeof(SubXactInfo) * nsubxacts;
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top-level transaction by now */
+	Assert(found);
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
-
-	if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+	/*
+	 * If there is no subtransaction then there is nothing to do, but if a
+	 * subxact file already exists, delete it.
+	 */
+	if (nsubxacts == 0)
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
 		return;
 	}
 
-	if ((len > 0) && (write(fd, subxacts, len) != len))
+	/*
+	 * Create the subxact file if it is not already created; otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
 	{
-		int			save_errno = errno;
+		ent->subxact_fileset =
+			MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
 
-		CloseTransientFile(fd);
-		errno = save_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not write to file \"%s\": %m",
-						path)));
-		return;
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
 	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
 
-	pgstat_report_wait_end();
+	len = sizeof(SubXactInfo) * nsubxacts;
 
-	/*
-	 * We don't need to fsync or anything, as we'll recreate the files after a
-	 * crash from scratch. So just close the file.
-	 */
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
 
 	/*
 	 * But we free the memory allocated for subxact info. There might be one
@@ -2513,50 +2548,45 @@ subxact_info_write(Oid subid, TransactionId xid)
  *	  Restore information about subxacts of a streamed transaction.
  *
  * Read information about subxacts into the global variables.
- *
- * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
  */
 static void
 subxact_info_read(Oid subid, TransactionId xid)
 {
-	int			fd;
 	char		path[MAXPGPATH];
+	bool		found;
 	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
 
 	Assert(TransactionIdIsValid(xid));
 	Assert(!subxacts);
 	Assert(nsubxacts == 0);
 	Assert(nsubxacts_max == 0);
 
-	subxact_filename(path, subid, xid);
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
 
-	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (fd < 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
 		return;
-	}
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
+	subxact_filename(path, subid, xid);
 
-	/* read number of subxact items */
-	if (read(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
-	{
-		int			save_errno = errno;
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
 
-		CloseTransientFile(fd);
-		errno = save_errno;
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
 						path)));
-		return;
-	}
-
-	pgstat_report_wait_end();
 
 	len = sizeof(SubXactInfo) * nsubxacts;
 
@@ -2564,35 +2594,23 @@ subxact_info_read(Oid subid, TransactionId xid)
 	nsubxacts_max = 1 << my_log2(nsubxacts);
 
 	/*
-	 * Let the caller decide which memory context it will be allocated.
-	 * Ideally, during stream start it will be allocated in the
-	 * LogicalStreamingContext which will be reset on stream stop, and
-	 * during the stream abort we need this memory only for short term so
-	 * it will be allocated in ApplyMessageContext.
+	 * Allocate subxact information in the logical streaming context.  We need
+	 * this information for the duration of the stream so that we can add
+	 * subtransaction info to it.  On stream stop we flush this information to
+	 * the subxact file and reset the logical streaming context.
 	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
 	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
-
-	if ((len > 0) && ((read(fd, subxacts, len)) != len))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
 						path)));
-		return;
-	}
-
-	pgstat_report_wait_end();
 
-	if (CloseTransientFile(fd) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not close file \"%s\": %m", path)));
+	BufFileClose(fd);
 }
 
 /*
@@ -2606,7 +2624,7 @@ subxact_info_add(TransactionId xid)
 
 	/* We must have a valid top level stream xid and a stream fd. */
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd >= 0);
+	Assert(stream_fd != NULL);
 
 	/*
 	 * If the XID matches the toplevel transaction, we don't want to add it.
@@ -2658,7 +2676,13 @@ subxact_info_add(TransactionId xid)
 	}
 
 	subxacts[nsubxacts].xid = xid;
-	subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
 
 	nsubxacts++;
 }
@@ -2667,44 +2691,14 @@ subxact_info_add(TransactionId xid)
 static void
 subxact_filename(char *path, Oid subid, TransactionId xid)
 {
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 */
-	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m",
-						tempdirpath)));
-
-	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
-			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
 }
 
 /* format filename for file containing serialized changes */
-static void
+static inline void
 changes_filename(char *path, Oid subid, TransactionId xid)
 {
-	char		tempdirpath[MAXPGPATH];
-
-	TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
-	/*
-	 * We might need to create the tablespace's tempfile directory, if no
-	 * one has yet done so.
-	 */
-	if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create directory \"%s\": %m",
-						tempdirpath)));
-
-	snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
-			 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
 }
 
 /*
@@ -2721,60 +2715,29 @@ changes_filename(char *path, Oid subid, TransactionId xid)
 static void
 stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
 {
-	int			i;
 	char		path[MAXPGPATH];
-	bool		found = false;
+	StreamXidHash *ent;
 
-	subxact_filename(path, subid, xid);
-
-	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
 
+	/* Delete the change file and release the stream fileset memory */
 	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
 
-	if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m", path)));
-
-	/*
-	 * Cleanup the XID from the array - find the XID in the array and
-	 * remove it by shifting all the remaining elements. The array is
-	 * bound to be fairly small (maximum number of in-progress xacts,
-	 * so max_connections + max_prepared_transactions) so simply loop
-	 * through the array and find index of the XID. Then move the rest
-	 * of the array by one element to the left.
-	 *
-	 * Notice we also call this from stream_open_file for first segment
-	 * of each transaction, to deal with possible left-overs after a
-	 * crash, so it's entirely possible not to find the XID in the
-	 * array here. In that case we don't remove anything.
-	 *
-	 * XXX Perhaps it'd be better to handle this automatically after a
-	 * restart, instead of doing it over and over for each transaction.
-	 */
-	for (i = 0; i < nxids; i++)
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
 	{
-		if (xids[i] == xid)
-		{
-			found = true;
-			break;
-		}
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
 	}
-
-	if (!found)
-		return;
-
-	/*
-	 * Move the last entry from the array to the place. We don't keep
-	 * the streamed transactions sorted or anything - we only expect
-	 * a few of them in progress (max_connections + max_prepared_xacts)
-	 * so linear search is just fine.
-	 */
-	xids[i] = xids[nxids-1];
-	nxids--;
 }
 
 /*
@@ -2783,8 +2746,8 @@ stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
  *
  * Open a file for streamed changes from a toplevel transaction identified
  * by stream_xid (global variable). If it's the first chunk of streamed
- * changes for this transaction, perform cleanup by removing existing
- * files after a possible previous crash.
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
  *
  * This can only be called at the beginning of a "streaming" block, i.e.
  * between stream_start/stream_stop messages from the upstream.
@@ -2793,79 +2756,61 @@ static void
 stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 {
 	char		path[MAXPGPATH];
-	int			flags;
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
 
 	Assert(in_streamed_transaction);
 	Assert(OidIsValid(subid));
 	Assert(TransactionIdIsValid(xid));
-	Assert(stream_fd == -1);
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER | HASH_FIND,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
 
 	/*
-	 * If this is the first segment for this transaction, try removing
-	 * existing files (if there are any, possibly after a crash).
+	 * Create/open the buffiles under the logical streaming context so that we
+	 * have those files until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
 	 */
 	if (first_segment)
 	{
-		MemoryContext	oldcxt;
-
-		/* XXX make sure there are no previous files for this transaction */
-		stream_cleanup_files(subid, xid, true);
-
-		/* Need to allocate this in permanent context */
-		oldcxt = MemoryContextSwitchTo(ApplyContext);
-
 		/*
-		 * We need to remember the XIDs we spilled to files, so that we can
-		 * remove them at worker exit (e.g. after DROP SUBSCRIPTION).
-		 *
-		 * The number of XIDs we may need to track is fairly small, because
-		 * we can only stream toplevel xacts (so limited by max_connections
-		 * and max_prepared_transactions), and we only stream the large ones.
-		 * So we simply keep the XIDs in an unsorted array. If the number of
-		 * xacts gets large for some reason (e.g. very high max_connections),
-		 * a more elaborate approach might be better - e.g. sorted array, to
-		 * speed-up the lookups.
+		 * Shared fileset handle must be allocated in the persistent context.
 		 */
-		if (nxids == maxnxids)	/* array of XIDs is full */
-		{
-			if (!xids)
-			{
-				maxnxids = 64;
-				xids = palloc(maxnxids * sizeof(TransactionId));
-			}
-			else
-			{
-				maxnxids = 2 * maxnxids;
-				xids = repalloc(xids, maxnxids * sizeof(TransactionId));
-			}
-		}
+		SharedFileSet *fileset =
+		MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
 
-		xids[nxids++] = xid;
+		SharedFileSetInit(fileset, NULL);
+		stream_fd = BufFileCreateShared(fileset, path);
 
-		MemoryContextSwitchTo(oldcxt);
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
 	}
-
-	changes_filename(path, subid, xid);
-
-	elog(DEBUG1, "opening file '%s' for streamed changes", path);
-
-	/*
-	 * If this is the first streamed segment, the file must not exist, so
-	 * make sure we're the ones creating it. Otherwise just open the file
-	 * for writing, in append mode.
-	 */
-	if (first_segment)
-		flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
 	else
-		flags = (O_WRONLY | O_APPEND | PG_BINARY);
-
-	stream_fd = OpenTransientFile(path, flags);
-
-	if (stream_fd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m",
-						path)));
+	{
+		/*
+		 * Open the file and seek to the end because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+	MemoryContextSwitchTo(oldcxt);
 }
 
 /*
@@ -2880,12 +2825,12 @@ stream_close_file(void)
 {
 	Assert(in_streamed_transaction);
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 
-	CloseTransientFile(stream_fd);
+	BufFileClose(stream_fd);
 
 	stream_xid = InvalidTransactionId;
-	stream_fd = -1;
+	stream_fd = NULL;
 }
 
 /*
@@ -2907,34 +2852,21 @@ stream_write_change(char action, StringInfo s)
 
 	Assert(in_streamed_transaction);
 	Assert(TransactionIdIsValid(stream_xid));
-	Assert(stream_fd != -1);
+	Assert(stream_fd != NULL);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
-	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_CHANGES_WRITE);
-
 	/* first write the size */
-	if (write(stream_fd, &len, sizeof(len)) != sizeof(len))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
+	BufFileWrite(stream_fd, &len, sizeof(len));
 
 	/* then the action */
-	if (write(stream_fd, &action, sizeof(action)) != sizeof(action))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
+	BufFileWrite(stream_fd, &action, sizeof(action));
 
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
-	if (write(stream_fd, &s->data[s->cursor], len) != len)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not serialize streamed change to file: %m")));
-
-	pgstat_report_wait_end();
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
 }
 
 /*
-- 
1.8.3.1
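
For readers skimming the hunks above: the xidhash that the new code relies on is created lazily when the first stream-start message arrives. A minimal sketch of that setup (the helper name and initial table size are illustrative assumptions, not the patch's exact code):

#include "storage/sharedfileset.h"
#include "utils/hsearch.h"

/* Hash entry: per-xid bookkeeping for the serialized stream files. */
typedef struct StreamXidHash
{
	TransactionId xid;				/* xid is the hash key and must be first */
	SharedFileSet *stream_fileset;	/* shared file set for stream data */
	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
} StreamXidHash;

static HTAB *xidhash = NULL;

/* Illustrative helper: build the xid hash in a long-lived context. */
static void
ensure_xidhash(void)
{
	HASHCTL		hash_ctl;

	if (xidhash != NULL)
		return;

	memset(&hash_ctl, 0, sizeof(hash_ctl));
	hash_ctl.keysize = sizeof(TransactionId);
	hash_ctl.entrysize = sizeof(StreamXidHash);
	hash_ctl.hcxt = ApplyContext;	/* must survive stream start/stop */

	xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
						  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
}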

v31-0014-POC-On_procexit_cleanup.patch

From 73e5f623904d40f74793e7174f8d82880ca355d3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Fri, 26 Jun 2020 11:30:13 +0530
Subject: [PATCH v31 14/14] POC On_procexit_cleanup

---
 src/backend/replication/logical/worker.c | 70 ++++++++------------------------
 src/backend/storage/file/buffile.c       |  3 ++
 src/backend/storage/file/sharedfileset.c | 62 ++++++++++++++++++++++++++++
 src/include/storage/sharedfileset.h      |  1 +
 4 files changed, 84 insertions(+), 52 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a543ee9..a6d52a8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1949,49 +1949,6 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 }
 
 /*
- * Cleanup function.
- *
- * Called on logical replication worker exit.
- */
-static void
-worker_onexit(int code, Datum arg)
-{
-	HASH_SEQ_STATUS status;
-	StreamXidHash *ent;
-	char		path[MAXPGPATH];
-
-	/* nothing to clean */
-	if (xidhash == NULL)
-		return;
-
-	/*
-	 * Scan complete hash and delete the underlying files for the xids.
-	 * Also release the memory for the shared file sets.
-	 */
-	hash_seq_init(&status, xidhash);
-	while ((ent = (StreamXidHash *) hash_seq_search(&status)) != NULL)
-	{
-		changes_filename(path, MyLogicalRepWorker->subid, ent->xid);
-		BufFileDeleteShared(ent->stream_fileset, path);
-		pfree(ent->stream_fileset);
-
-		/*
-		 * We might not have created the subxact fileset if there is no
-		 * subtransaction.
-		 */
-		if (ent->subxact_fileset)
-		{
-			subxact_filename(path, MyLogicalRepWorker->subid, ent->xid);
-			BufFileDeleteShared(ent->subxact_fileset, path);
-			pfree(ent->subxact_fileset);
-		}
-	}
-
-	/* Remove the xid hash */
-	hash_destroy(xidhash);
-}
-
-/*
  * Apply main loop.
  */
 static void
@@ -2015,9 +1972,6 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 													"LogicalStreamingContext",
 													ALLOCSET_DEFAULT_SIZES);
 
-	/* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
-	before_shmem_exit(worker_onexit, (Datum) 0);
-
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -2518,10 +2472,17 @@ subxact_info_write(Oid subid, TransactionId xid)
 	 */
 	if (ent->subxact_fileset == NULL)
 	{
-		ent->subxact_fileset =
-			MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+		MemoryContext oldctx;
 
+		/*
+		 * We need to maintain shared fileset across multiple stream start/stop
+		 * calls.  So, we allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
 		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
 		fd = BufFileCreateShared(ent->subxact_fileset, path);
 	}
 	else
@@ -2787,13 +2748,18 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 	 */
 	if (first_segment)
 	{
+		MemoryContext oldctx;
+		SharedFileSet *fileset;
+
 		/*
-		 * Shared fileset handle must be allocated in the persistent context.
+		 * We need to maintain shared fileset across multiple stream start/stop
+		 * calls.  So, we allocate it in a persistent context.
 		 */
-		SharedFileSet *fileset =
-		MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
-
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
 		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
 		stream_fd = BufFileCreateShared(fileset, path);
 
 		/* Remember the fileset for the next stream of the same transaction */
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index bde6fa1..502875a 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -364,6 +364,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 		CHECK_FOR_INTERRUPTS();
 	}
 
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
+
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
 }
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 0907f79..c9ccb84 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -25,10 +25,14 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	/* Register our cleanup callback. */
 	if (seg)
 		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		if (filesetlist == NIL)
+			on_proc_exit(SharedFileSetOnProcExit, 0);
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -214,6 +225,57 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on process exit.  This will
+ * process the list of all the shared filesets registered and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry, registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is following the DSM-based cleanup then we don't
+	 * maintain the filesetlist, so return.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	/* Loop over all the shared fileset entries to find the input fileset */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index b2f4ba4..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -42,5 +42,6 @@ extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1
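
To summarize the POC above, the intended lifecycle of a worker-local (non-DSM) SharedFileSet looks roughly like this (an illustrative sketch, not code from the patch):

static void
fileset_lifecycle_sketch(const char *path)
{
	SharedFileSet *fileset;
	BufFile    *fd;

	/* Must survive stream start/stop, so allocate it persistently. */
	fileset = MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));

	/*
	 * Passing seg == NULL adds the fileset to filesetlist and, on the
	 * first call, registers SharedFileSetOnProcExit so that leftover
	 * files are removed at process exit.
	 */
	SharedFileSetInit(fileset, NULL);

	fd = BufFileCreateShared(fileset, path);
	/* ... serialize streamed changes ... */
	BufFileClose(fd);

	/*
	 * Explicit deletion also calls SharedFileSetUnregister(), so the
	 * proc-exit callback won't try to delete the files a second time.
	 */
	BufFileDeleteShared(fileset, path);
	pfree(fileset);
}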

#400Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#398)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Can't we name the last parameter as 'commit_lsn' as that is how
documentation in the patch spells it and it sounds more appropriate?

You are right commit_lsn seems more appropriate here.

Also, is there a reason for assigning report_location and
write_location differently than what we do in commit_cb_wrapper?
Basically, assign those as txn->final_lsn and txn->end_lsn
respectively.

Yes, I think it should be handled in the same way as commit_cb_wrapper.
Because before calling ReorderBufferStreamCommit in
ReorderBufferCommit, we are properly updating the final_lsn as well as
the end_lsn.
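
For reference, the agreed assignment in stream_commit_cb_wrapper would then mirror commit_cb_wrapper, roughly:

	/* Sketch: set locations the same way commit_cb_wrapper does. */
	state.report_location = txn->final_lsn; /* beginning of commit record */
	ctx->write_location = txn->end_lsn; /* points to the end of the record */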

Okay, I have made these changes in the attached patch and there are
a few more changes in
0003-Extend-the-output-plugin-API-with-stream-methods.
1. In pg_decode_stream_message, for transactional messages, we were
displaying message contents which is different from other streaming
APIs. I have changed it so that streaming API doesn't display message
contents for transactional messages.

Ok, makes sense.
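
The reworked test_decoding callback would be shaped roughly like this (a sketch of the behavior described above, not the exact committed code):

static void
pg_decode_stream_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
						 XLogRecPtr lsn, bool transactional,
						 const char *prefix, Size sz, const char *message)
{
	OutputPluginPrepareWrite(ctx, true);

	if (transactional)
	{
		/* Skip contents of transactional messages, matching the other
		 * streaming callbacks. */
		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
						 transactional, prefix, sz);
	}
	else
	{
		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
						 transactional, prefix, sz);
		appendBinaryStringInfo(ctx->out, message, sz);
	}

	OutputPluginWrite(ctx, true);
}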

2.
+ /* in streaming mode, stream_change_cb is required */
+ if (ctx->callbacks.stream_change_cb == NULL)
+ ereport(ERROR,
+ (errmsg("Output plugin supports streaming, but has not registered "
+ "stream_change_cb callback.")));

The error messages seem a bit weird: (a) they don't include an error
code, (b) they are not in PG style. I have changed all the error
messages to fix these two issues and reworded them as well.

ok
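
So the check becomes something like the following (a sketch; the exact errcode and wording are assumptions):

	/* In streaming mode, stream_change_cb is required. */
	if (ctx->callbacks.stream_change_cb == NULL)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("output plugin supports streaming, but has not registered stream_change_cb callback")));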

3. Rearranged the functions stream_* so that the optional functions
are at the end and also arranged other functions in a way that looks
more logical to me.

Makes sense to me.

4. Updated comments, commit message, and edited docs in the patch.

I have made a few changes in
0004-Gracefully-handle-concurrent-aborts-of-transacti as well.
1. The variable bsysscan was not being reset in case of error. I have
introduced a new function to reset both bsysscan and CheckXidAlive
during transaction abort. Also, snapmgr.c doesn't seem right place
for these variables, so I moved them to xact.c. I think this will
make the initialization of CheckXidAlive during catch in
ReorderBufferProcessTXN redundant.

That looks better.

2. Updated comments and commit message.

Let me know what you think about the above changes.

All the above changes look good to me and I will include them in the next version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#401Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#399)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me know what you think about the above changes.

I went ahead and made few changes in
0005-Implement-streaming-mode-in-ReorderBuffer which are explained
below. I have few questions and suggestions for the patch as well
which are also covered in below points.

1.
+ if (prev_lsn == InvalidXLogRecPtr)
+ {
+ if (streaming)
+ rb->stream_start(rb, txn, change->lsn);
+ else
+ rb->begin(rb, txn);
+ stream_started = true;
+ }

I don't think we want to move the begin callback here, as that will change
the existing semantics, so it is better to keep begin at its original
position. I have made the required changes in the attached patch.

Looks good to me.

2.
ReorderBufferTruncateTXN()
{
..
+ dlist_foreach_modify(iter, &txn->changes)
+ {
+ ReorderBufferChange *change;
+
+ change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+ /* remove the change from its containing list */
+ dlist_delete(&change->node);
+
+ ReorderBufferReturnChange(rb, change);
+ }
..
}

I think here we can add an Assert that we're not mixing changes from
different transactions. See the changes in the patch.

Looks fine.
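
The added assertion checks each change against the transaction being truncated, roughly as below (this assumes each ReorderBufferChange carries a back-pointer to its transaction, which the streaming patches add):

	dlist_foreach_modify(iter, &txn->changes)
	{
		ReorderBufferChange *change;

		change = dlist_container(ReorderBufferChange, node, iter.cur);

		/* Check we're not mixing changes from different transactions. */
		Assert(change->txn == txn);

		/* remove the change from its containing list */
		dlist_delete(&change->node);

		ReorderBufferReturnChange(rb, change);
	}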

3.
SetupCheckXidLive()
{
..
+ /*
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also, reset the
+ * bsysscan flag.
+ */
+ if (!TransactionIdDidCommit(xid))
+ {
+ CheckXidAlive = xid;
+ bsysscan = false;
..
}

What is the need to reset the bsysscan flag here if we are already
resetting it on error (like in the previous patch sent by me)?

Yeah, now we don't need this.

4.
ReorderBufferProcessTXN()
{
..
..
+ /* Reset the CheckXidAlive */
+ if (streaming)
+ CheckXidAlive = InvalidTransactionId;
..
}

Similar to the previous point, we don't need this as well because
AbortCurrentTransaction would have taken care of this.

Right

5.
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which attempts to
+ * stream it again before the commit)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)

The above comment doesn't make much sense to me, so I have removed it.
Basically, if there are no changes before commit, we still need to
send the commit, and anyway if there are no more changes
ReorderBufferProcessTXN will not do anything.

ok

6.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
if (txn->snapshot_now == NULL)
+ {
+ dlist_iter subxact_i;
+
+ /* make sure this transaction is streamed for the first time */
+ Assert(!rbtxn_is_streamed(txn));
+
+ /* at the beginning we should have invalid command ID */
+ Assert(txn->command_id == InvalidCommandId);
+
+ dlist_foreach(subxact_i, &txn->subtxns)
+ {
+ ReorderBufferTXN *subtxn;
+
+ subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+ ReorderBufferTransferSnapToParent(txn, subtxn);
+ }
..
}

Here, it is possible that there is no base_snapshot for txn, so we
need a check for that similar to ReorderBufferCommit.

7. Apart from the above, I made few changes in comments and ran pgindent.

Ok
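
For point 6, the missing check would mirror the one in ReorderBufferCommit, roughly (a sketch only; whether additional cleanup is needed here is a patch detail):

	/*
	 * If this transaction has no base snapshot, it made no catalog-visible
	 * changes, so there is nothing to stream.
	 */
	if (txn->base_snapshot == NULL)
	{
		Assert(txn->ninvalidations == 0);
		return;
	}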

8. We can't stream the transaction before we reach the
SNAPBUILD_CONSISTENT state because some other output plugin can apply
those changes unlike what we do with pgoutput plugin (which writes to
file). And, I think applying the transactions without reaching a
consistent state would be wrong anyway. So, we should avoid that, and
if we do that then we should have an Assert for streamed txns rather than
sending an abort for them in ReorderBufferForget.

I will work on this point.

9.
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
{
..
+ ReorderBufferToastReset(rb, txn);
+ if (specinsert != NULL)
+ ReorderBufferReturnChange(rb, specinsert);
..
}

Why do we need to do these here when it wouldn't have been done for
any exception other than ERRCODE_TRANSACTION_ROLLBACK?

Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
gracefully and we are continuing with further decoding, so we need to
return this change.

10. I have got the below failure once. I have not investigated this
in detail as the patch is still under progress. See, if you have any
idea?
# Failed test 'check extra columns contain local defaults'
# at t/013_stream_subxact_ddl_abort.pl line 81.
# got: '2|0'
# expected: '1000|500'
# Looks like you failed 1 test of 2.
make[2]: *** [check] Error 1
make[1]: *** [check-subscription-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

I got the failure once too, and after that it did not reproduce. I
have executed it multiple times but it did not reproduce again. Are
you able to reproduce it consistently?

11. Can we test by introducing a new GUC such that all the
transactions (at least in existing tests) start to stream? Basically,
it will allow us to disregard logical_decoding_work_mem and ensure
that all regression tests pass through new-code. Note, I am
suggesting this just for testing purposes, not for actual integration
in the code.

Yeah, that's a good suggestion.
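
A sketch of such a developer-only GUC (the name and wiring below are hypothetical):

/* Hypothetical flag, e.g. in reorderbuffer.c: */
bool		debug_force_streaming = false;

/* Hypothetical guc.c entry, under DEVELOPER_OPTIONS: */
{
	{"debug_force_streaming", PGC_USERSET, DEVELOPER_OPTIONS,
		gettext_noop("Forces streaming of all in-progress transactions, "
					 "ignoring logical_decoding_work_mem."),
		NULL,
		GUC_NOT_IN_SAMPLE
	},
	&debug_force_streaming,
	false,
	NULL, NULL, NULL
},

/* ReorderBufferCheckMemoryLimit would then skip its early return: */
if (!debug_force_streaming &&
	rb->size < logical_decoding_work_mem * 1024L)
	return;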

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#402Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#397)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Other than above tests, can we somehow verify that the invalidations
generated at commit time are the same as what we do with this patch?
We have verified with individual commands but it would be great if we
can verify for the regression tests.

I have verified this using a few random test cases. For verifying
this I have made some temporary code changes with an assert as shown
below. Basically, in DecodeCommit we call the
ReorderBufferAddInvalidations function only for assert checking.

-void
-ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
-							  XLogRecPtr lsn, Size nmsgs,
-							  SharedInvalidationMessage *msgs)
+void
+ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
+							  XLogRecPtr lsn, Size nmsgs,
+							  SharedInvalidationMessage *msgs, bool commit)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
-
+	if (commit)
+	{
+		Assert(txn->ninvalidations == nmsgs);
+		return;
+	}

The result is that for a normal local test it works fine. But with
the regression suite, it hits the assert in many places, because if a
rollback of a subtransaction is involved then the invalidation messages
are not logged at commit time, whereas with command-time invalidations
they are logged.

Yeah, somehow, we need to ignore rollback to savepoint tests and
verify for others.

Yeah, I have run the regression suite and I can see a lot of failures;
maybe we can somehow see the diff and confirm that all the failures
are due to rollback to savepoint only. I will work on this.

I have compared the changes logged at command end vs. those logged at
commit time. I have ignored the invalidations for transactions that
have any aborted subtransaction in them. While testing this I found
one issue: if some invalidations are generated between the last command
counter increment and the commit, then those were not logged. I have
fixed the issue by logging the pending invalidations in
RecordTransactionCommit. I will include the changes in the next patch
set.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v32-0002-WAL-Log-invalidations-at-command-end-with-wal_le.patch (application/octet-stream)
From 7e9432410d8c30b78f621e0ad0d8d32b14ecd7bf Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v32 02/14] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end uses a new xlog record type
XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in top-transaction, and then
executed during replay.  This obviates the need to decode the
invalidations as part of a commit record.

LogStandbyInvalidations was accumulating all the invalidations in memory,
and then only wrote them once at commit time, which may reduce the
performance impact by amortizing the overhead and deduplicating the
invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 10 +++++
 src/backend/access/transam/xact.c               | 14 ++++++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 52 ++++++++++++++++++----
 src/backend/utils/cache/inval.c                 | 54 +++++++++++++++++++++++
 src/include/access/xact.h                       | 13 +++++-
 src/include/replication/reorderbuffer.h         |  3 ++
 src/include/utils/inval.h                       |  2 +
 8 files changed, 173 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..68aa994 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a93fb8a..ac1dc22 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1224,6 +1224,13 @@ RecordTransactionCommit(void)
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
 
+	/*
+	 * Log any pending invalidations which are added between the last
+	 * command counter increment and the commit.
+	 */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
 	nchildren = xactGetCommittedChildren(&children);
@@ -6022,6 +6029,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * XXX we ignore this for now; what matters are the invalidations
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..7153eba 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions,
+				 * otherwise, accumulate them so that they can be processed at
+				 * the commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 642a1c7..4b277fe 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -860,6 +860,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2205,7 +2208,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases where we skip processing the
+ * transaction (see ReorderBufferForget), we need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2216,17 +2223,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2254,6 +2279,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark the top-level transaction as having catalog changes too if one of
+	 * its children has, so that ReorderBufferBuildTupleCidHash can
+	 * conveniently check just the top-level transaction and decide whether to
+	 * build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..e3fa723 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *      CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -1094,6 +1098,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1510,48 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..ac3f5e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4..74ffe78 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index bc5081c..463888c 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -61,4 +61,6 @@ extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
+
+extern void LogLogicalInvalidations(void);
 #endif							/* INVAL_H */
-- 
1.8.3.1

#403Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#401)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

9.
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
{
..
+ ReorderBufferToastReset(rb, txn);
+ if (specinsert != NULL)
+ ReorderBufferReturnChange(rb, specinsert);
..
}

Why do we need to do these here when it wouldn't have been done for
any exception other than ERRCODE_TRANSACTION_ROLLBACK?

Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
gracefully and we are continuing with further decoding, so we need to
return this change.

Okay, then I suggest we should do these before calling stream_stop and
also move ReorderBufferResetTXN after calling stream_stop to follow a
pattern similar to the try block, unless there is a reason for not doing
so. Also, it would be good if we can initialize specinsert with NULL
after returning the change as we are doing at other places.
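
Putting point 9 and the suggestion above together, the catch block would be shaped roughly like this (a sketch with names and callback arguments abbreviated, not the patch text):

	PG_CATCH();
	{
		ErrorData  *errdata = CopyErrorData();

		/*
		 * A concurrent abort surfaces as ERRCODE_TRANSACTION_ROLLBACK;
		 * handle it gracefully and continue decoding other transactions.
		 */
		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			FlushErrorState();
			FreeErrorData(errdata);

			/* Return resources before stopping the stream. */
			ReorderBufferToastReset(rb, txn);
			if (specinsert != NULL)
			{
				ReorderBufferReturnChange(rb, specinsert);
				specinsert = NULL;
			}

			/* Stop the stream, then discard the already-streamed changes. */
			rb->stream_stop(rb, txn);
			ReorderBufferTruncateTXN(rb, txn);
		}
		else
			PG_RE_THROW();
	}
	PG_END_TRY();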

10. I have got the below failure once. I have not investigated this
in detail as the patch is still under progress. See, if you have any
idea?
# Failed test 'check extra columns contain local defaults'
# at t/013_stream_subxact_ddl_abort.pl line 81.
# got: '2|0'
# expected: '1000|500'
# Looks like you failed 1 test of 2.
make[2]: *** [check] Error 1
make[1]: *** [check-subscription-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

I got the failure once too, and after that it did not reproduce. I
have executed it multiple times but it did not reproduce again. Are
you able to reproduce it consistently?

No, I am also not able to reproduce it consistently but I think this
can fail if a subscriber sends the replay_location before actually
replaying the changes. First, I thought that extra send_feedback we
have in apply_handle_stream_commit might have caused this but I guess
that can't happen because we need the commit time location for that
and we are storing the same at the end of apply_handle_stream_commit
after applying all messages. I am not sure what is going on here. I
think we somehow need to reproduce this or some variant of this test
consistently to find the root cause.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#404Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#403)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

9.
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
{
..
+ ReorderBufferToastReset(rb, txn);
+ if (specinsert != NULL)
+ ReorderBufferReturnChange(rb, specinsert);
..
}

Why do we need to do these here when it wouldn't have been done for
any exception other than ERRCODE_TRANSACTION_ROLLBACK?

Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
gracefully and we are continuing with further decoding, so we need to
return this change.

Okay, then I suggest we should do these before calling stream_stop and
also move ReorderBufferResetTXN after calling stream_stop to follow a
pattern similar to try block unless there is a reason for not doing
so. Also, it would be good if we can initialize specinsert with NULL
after returning the change as we are doing at other places.

Okay

10. I have got the below failure once. I have not investigated this
in detail as the patch is still under progress. See, if you have any
idea?
# Failed test 'check extra columns contain local defaults'
# at t/013_stream_subxact_ddl_abort.pl line 81.
# got: '2|0'
# expected: '1000|500'
# Looks like you failed 1 test of 2.
make[2]: *** [check] Error 1
make[1]: *** [check-subscription-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

I got the failure once too, and after that it did not reproduce. I
have executed it multiple times but it did not reproduce again. Are
you able to reproduce it consistently?

No, I am also not able to reproduce it consistently but I think this
can fail if a subscriber sends the replay_location before actually
replaying the changes. First, I thought that extra send_feedback we
have in apply_handle_stream_commit might have caused this but I guess
that can't happen because we need the commit time location for that
and we are storing the same at the end of apply_handle_stream_commit
after applying all messages. I am not sure what is going on here. I
think we somehow need to reproduce this or some variant of this test
consistently to find the root cause.

And I think it appeared for the first time for me, so some changes in
the last few versions might have exposed it. I have noticed that almost
50% of the time I am able to reproduce it after a clean build, so I can
trace back the version in which it started appearing; that way it will
be easy to narrow down.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#405Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#404)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 6, 2020 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

10. I have got the below failure once. I have not investigated this
in detail as the patch is still under progress. See, if you have any
idea?
# Failed test 'check extra columns contain local defaults'
# at t/013_stream_subxact_ddl_abort.pl line 81.
# got: '2|0'
# expected: '1000|500'
# Looks like you failed 1 test of 2.
make[2]: *** [check] Error 1
make[1]: *** [check-subscription-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

I got the failure once too, and after that it did not reproduce. I
have executed it multiple times but it did not reproduce again. Are
you able to reproduce it consistently?

No, I am also not able to reproduce it consistently but I think this
can fail if a subscriber sends the replay_location before actually
replaying the changes. First, I thought that extra send_feedback we
have in apply_handle_stream_commit might have caused this but I guess
that can't happen because we need the commit time location for that
and we are storing the same at the end of apply_handle_stream_commit
after applying all messages. I am not sure what is going on here. I
think we somehow need to reproduce this or some variant of this test
consistently to find the root cause.

And I think it appeared for the first time for me, so some changes in
the last few versions might have exposed it. I have noticed that almost
50% of the time I am able to reproduce it after a clean build, so I can
trace back the version in which it started appearing; that way it will
be easy to narrow down.

One more comment
ReorderBufferLargestTopTXN
{
..
dlist_foreach(iter, &rb->toplevel_by_lsn)
{
ReorderBufferTXN *txn;
+ Size size = 0;
+ Size largest_size = 0;

txn = dlist_container(ReorderBufferTXN, node, iter.cur);

- /* if the current transaction is larger, remember it */
- if ((!largest) || (txn->size > largest->size))
+ /*
+ * If this transaction has some incomplete changes then only consider
+ * the size up to the last complete lsn.
+ */
+ if (rbtxn_has_incomplete_tuple(txn))
+ size = txn->complete_size;
+ else
+ size = txn->total_size;
+
+ /* If the current transaction is larger then remember it. */
+ if ((largest != NULL || size > largest_size) && size > 0)

Here largest_size is a local variable inside the loop which is
initialized to 0 in each iteration and that will lead to picking each
next txn as largest. This seems wrong to me.
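
In other words, the selection has to track the maximum across iterations; a sketch of the corrected shape (field names taken from the quoted hunk):

static ReorderBufferTXN *
ReorderBufferLargestTopTXN(ReorderBuffer *rb)
{
	dlist_iter	iter;
	Size		largest_size = 0;	/* must live outside the loop */
	ReorderBufferTXN *largest = NULL;

	dlist_foreach(iter, &rb->toplevel_by_lsn)
	{
		ReorderBufferTXN *txn;
		Size		size;

		txn = dlist_container(ReorderBufferTXN, node, iter.cur);

		/*
		 * If the transaction has incomplete changes, only the part up to
		 * the last complete lsn can be streamed.
		 */
		size = rbtxn_has_incomplete_tuple(txn) ? txn->complete_size
											   : txn->total_size;

		/* Remember the largest candidate seen so far. */
		if (size > largest_size)
		{
			largest = txn;
			largest_size = size;
		}
	}

	return largest;
}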

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#406Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#405)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 6, 2020 at 3:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 6, 2020 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

10. I have got the below failure once. I have not investigated this
in detail as the patch is still under progress. See, if you have any
idea?
# Failed test 'check extra columns contain local defaults'
# at t/013_stream_subxact_ddl_abort.pl line 81.
# got: '2|0'
# expected: '1000|500'
# Looks like you failed 1 test of 2.
make[2]: *** [check] Error 1
make[1]: *** [check-subscription-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

Even I got the failure once and after that, it did not reproduce. I
have executed it multiple times but it did not reproduce again. Are
you able to reproduce it consistently?

No, I am also not able to reproduce it consistently but I think this
can fail if a subscriber sends the replay_location before actually
replaying the changes. First, I thought that extra send_feedback we
have in apply_handle_stream_commit might have caused this but I guess
that can't happen because we need the commit time location for that
and we are storing the same at the end of apply_handle_stream_commit
after applying all messages. I am not sure what is going on here. I
think we somehow need to reproduce this or some variant of this test
consistently to find the root cause.

And I think it appeared for the first time for me, so changes in the
last few versions might have exposed it. I have noticed that almost
50% of the time I am able to reproduce it after a clean build, so I can
trace back the version in which it started appearing; that way it will
be easy to narrow down.

One more comment
ReorderBufferLargestTopTXN
{
..
dlist_foreach(iter, &rb->toplevel_by_lsn)
{
ReorderBufferTXN *txn;
+ Size size = 0;
+ Size largest_size = 0;

txn = dlist_container(ReorderBufferTXN, node, iter.cur);

- /* if the current transaction is larger, remember it */
- if ((!largest) || (txn->size > largest->size))
+ /*
+ * If this transaction have some incomplete changes then only consider
+ * the size upto last complete lsn.
+ */
+ if (rbtxn_has_incomplete_tuple(txn))
+ size = txn->complete_size;
+ else
+ size = txn->total_size;
+
+ /* If the current transaction is larger then remember it. */
+ if ((largest != NULL || size > largest_size) && size > 0)

Here largest_size is a local variable inside the loop, which is
initialized to 0 in each iteration, and that will lead to picking each
next txn as the largest. This seems wrong to me.

You are right, will fix.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#407Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#402)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yeah, I have run the regression suite and I can see a lot of failures;
maybe we can somehow see the diff and confirm that all the failures
are due to rollback to savepoint only. I will work on this.

I have compared the changes logged at command end vs logged at commit
time. I have ignored the invalidations for transactions which have
any aborted subtransaction in them. While testing this I found one
issue: if there are some invalidations generated between the last
command counter increment and the commit of the transaction, then
those were not logged. I have fixed the issue by logging the pending
invalidations in RecordTransactionCommit. I will include the changes
in the next patch set.

I think it would have been better if you could have given examples of
such cases where you need this extra logging. Anyway, below are a few
minor comments on this patch:

1.
+ /*
+ * Log any pending invalidations which are adding between the last
+ * command counter increment and the commit.
+ */
+ if (XLogLogicalInfoActive())
+ LogLogicalInvalidations();

I think we can change this comment slightly and extend a bit to say
for which kind of special cases we are adding this. "Log any pending
invalidations which are added between the last CommandCounterIncrement
and the commit. Normally for DDLs, we log this at each command end,
however for certain cases where we directly update the system table
the invalidations were not logged at command end."

Something like above based on cases that are not covered by command
end WAL logging.

2.
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+void
+LogLogicalInvalidations()

After this is getting used at a new place, it is better to modify the
above comment to something like: "Emit WAL for invalidations. This is
currently only used for logging invalidations at the command end or at
commit time if any invalidations are pending."

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#408Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#407)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have compared the changes logged at command end vs logged at commit
time. I have ignored the invalidations for transactions which have
any aborted subtransaction in them. While testing this I found one
issue: if there are some invalidations generated between the last
command counter increment and the commit of the transaction, then
those were not logged. I have fixed the issue by logging the pending
invalidations in RecordTransactionCommit. I will include the changes
in the next patch set.

I think it would have been better if you could have given examples of
such cases where you need this extra logging. Anyway, below are a few
minor comments on this patch:

1.
+ /*
+ * Log any pending invalidations which are adding between the last
+ * command counter increment and the commit.
+ */
+ if (XLogLogicalInfoActive())
+ LogLogicalInvalidations();

I think we can change this comment slightly and extend a bit to say
for which kind of special cases we are adding this. "Log any pending
invalidations which are added between the last CommandCounterIncrement
and the commit. Normally for DDLs, we log this at each command end,
however for certain cases where we directly update the system table
the invalidations were not logged at command end."

Something like above based on cases that are not covered by command
end WAL logging.

2.
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+void
+LogLogicalInvalidations()

After this is getting used at a new place, it is better to modify the
above comment to something like: "Emit WAL for invalidations. This is
currently only used for logging invalidations at the command end or at
commit time if any invalidations are pending."

I have done some more review and below are my comments:

Review-v31-0010-Provide-new-api-to-get-the-streaming-changes
----------------------------------------------------------------------------------------------
1.
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int,
VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';

If we are going to add a new streaming API for get_changes, don't we
need it for pg_logical_slot_get_binary_changes,
pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes
as well? I was thinking why not add a new parameter (streaming
boolean) instead of adding the new APIs. This could be an optional
parameter which, if the user doesn't specify it, will be considered
false. We already have optional parameters for APIs like
pg_create_logical_replication_slot.

2. You forgot to update sgml/func.sgml. This will be required even if
we decide to add a new parameter instead of a new API.

3.
+ /* If called has not asked for streaming changes then disable it. */
+ ctx->streaming &= streaming;

/If called/If the caller

4.
diff --git a/.gitignore b/.gitignore
index 794e35b..6083744 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/

Why the patch contains this change?

5. If I apply the first six patches and run the regression tests, they
fail primarily because streaming got enabled by default. And then when
I applied this patch, the tests passed because it disables streaming by
default. I think this should be patch 0007.

Replication Origins
------------------------------
I think we also need to conclude on origins related discussion [1].
As far as I can see, the origin_id can be sent with the first startup
message. The origin_lsn and origin_commit can be sent with the last
start of streaming commit if we want, but I am not sure if that is of
use. If we need to send it earlier then we need to record it with
other WAL records. The point is that those are set with
pg_replication_origin_xact_setup but I am not sure how and when that
function is called. The other alternative is that we can ignore that
for now and once the usage is clear we can enhance it. What do you
think?

[1]: /messages/by-id/CAA4eK1JwXaCezFw+kZwoxbLKYD0nWpC2rPgx7vUsaDAc0AZaow@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#409Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#408)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

I was going through this thread, testing and reviewing the patches; I
think this is a great feature to have and one which customers would
appreciate. I wanted to help out, and I saw a request for a test patch
for a GUC to always enable streaming on logical replication. Here's one
on top of patchset v31, in case you still need it. By default the GUC
is turned on; I ran the regression tests with it and didn't see any
errors.

thanks,
Ajin
Fujitsu Australia

On Wed, Jul 8, 2020 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have compared the changes logged at command end vs logged at commit
time. I have ignored the invalidations for transactions which have
any aborted subtransaction in them. While testing this I found one
issue: if there are some invalidations generated between the last
command counter increment and the commit of the transaction, then
those were not logged. I have fixed the issue by logging the pending
invalidations in RecordTransactionCommit. I will include the changes
in the next patch set.

I think it would have been better if you could have given examples of
such cases where you need this extra logging. Anyway, below are a few
minor comments on this patch:

1.
+ /*
+ * Log any pending invalidations which are adding between the last
+ * command counter increment and the commit.
+ */
+ if (XLogLogicalInfoActive())
+ LogLogicalInvalidations();

I think we can change this comment slightly and extend a bit to say
for which kind of special cases we are adding this. "Log any pending
invalidations which are added between the last CommandCounterIncrement
and the commit. Normally for DDLs, we log this at each command end,
however for certain cases where we directly update the system table
the invalidations were not logged at command end."

Something like above based on cases that are not covered by command
end WAL logging.

2.
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+void
+LogLogicalInvalidations()

After this is getting used at a new place, it is better to modify the
above comment to something like: "Emit WAL for invalidations. This is
currently only used for logging invalidations at the command end or at
commit time if any invalidations are pending."

I have done some more review and below are my comments:

Review-v31-0010-Provide-new-api-to-get-the-streaming-changes
----------------------------------------------------------------------------------------------
1.
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
VOLATILE ROWS 1000 COST 1000
AS 'pg_logical_slot_get_changes';
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int,
VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';

If we are going to add a new streaming API for get_changes, don't we
need it for pg_logical_slot_get_binary_changes,
pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes
as well? I was thinking why not add a new parameter (streaming
boolean) instead of adding the new APIs. This could be an optional
parameter which, if the user doesn't specify it, will be considered
false. We already have optional parameters for APIs like
pg_create_logical_replication_slot.

2. You forgot to update sgml/func.sgml. This will be required even if
we decide to add a new parameter instead of a new API.

3.
+ /* If called has not asked for streaming changes then disable it. */
+ ctx->streaming &= streaming;

/If called/If the caller

4.
diff --git a/.gitignore b/.gitignore
index 794e35b..6083744 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
/Debug/
/Release/
/tmp_install/
+/build/

Why the patch contains this change?

5. If I apply the first six patches and run the regression tests, they
fail primarily because streaming got enabled by default. And then when
I applied this patch, the tests passed because it disables streaming by
default. I think this should be patch 0007.

Replication Origins
------------------------------
I think we also need to conclude on origins related discussion [1].
As far as I can see, the origin_id can be sent with the first startup
message. The origin_lsn and origin_commit can be sent with the last
start of streaming commit if we want, but I am not sure if that is of
use. If we need to send it earlier then we need to record it with
other WAL records. The point is that those are set with
pg_replication_origin_xact_setup but I am not sure how and when that
function is called. The other alternative is that we can ignore that
for now and once the usage is clear we can enhance it. What do you
think?

[1] - /messages/by-id/CAA4eK1JwXaCezFw+kZwoxbLKYD0nWpC2rPgx7vUsaDAc0AZaow@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v31-0015-TEST-guc-always-streaming-logical.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c7f1877..8c147bd 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -82,6 +82,8 @@ bool		XactDeferrable;
 
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
+bool		always_stream_logical = true;
+
 /*
  * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
  * transaction.  Currently, it is used in logical decoding.  It's possible
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2ceb192..843b0f5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3476,6 +3476,10 @@ ReorderBufferCanStream(ReorderBuffer *rb)
 {
 	LogicalDecodingContext *ctx = rb->private_data;
 
+	/* force streaming on logical replication if guc set */
+	if (always_stream_logical)
+		ctx->streaming = true;
+
 	return ctx->streaming;
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3a802d8..8f5144d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2041,6 +2041,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"always_stream_logical", PGC_USERSET, REPLICATION_MASTER,
+			gettext_noop("Always stream during logical replication, do not spill to disk."),
+		},
+		&always_stream_logical,
+		true,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5f767eb..f99d9c7 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -65,6 +65,9 @@ extern bool xact_is_sampled;
 extern bool DefaultXactDeferrable;
 extern bool XactDeferrable;
 
+/* to turn on forced  streaming of logical replication */
+extern bool always_stream_logical;
+
 typedef enum
 {
 	SYNCHRONOUS_COMMIT_OFF,		/* asynchronous commit */
#410Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#409)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:

I was going through this thread, testing and reviewing the patches; I think this is a great feature to have and one which customers would appreciate. I wanted to help out, and I saw a request for a test patch for a GUC to always enable streaming on logical replication. Here's one on top of patchset v31, in case you still need it. By default the GUC is turned on; I ran the regression tests with it and didn't see any errors.

Thanks for showing interest in the patch. How have you ensured that
streaming is happening? I don't think the proposed patch can ensure
it for every case because we also rely on logical_decoding_work_mem to
decide whether to stream/spill, see ReorderBufferCheckMemoryLimit. I
think with your patch it will allow streaming for cases where we have
a large amount of WAL to decode.

I feel you need to add some DEBUG messages (or some other way) to
ensure that all existing and new test cases related to logical
decoding will perform the streaming.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#411Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#410)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:

Thanks for showing interest in the patch. How have you ensured that
streaming is happening? I don't think the proposed patch can ensure
it for every case because we also rely on logical_decoding_work_mem to
decide whether to stream/spill, see ReorderBufferCheckMemoryLimit. I
think with your patch it will allow streaming for cases where we have
a large amount of WAL to decode.

Maybe I missed something, but I looked at ReorderBufferCheckMemoryLimit;
even there it checks the same function ReorderBufferCanStream() and
decides whether to stream or spill. Did I miss something?

while (rb->size >= logical_decoding_work_mem * 1024L)
{
/*
* Pick the largest transaction (or subtransaction) and evict it from
* memory by streaming, if supported. Otherwise, spill to disk.
*/
if (ReorderBufferCanStream(rb) &&
(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
{
/* we know there has to be one, because the size is not zero */
Assert(txn && !txn->toptxn);
Assert(txn->total_size > 0);
Assert(rb->size >= txn->total_size);

ReorderBufferStreamTXN(rb, txn);
}
else
{

I will also add a debug message and a test as you suggested.

regards,
Ajin Cherian
Fujitsu Australia

#412Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#411)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:

Thanks for showing interest in the patch. How have you ensured that
streaming is happening? I don't think the proposed patch can ensure
it for every case because we also rely on logical_decoding_work_mem to
decide whether to stream/spill, see ReorderBufferCheckMemoryLimit. I
think with your patch it will allow streaming for cases where we have
a large amount of WAL to decode.

Maybe I missed something, but I looked at ReorderBufferCheckMemoryLimit; even there it checks the same function ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something?

while (rb->size >= logical_decoding_work_mem * 1024L)
{

There is a check before the above loop:

ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
ReorderBufferTXN *txn;

/* bail out if we haven't exceeded the memory limit */
if (rb->size < logical_decoding_work_mem * 1024L)
return;

This will prevent the streaming/spill to occur.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#413Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#412)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jul 9, 2020 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:

Thanks for showing interest in the patch. How have you ensured that
streaming is happening? I don't think the proposed patch can ensure
it for every case because we also rely on logical_decoding_work_mem to
decide whether to stream/spill, see ReorderBufferCheckMemoryLimit. I
think with your patch it will allow streaming for cases where we have
a large amount of WAL to decode.

Maybe I missed something, but I looked at ReorderBufferCheckMemoryLimit; even there it checks the same function ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something?

while (rb->size >= logical_decoding_work_mem * 1024L)
{

There is a check before the above loop:

ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
ReorderBufferTXN *txn;

/* bail out if we haven't exceeded the memory limit */
if (rb->size < logical_decoding_work_mem * 1024L)
return;

This will prevent the streaming/spill to occur.

I think if the GUC is set then maybe we can bypass this check so that
it can try to stream every single change?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#414Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#413)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jul 9, 2020 at 8:55 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jul 9, 2020 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:

Thanks for showing interest in the patch. How have you ensured that
streaming is happening? I don't think the proposed patch can ensure
it for every case because we also rely on logical_decoding_work_mem to
decide whether to stream/spill, see ReorderBufferCheckMemoryLimit. I
think with your patch it will allow streaming for cases where we have
a large amount of WAL to decode.

Maybe I missed something, but I looked at ReorderBufferCheckMemoryLimit; even there it checks the same function ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something?

while (rb->size >= logical_decoding_work_mem * 1024L)
{

There is a check before the above loop:

ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
ReorderBufferTXN *txn;

/* bail out if we haven't exceeded the memory limit */
if (rb->size < logical_decoding_work_mem * 1024L)
return;

This will prevent the streaming/spill to occur.

I think if the GUC is set then maybe we can bypass this check so that
it can try to stream every single change?

Yeah and probably we need to do something for the check "while
(rb->size >= logical_decoding_work_mem * 1024L)" as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#415Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#408)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 8, 2020 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have compared the changes logged at command end vs logged at commit
time. I have ignored the invalidations for transactions which have
any aborted subtransaction in them. While testing this I found one
issue: if there are some invalidations generated between the last
command counter increment and the commit of the transaction, then
those were not logged. I have fixed the issue by logging the pending
invalidations in RecordTransactionCommit. I will include the changes
in the next patch set.

I think it would have been better if you could have given examples of
such cases where you need this extra logging. Anyway, below are a few
minor comments on this patch:

1.
+ /*
+ * Log any pending invalidations which are adding between the last
+ * command counter increment and the commit.
+ */
+ if (XLogLogicalInfoActive())
+ LogLogicalInvalidations();

I think we can change this comment slightly and extend a bit to say
for which kind of special cases we are adding this. "Log any pending
invalidations which are added between the last CommandCounterIncrement
and the commit. Normally for DDLs, we log this at each command end,
however for certain cases where we directly update the system table
the invalidations were not logged at command end."

Something like above based on cases that are not covered by command
end WAL logging.

2.
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+void
+LogLogicalInvalidations()

After this is getting used at a new place, it is better to modify the
above comment to something like: "Emit WAL for invalidations. This is
currently only used for logging invalidations at the command end or at
commit time if any invalidations are pending."

I have done some more review and below are my comments:

Review-v31-0010-Provide-new-api-to-get-the-streaming-changes
----------------------------------------------------------------------------------------------
1.
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
VOLATILE ROWS 1000 COST 1000
AS 'pg_logical_slot_get_changes';
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int,
VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';

If we are going to add a new streaming API for get_changes, don't we
need it for pg_logical_slot_get_binary_changes,
pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes
as well? I was thinking why not add a new parameter (streaming
boolean) instead of adding the new APIs. This could be an optional
parameter which, if the user doesn't specify it, will be considered
false. We already have optional parameters for APIs like
pg_create_logical_replication_slot.

2. You forgot to update sgml/func.sgml. This will be required even if
we decide to add a new parameter instead of a new API.

3.
+ /* If called has not asked for streaming changes then disable it. */
+ ctx->streaming &= streaming;

/If called/If the caller

4.
diff --git a/.gitignore b/.gitignore
index 794e35b..6083744 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
/Debug/
/Release/
/tmp_install/
+/build/

Why the patch contains this change?

5. If I apply the first six patches and run the regression tests, they
fail primarily because streaming got enabled by default. And then when
I applied this patch, the tests passed because it disables streaming by
default. I think this should be patch 0007.

Only replying to the replication origin point; the other comments look
fine to me, so I will work on those.

Replication Origins
------------------------------
I think we also need to conclude on origins related discussion [1].
As far as I can see, the origin_id can be sent with the first startup
message. The origin_lsn and origin_commit can be sent with the last
start of streaming commit if we want, but I am not sure if that is of
use. If we need to send it earlier then we need to record it with
other WAL records. The point is that those are set with
pg_replication_origin_xact_setup but I am not sure how and when that
function is called.

pg_replication_origin_xact_setup is an exposed function, so this will
allow a user to set an origin for their session so that all the
operations done from that session will be marked with that origin id.
And the clear use case for this is to avoid sending such transactions
by using FilterByOrigin. But I am not sure about the point that we
discussed at [1], that is, what is the use of the origin and origin_lsn
we send at pgoutput_begin_txn.

The other alternative is that we can ignore that for now and once the
usage is clear we can enhance it. What do you think?

That seems like a sensible option to me.

[1] - /messages/by-id/CAA4eK1JwXaCezFw+kZwoxbLKYD0nWpC2rPgx7vUsaDAc0AZaow@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#416Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#415)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jul 9, 2020 at 2:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jul 8, 2020 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Only replying to the replication origin point, other comment looks
fine to me so I will work on those.

Replication Origins
------------------------------
I think we also need to conclude on origins related discussion [1].
As far as I can see, the origin_id can be sent with the first startup
message. The origin_lsn and origin_commit can be sent with the last
start of streaming commit if we want but not sure if that is of use.
If we need to send it earlier then we need to record it with other WAL
records. The point is that those are set with
pg_replication_origin_xact_setup but not sure how and when that
function is called.

pg_replication_origin_xact_setup is an exposed function, so this will
allow a user to set an origin for their session so that all the
operations done from that session will be marked with that origin id.

Hmm, I think that can be done by pg_replication_origin_session_setup.

And the clear use case for this is to avoid sending such transactions
by using FilterByOrigin. But I am not sure about the point that we
discussed at [1], that is, what is the use of the origin and origin_lsn
we send at pgoutput_begin_txn.

I could see the use of 'origin' with FilterByOrigin but I am not sure
how origin_lsn can be used.

The other alternative is that we can ignore that for now and once the
usage is clear we can enhance it. What do you think?

That seems like a sensible option to me.

I have responded on that other thread. Let us see if someone
responds to it. Feel free to add anything if you have some points
related to that thread.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#417Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#414)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jul 9, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think if the GUC is set then maybe we can bypass this check so that
it can try to stream every single change?

Yeah and probably we need to do something for the check "while
(rb->size >= logical_decoding_work_mem * 1024L)" as well.

I have made this change; as discussed, the regression tests seem to
run fine. I have added a debug message that records the streaming of
each transaction number. I also had to bypass certain asserts in
ReorderBufferLargestTopTXN() as now we are going through the entire
list of transactions and not just picking the biggest transaction.

regards,
Ajin
Fujitsu Australia

Attachments:

v31-0015-TEST-guc-always-streaming-logical.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c7f1877..8c147bd 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -82,6 +82,8 @@ bool		XactDeferrable;
 
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
+bool		always_stream_logical = true;
+
 /*
  * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
  * transaction.  Currently, it is used in logical decoding.  It's possible
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2ceb192..97aed74 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3105,9 +3105,12 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
    }
 
-   Assert(largest);
-   Assert(largest->size > 0);
-   Assert(largest->size <= rb->size);
+   if (!always_stream_logical)
+   {
+       Assert(largest);
+       Assert(largest->size > 0);
+       Assert(largest->size <= rb->size);
+   }
 
    return largest;
 }
@@ -3130,8 +3133,22 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	ReorderBufferTXN *txn;
 
 	/* bail out if we haven't exceeded the memory limit */
-	if (rb->size < logical_decoding_work_mem * 1024L)
+	if (!always_stream_logical && rb->size < logical_decoding_work_mem * 1024L)
+		return;
+
+	/* If GUC set to always stream, then stream everything */
+	if (always_stream_logical)
+	{
+		while ((txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			ReorderBufferStreamTXN(rb, txn);
+			elog(DEBUG2, "initiate stream for changes in XID %u",
+				  txn->xid);
+
+		}
 		return;
+	}
+
 
 	/*
 	 * Loop until we reach under the memory limit.  One might think that just
@@ -3476,6 +3493,10 @@ ReorderBufferCanStream(ReorderBuffer *rb)
 {
 	LogicalDecodingContext *ctx = rb->private_data;
 
+	/* force streaming on logical replication if guc set */
+	if (always_stream_logical)
+		ctx->streaming = true;
+
 	return ctx->streaming;
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3a802d8..8f5144d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2041,6 +2041,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"always_stream_logical", PGC_USERSET, REPLICATION_MASTER,
+			gettext_noop("Always stream during logical replication, do not spill to disk."),
+		},
+		&always_stream_logical,
+		true,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5f767eb..f99d9c7 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -65,6 +65,9 @@ extern bool xact_is_sampled;
 extern bool DefaultXactDeferrable;
 extern bool XactDeferrable;
 
+/* to turn on forced  streaming of logical replication */
+extern bool always_stream_logical;
+
 typedef enum
 {
 	SYNCHRONOUS_COMMIT_OFF,		/* asynchronous commit */
#418Dilip Kumar
dilipbalaut@gmail.com
In reply to: Ajin Cherian (#417)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jul 10, 2020 at 9:21 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Jul 9, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think if the GUC is set then maybe we can bypass this check so that
it can try to stream every single change?

Yeah and probably we need to do something for the check "while
(rb->size >= logical_decoding_work_mem * 1024L)" as well.

I have made this change; as discussed, the regression tests seem to run fine. I have added a debug message that records the streaming of each transaction number. I also had to bypass certain asserts in ReorderBufferLargestTopTXN() as now we are going through the entire list of transactions and not just picking the biggest transaction.

So if always_stream_logical is true then we are always going for
streaming even if the size limit is not reached, and that is good. And
if always_stream_logical is set then we are setting ctx->streaming=true,
which is also good. So now I don't think we need to change this part
of the code, because when we bypass the memory limit and set
ctx->streaming=true it will always select the streaming option unless
it is impossible. With your changes, sometimes due to incomplete toast
changes, if it can not pick the largest top txn for streaming it will
hang forever in the while loop; in that case, it should go for
spilling.

while (rb->size >= logical_decoding_work_mem * 1024L)
{
/*
* Pick the largest transaction (or subtransaction) and evict it from
* memory by streaming, if supported. Otherwise, spill to disk.
*/
if (ReorderBufferCanStream(rb) &&
(txn = ReorderBufferLargestTopTXN(rb)) != NULL)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#419Ajin Cherian
itsajin@gmail.com
In reply to: Dilip Kumar (#418)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jul 10, 2020 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

With your changes, sometimes due to incomplete toast
changes, if it can not pick the largest top txn for streaming it will
hang forever in the while loop; in that case, it should go for
spilling.

while (rb->size >= logical_decoding_work_mem * 1024L)
{
/*
* Pick the largest transaction (or subtransaction) and evict it from
* memory by streaming, if supported. Otherwise, spill to disk.
*/
if (ReorderBufferCanStream(rb) &&
(txn = ReorderBufferLargestTopTXN(rb)) != NULL)

Which condition is this (of not picking the largest top txn)?
Wouldn't ReorderBufferLargestTopTXN then return a NULL? If not, is
there a way to know that a transaction cannot be streamed, so there can
be an exit condition for the while loop?

regards,
Ajin Cherian
Fujitsu Australia

#420Dilip Kumar
dilipbalaut@gmail.com
In reply to: Ajin Cherian (#419)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jul 10, 2020 at 11:01 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Fri, Jul 10, 2020 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

With your changes, sometimes due to incomplete toast
changes, if it can not pick the largest top txn for streaming it will
hang forever in the while loop; in that case, it should go for
spilling.

while (rb->size >= logical_decoding_work_mem * 1024L)
{
/*
* Pick the largest transaction (or subtransaction) and evict it from
* memory by streaming, if supported. Otherwise, spill to disk.
*/
if (ReorderBufferCanStream(rb) &&
(txn = ReorderBufferLargestTopTXN(rb)) != NULL)

Which condition is this (of not picking the largest top txn)? Wouldn't ReorderBufferLargestTopTXN then return a NULL? If not, is there a way to know that a transaction cannot be streamed, so there can be an exit condition for the while loop?

Okay, I see, so if ReorderBufferLargestTopTXN returns NULL you are
breaking the loop. I did not see the other part of the patch but I
agree that it will not go into an infinite loop.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#421Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#399)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me know what you think about the above changes.

I went ahead and made a few changes in
0005-Implement-streaming-mode-in-ReorderBuffer which are explained
below. I have a few questions and suggestions for the patch as well,
which are also covered in the below points.

1.
+ if (prev_lsn == InvalidXLogRecPtr)
+ {
+ if (streaming)
+ rb->stream_start(rb, txn, change->lsn);
+ else
+ rb->begin(rb, txn);
+ stream_started = true;
+ }

I don't think we want to move the begin callback here as that will
change the existing semantics, so it is better to keep begin at its
original position. I have made the required changes in the attached
patch.

2.
ReorderBufferTruncateTXN()
{
..
+ dlist_foreach_modify(iter, &txn->changes)
+ {
+ ReorderBufferChange *change;
+
+ change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+ /* remove the change from it's containing list */
+ dlist_delete(&change->node);
+
+ ReorderBufferReturnChange(rb, change);
+ }
..
}

I think here we can add an Assert that we're not mixing changes from
different transactions. See the changes in the patch.

3.
SetupCheckXidLive()
{
..
+ /*
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also, reset the
+ * bsysscan flag.
+ */
+ if (!TransactionIdDidCommit(xid))
+ {
+ CheckXidAlive = xid;
+ bsysscan = false;
..
}

What is the need to reset the bsysscan flag here if we are already
resetting it on error (like in the previous patch sent by me)?

4.
ReorderBufferProcessTXN()
{
..
..
+ /* Reset the CheckXidAlive */
+ if (streaming)
+ CheckXidAlive = InvalidTransactionId;
..
}

Similar to the previous point, we don't need this as well because
AbortCurrentTransaction would have taken care of this.

5.
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which attempts to
+ * stream it again before the commit)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)

The above comment doesn't make much sense to me, so I have removed it.
Basically, if there are no changes before commit, we still need to
send commit and anyway if there are no more changes
ReorderBufferProcessTXN will not do anything.

6.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
if (txn->snapshot_now == NULL)
+ {
+ dlist_iter subxact_i;
+
+ /* make sure this transaction is streamed for the first time */
+ Assert(!rbtxn_is_streamed(txn));
+
+ /* at the beginning we should have invalid command ID */
+ Assert(txn->command_id == InvalidCommandId);
+
+ dlist_foreach(subxact_i, &txn->subtxns)
+ {
+ ReorderBufferTXN *subtxn;
+
+ subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+ ReorderBufferTransferSnapToParent(txn, subtxn);
+ }
..
}

Here, it is possible that there is no base_snapshot for txn, so we
need a check for that similar to ReorderBufferCommit.
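
For reference, a sketch of the kind of guard meant here, modeled on the
existing check in ReorderBufferCommit (the exact cleanup behavior is up
to the patch):

    /* nothing to stream if the txn never got a base snapshot */
    if (txn->base_snapshot == NULL)
    {
        Assert(txn->ninvalidations == 0);
        return;
    }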

7. Apart from the above, I made a few changes in comments and ran pgindent.

8. We can't stream the transaction before we reach the
SNAPBUILD_CONSISTENT state because some other output plugin can apply
those changes unlike what we do with the pgoutput plugin (which writes
to a file). And, I think applying the transactions without reaching a
consistent state would be wrong anyway. So, we should avoid that, and
if we do that then we should have an Assert for streamed txns rather
than sending abort for them in ReorderBufferForget.

I was analyzing this point. Currently, we only enable streaming in
StartReplicationSlot, so basically in CreateReplicationSlot the
streaming will always be off, because by that time the plugins have
not yet started up; that will happen only on StartReplicationSlot.
See the below snippet from patch 0007. However, I agree that during
start replication slot we might decode some extra WAL of a transaction
for which we already got the commit confirmation, and we must have a
way to avoid that. But I think we don't need to do anything for the
CONSISTENT snapshot point. What's your thought on this?

@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
WalSndPrepareWrite, WalSndWriteData,
WalSndUpdateProgress);

+ /*
+ * Make sure streaming is disabled here - we may have the methods,
+ * but we don't have anywhere to send the data yet.
+ */
+ ctx->streaming = false;
+

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#422Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#404)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

9.
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
{
..
+ ReorderBufferToastReset(rb, txn);
+ if (specinsert != NULL)
+ ReorderBufferReturnChange(rb, specinsert);
..
}

Why do we need to do these here when we wouldn't have been done for
any exception other than ERRCODE_TRANSACTION_ROLLBACK?

Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
gracefully and we are continuing with further decoding so we need to
return this change back.

Okay, then I suggest we should do these before calling stream_stop and
also move ReorderBufferResetTXN after calling stream_stop to follow a
pattern similar to try block unless there is a reason for not doing
so. Also, it would be good if we can initialize specinsert with NULL
after returning the change as we are doing at other places.

Okay

10. I have got the below failure once. I have not investigated this
in detail as the patch is still under progress. See, if you have any
idea?
# Failed test 'check extra columns contain local defaults'
# at t/013_stream_subxact_ddl_abort.pl line 81.
# got: '2|0'
# expected: '1000|500'
# Looks like you failed 1 test of 2.
make[2]: *** [check] Error 1
make[1]: *** [check-subscription-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

Even I got the failure once and after that, it did not reproduce. I
have executed it multiple times but it did not reproduce again. Are
you able to reproduce it consistently?

No, I am also not able to reproduce it consistently but I think this
can fail if a subscriber sends the replay_location before actually
replaying the changes. First, I thought that extra send_feedback we
have in apply_handle_stream_commit might have caused this but I guess
that can't happen because we need the commit time location for that
and we are storing the same at the end of apply_handle_stream_commit
after applying all messages. I am not sure what is going on here. I
think we somehow need to reproduce this or some variant of this test
consistently to find the root cause.

And I think it appeared for the first time for me, so changes in the
last few versions might have exposed it. I have noticed that almost
50% of the time I am able to reproduce it after a clean build, so I can
trace back the version in which it started appearing; that way it will
be easy to narrow down.

I think the reason for the failure is that we are not setting
remote_final_lsn in the streaming mode. I added multiple log messages
and from the logs it appeared that some of the logical WAL did not get
replayed due to the below check in should_apply_changes_for_rel:
return (rel->state == SUBREL_STATE_READY || (rel->state ==
SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));

I still need to do a detailed analysis of why this fails in
some cases. Basically, most of the time the rel->state is
SUBREL_STATE_READY so this check passes, but whenever the state is
SUBREL_STATE_SYNCDONE it fails because we never update
remote_final_lsn. I will try to set this value in
apply_handle_stream_commit and see whether it ever fails or not.
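
For illustration, a minimal sketch (hypothetical placement; assuming the
stream-commit message carries a LogicalRepCommitData, as the plain
commit path does) of what that could look like:

    static void
    apply_handle_stream_commit(StringInfo s)
    {
        LogicalRepCommitData commit_data;
        ...
        /*
         * Hypothetical fix: record the commit-time LSN, much like
         * apply_handle_begin does via begin_data.final_lsn, so that
         * should_apply_changes_for_rel() sees a valid remote_final_lsn.
         */
        remote_final_lsn = commit_data.commit_lsn;

        /* ... apply all the spooled messages ... */
    }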

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#423Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#421)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

8. We can't stream the transaction before we reach the
SNAPBUILD_CONSISTENT state because some other output plugin can apply
those changes unlike what we do with the pgoutput plugin (which writes
to a file). And, I think applying the transactions without reaching a
consistent state would be wrong anyway. So, we should avoid that, and
if we do that then we should have an Assert for streamed txns rather
than sending abort for them in ReorderBufferForget.

I was analyzing this point. Currently, we only enable streaming in
StartReplicationSlot, so basically in CreateReplicationSlot the
streaming will always be off, because by that time the plugins have
not yet started up; that will happen only on StartReplicationSlot.

What do you mean by 'startup' in the above sentence? AFAICS, we do
call startup_cb_wrapper in CreateInitDecodingContext which is called
from both CreateReplicationSlot and create_logical_replication_slot
before the start of decoding. In CreateInitDecodingContext, we call
StartupDecodingContext which should load the plugin.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#424Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#423)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

8. We can't stream the transaction before we reach the
SNAPBUILD_CONSISTENT state because some other output plugin can apply
those changes unlike what we do with the pgoutput plugin (which writes
to a file). And, I think applying the transactions without reaching a
consistent state would be wrong anyway. So, we should avoid that, and
if we do that then we should have an Assert for streamed txns rather
than sending abort for them in ReorderBufferForget.

I was analyzing this point. Currently, we only enable streaming in
StartReplicationSlot, so basically in CreateReplicationSlot the
streaming will always be off, because by that time the plugins have
not yet started up; that will happen only on StartReplicationSlot.

What do you mean by 'startup' in the above sentence? AFAICS, we do
call startup_cb_wrapper in CreateInitDecodingContext which is called
from both CreateReplicationSlot and create_logical_replication_slot
before the start of decoding. In CreateInitDecodingContext, we call
StartupDecodingContext which should load the plugin.

Yeah, you are right that we do call startup_cb_wrapper from
CreateInitDecodingContext as well. I think I got confused by the below
comment in patch 0007:

@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
WalSndPrepareWrite, WalSndWriteData,
WalSndUpdateProgress);
+ /*
+ * Make sure streaming is disabled here - we may have the methods,
+ * but we don't have anywhere to send the data yet.
+ */
+ ctx->streaming = false;
+

Basically, during CreateReplicationSlot we forcefully disable the
streaming with the comment "we don't have anywhere to send the data
yet". So my point is that during CreateReplicationSlot the streaming
will always be off, and once we are done with creating the slot we
will have a consistent snapshot. So my point is, can we just check
that while decoding, unless the current LSN reaches the
start_decoding_at point, we should not start streaming, and after that
we can start? At that time we can have an assert that the snapshot
should be CONSISTENT. However, before doing that I need to check why
we are setting ctx->streaming to false after creating the slot.
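
For illustration, a minimal sketch (hypothetical; not part of the posted
patches) of the kind of guard being discussed, using the existing
snapshot-builder API:

    /*
     * Only permit streaming once decoding has reached a consistent
     * snapshot; before that, keep spilling to disk.
     */
    ctx->streaming = (SnapBuildCurrentState(ctx->snapshot_builder) ==
                      SNAPBUILD_CONSISTENT);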

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#425Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#422)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

9.
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
{
..
+ ReorderBufferToastReset(rb, txn);
+ if (specinsert != NULL)
+ ReorderBufferReturnChange(rb, specinsert);
..
}

Why do we need to do these here when we wouldn't have been done for
any exception other than ERRCODE_TRANSACTION_ROLLBACK?

Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
gracefully and we are continuing with further decoding so we need to
return this change back.

Okay, then I suggest we should do these before calling stream_stop and
also move ReorderBufferResetTXN after calling stream_stop to follow a
pattern similar to try block unless there is a reason for not doing
so. Also, it would be good if we can initialize specinsert with NULL
after returning the change as we are doing at other places.

Okay

10. I have got the below failure once. I have not investigated this
in detail as the patch is still under progress. See, if you have any
idea?
# Failed test 'check extra columns contain local defaults'
# at t/013_stream_subxact_ddl_abort.pl line 81.
# got: '2|0'
# expected: '1000|500'
# Looks like you failed 1 test of 2.
make[2]: *** [check] Error 1
make[1]: *** [check-subscription-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

Even I got the failure once and after that, it did not reproduce. I
have executed it multiple times but it did not reproduce again. Are
you able to reproduce it consistently?

No, I am also not able to reproduce it consistently but I think this
can fail if a subscriber sends the replay_location before actually
replaying the changes. First, I thought that extra send_feedback we
have in apply_handle_stream_commit might have caused this but I guess
that can't happen because we need the commit time location for that
and we are storing the same at the end of apply_handle_stream_commit
after applying all messages. I am not sure what is going on here. I
think we somehow need to reproduce this or some variant of this test
consistently to find the root cause.

And I think it appeared for the first time for me, so changes in the
last few versions might have exposed it. I have noticed that almost
50% of the time I am able to reproduce it after a clean build, so I can
trace back the version in which it started appearing; that way it will
be easy to narrow down.

I think the reason for the failure is that we are not setting
remote_final_lsn in the streaming mode. I added multiple log messages
and from the logs it appeared that some of the logical WAL did not get
replayed due to the below check in should_apply_changes_for_rel:
return (rel->state == SUBREL_STATE_READY || (rel->state ==
SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));

I still need to do a detailed analysis of why this fails in
some cases. Basically, most of the time the rel->state is
SUBREL_STATE_READY so this check passes, but whenever the state is
SUBREL_STATE_SYNCDONE it fails because we never update
remote_final_lsn. I will try to set this value in
apply_handle_stream_commit and see whether it ever fails or not.

I have verified that after setting remote_final_lsn in
apply_handle_stream_commit, I don't see that regression failure in
over 70 runs, whereas without that change it failed 6 times in 50
runs. Apart from this, I have noticed one more thing related to the
same point: in apply_handle_commit we call process_syncing_tables,
whereas we do not call it in apply_handle_stream_commit.
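
For reference, a minimal sketch of the proposed fix (the
logicalrep_read_stream_commit and LogicalRepCommitData names follow
the patch set; the real function also has to replay all the spilled
messages):

static void
apply_handle_stream_commit(StringInfo s)
{
    LogicalRepCommitData commit_data;
    TransactionId xid;

    xid = logicalrep_read_stream_commit(s, &commit_data);

    /*
     * Set remote_final_lsn before replaying the spilled changes, so
     * that should_apply_changes_for_rel() sees the commit LSN for
     * relations still in SUBREL_STATE_SYNCDONE.
     */
    remote_final_lsn = commit_data.commit_lsn;

    /* ... replay all the changes spilled to file for this xid ... */

    /* Process syncing tables, just as apply_handle_commit() does. */
    process_syncing_tables(commit_data.end_lsn);
}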

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#426Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#424)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 13, 2020 at 10:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

8. We can't stream the transaction before we reach the
SNAPBUILD_CONSISTENT state, because some other output plugin could
apply those changes, unlike what we do with the pgoutput plugin
(which writes to a file). And I think applying the transactions
without reaching a consistent state would be wrong anyway. So we
should avoid that, and if we do avoid it, then we should have an
Assert for streamed txns rather than sending an abort for them in
ReorderBufferForget.

I was analyzing this point. Currently, we only enable streaming in
StartReplication, so in CreateReplicationSlot the streaming will
always be off, because by that time the plugins have not yet been
started up; that happens only on StartReplication.

What do you mean by 'startup' in the above sentence? AFAICS, we do
call startup_cb_wrapper in CreateInitDecodingContext which is called
from both CreateReplicationSlot and create_logical_replication_slot
before the start of decoding. In CreateInitDecodingContext, we call
StartupDecodingContext which should load the plugin.

Yeah, you are right that we do call startup_cb_wrapper from
CreateInitDecodingContext as well. I think I got confused by the
below comment in patch 0007:

@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
WalSndPrepareWrite, WalSndWriteData,
WalSndUpdateProgress);
+ /*
+ * Make sure streaming is disabled here - we may have the methods,
+ * but we don't have anywhere to send the data yet.
+ */
+ ctx->streaming = false;
+

Basically, during CreateReplicationSlot we forcefully disable
streaming, with the comment "we don't have anywhere to send the data
yet". So during CreateReplicationSlot the streaming will always be
off, and once we are done creating the slot we will have a consistent
snapshot. My point is: while decoding, can we simply not start
streaming until the current LSN reaches the start_decoding_at point,
and only start after that? At that point we can have an assert that
the snapshot is CONSISTENT. However, before doing that I need to
check why we set ctx->streaming to false after creating the slot.

I think you can refer to the commit message as well for that: "We
however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We don't
need to replicate the changes accumulated during this phase, and
moreover, we don't have a replication connection open so we don't have
where to send the data anyway.". I don't think this is a good way to
hack the streaming flag, because for the SQL APIs we don't have a good
reason to disable the streaming in this way. I guess if we had a
condition related to reaching a CONSISTENT snapshot during streaming,
then we wouldn't need to hack the streaming flag in this way. Once we
reach the CONSISTENT snapshot state, we come out of the replication
slot creation phase (see how we use DecodingContextReady to achieve
that). So, I feel we should remove the setting of ctx->streaming to
false and add a CONSISTENT snapshot check during streaming, unless
you have a reason for not doing so.
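
To illustrate, a sketch of what such a check could look like
(ReorderBufferCanStartStreaming is a hypothetical helper here;
SnapBuildXactNeedsSkip and SnapBuildCurrentState are existing
snapbuild.c APIs):

static bool
ReorderBufferCanStartStreaming(ReorderBuffer *rb)
{
    LogicalDecodingContext *ctx = rb->private_data;
    SnapBuild  *builder = ctx->snapshot_builder;

    /* Streaming must be requested and supported by the plugin. */
    if (!ctx->streaming)
        return false;

    /* Still catching up to start_decoding_at; do not stream yet. */
    if (SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
        return false;

    /* By now the snapshot has to be consistent. */
    Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);

    return true;
}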

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#427Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#425)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 13, 2020 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

10. I got the below failure once. I have not investigated it in
detail as the patch is still in progress. See if you have any idea?
# Failed test 'check extra columns contain local defaults'
# at t/013_stream_subxact_ddl_abort.pl line 81.
# got: '2|0'
# expected: '1000|500'
# Looks like you failed 1 test of 2.
make[2]: *** [check] Error 1
make[1]: *** [check-subscription-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

I also got the failure once, and after that it did not reproduce. I
have executed it multiple times but it did not reproduce again. Are
you able to reproduce it consistently?

...
..

I think the reason for the failure is that we are not setting
remote_final_lsn in the streaming mode. I added multiple logs and
re-ran the test, and from the logs it appeared that some of the
logical WAL did not get replayed due to the below check in
should_apply_changes_for_rel:
return (rel->state == SUBREL_STATE_READY || (rel->state ==
SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));

I still need to do a detailed analysis of why this fails only in some
cases. Basically, most of the time rel->state is SUBREL_STATE_READY,
so this check passes, but whenever the state is SUBREL_STATE_SYNCDONE
it fails because we never update remote_final_lsn. I will try to set
this value in apply_handle_stream_commit and see whether it ever
fails.

I have verified that after setting remote_final_lsn in
apply_handle_stream_commit, I don't see that regression failure in
over 70 runs, whereas without that change it failed 6 times in 50
runs.

Your analysis and fix seem correct to me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#428Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#426)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 13, 2020 at 10:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

8. We can't stream the transaction before we reach the
SNAPBUILD_CONSISTENT state, because some other output plugin could
apply those changes, unlike what we do with the pgoutput plugin
(which writes to a file). And I think applying the transactions
without reaching a consistent state would be wrong anyway. So we
should avoid that, and if we do avoid it, then we should have an
Assert for streamed txns rather than sending an abort for them in
ReorderBufferForget.

I was analyzing this point. Currently, we only enable streaming in
StartReplication, so in CreateReplicationSlot the streaming will
always be off, because by that time the plugins have not yet been
started up; that happens only on StartReplication.

What do you mean by 'startup' in the above sentence? AFAICS, we do
call startup_cb_wrapper in CreateInitDecodingContext which is called
from both CreateReplicationSlot and create_logical_replication_slot
before the start of decoding. In CreateInitDecodingContext, we call
StartupDecodingContext which should load the plugin.

Yeah, you are right that we do call startup_cb_wrapper from
CreateInitDecodingContext as well. I think I got confused by the
below comment in patch 0007:

@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
WalSndPrepareWrite, WalSndWriteData,
WalSndUpdateProgress);
+ /*
+ * Make sure streaming is disabled here - we may have the methods,
+ * but we don't have anywhere to send the data yet.
+ */
+ ctx->streaming = false;
+

Basically, during CreateReplicationSlot we forcefully disable
streaming, with the comment "we don't have anywhere to send the data
yet". So during CreateReplicationSlot the streaming will always be
off, and once we are done creating the slot we will have a consistent
snapshot. My point is: while decoding, can we simply not start
streaming until the current LSN reaches the start_decoding_at point,
and only start after that? At that point we can have an assert that
the snapshot is CONSISTENT. However, before doing that I need to
check why we set ctx->streaming to false after creating the slot.

I think you can refer to the commit message as well for that: "We
however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We don't
need to replicate the changes accumulated during this phase, and
moreover, we don't have a replication connection open so we don't have
where to send the data anyway.". I don't think this is a good way to
hack the streaming flag, because for the SQL APIs we don't have a good
reason to disable the streaming in this way. I guess if we had a
condition related to reaching a CONSISTENT snapshot during streaming,
then we wouldn't need to hack the streaming flag in this way. Once we
reach the CONSISTENT snapshot state, we come out of the replication
slot creation phase (see how we use DecodingContextReady to achieve
that). So, I feel we should remove the setting of ctx->streaming to
false and add a CONSISTENT snapshot check during streaming, unless
you have a reason for not doing so.

I was worried about the point that streaming on/off is sent by the
subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
we keep streaming on during create then it may not be right. But I
agree with your point that it's better to avoid streaming during slot
creation via a CONSISTENT snapshot check instead of disabling it this
way. And anyway, as soon as we reach the consistent snapshot we stop
processing further records, so we will not attempt to stream during
slot creation.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#429Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#428)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think you can refer to the commit message as well for that: "We
however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We don't
need to replicate the changes accumulated during this phase, and
moreover, we don't have a replication connection open so we don't have
where to send the data anyway.". I don't think this is a good way to
hack the streaming flag, because for the SQL APIs we don't have a good
reason to disable the streaming in this way. I guess if we had a
condition related to reaching a CONSISTENT snapshot during streaming,
then we wouldn't need to hack the streaming flag in this way. Once we
reach the CONSISTENT snapshot state, we come out of the replication
slot creation phase (see how we use DecodingContextReady to achieve
that). So, I feel we should remove the setting of ctx->streaming to
false and add a CONSISTENT snapshot check during streaming, unless
you have a reason for not doing so.

I was worried about the point that streaming on/off is sent by the
subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
we keep streaming on during create then it may not be right.

Then, how is that used on the publisher side? AFAICS, the streaming
is enabled based on whether streaming callbacks are provided, and we
do that in the 0003-Extend-the-logical-decoding-output-plugin-API-wi
patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#430Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#429)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think you can refer to the commit message as well for that: "We
however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We don't
need to replicate the changes accumulated during this phase, and
moreover, we don't have a replication connection open so we don't have
where to send the data anyway.". I don't think this is a good way to
hack the streaming flag, because for the SQL APIs we don't have a good
reason to disable the streaming in this way. I guess if we had a
condition related to reaching a CONSISTENT snapshot during streaming,
then we wouldn't need to hack the streaming flag in this way. Once we
reach the CONSISTENT snapshot state, we come out of the replication
slot creation phase (see how we use DecodingContextReady to achieve
that). So, I feel we should remove the setting of ctx->streaming to
false and add a CONSISTENT snapshot check during streaming, unless
you have a reason for not doing so.

I was worried about the point that streaming on/off is sent by the
subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
we keep streaming on during create then it may not be right.

Then, how is that used on the publisher side? AFAICS, the streaming
is enabled based on whether streaming callbacks are provided, and we
do that in the 0003-Extend-the-logical-decoding-output-plugin-API-wi
patch.

Basically, we first enable it based on whether we have the callbacks
or not, but later, once we get the START REPLICATION command from the
subscriber, we set it to false if streaming is not enabled on the
subscriber side. You can refer to the below code in patch 0007.

pgoutput_startup
{
parse_output_parameters(ctx->output_plugin_options,
&data->protocol_version,
- &data->publication_names);
+ &data->publication_names,
+ &enable_streaming);
/* Check if we support requested protocol */
if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("publication_names parameter missing")));
+ /*
+ * Decide whether to enable streaming. It is disabled by default, in
+ * which case we just update the flag in decoding context. Otherwise
+ * we only allow it with sufficient version of the protocol, and when
+ * the output plugin supports it.
+ */
+ if (!enable_streaming)
+ ctx->streaming = false;
}

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#431Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#430)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 13, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think you can refer to the commit message as well for that: "We
however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We don't
need to replicate the changes accumulated during this phase, and
moreover, we don't have a replication connection open so we don't have
where to send the data anyway.". I don't think this is a good way to
hack the streaming flag, because for the SQL APIs we don't have a good
reason to disable the streaming in this way. I guess if we had a
condition related to reaching a CONSISTENT snapshot during streaming,
then we wouldn't need to hack the streaming flag in this way. Once we
reach the CONSISTENT snapshot state, we come out of the replication
slot creation phase (see how we use DecodingContextReady to achieve
that). So, I feel we should remove the setting of ctx->streaming to
false and add a CONSISTENT snapshot check during streaming, unless
you have a reason for not doing so.

I was worried about the point that streaming on/off is sent by the
subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
we keep streaming on during create then it may not be right.

Then, how is that used on the publisher side? AFAICS, the streaming
is enabled based on whether streaming callbacks are provided, and we
do that in the 0003-Extend-the-logical-decoding-output-plugin-API-wi
patch.

Basically, we first enable it based on whether we have the callbacks
or not, but later, once we get the START REPLICATION command from the
subscriber, we set it to false if streaming is not enabled on the
subscriber side. You can refer to the below code in patch 0007.

pgoutput_startup
{
parse_output_parameters(ctx->output_plugin_options,
&data->protocol_version,
- &data->publication_names);
+ &data->publication_names,
+ &enable_streaming);
/* Check if we support requested protocol */
if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("publication_names parameter missing")));
+ /*
+ * Decide whether to enable streaming. It is disabled by default, in
+ * which case we just update the flag in decoding context. Otherwise
+ * we only allow it with sufficient version of the protocol, and when
+ * the output plugin supports it.
+ */
+ if (!enable_streaming)
+ ctx->streaming = false;
}

Okay, in that case, we can both enable and disable streaming in this
function itself rather than allow the caller to modify it later. I
suggest we can similarly enable/disable it for the SQL API in
pg_decode_startup via output_plugin_options. This way it will look
consistent for both the SQL APIs and command-based replication. If we
can do so, then adding an Assert for a CONSISTENT snapshot while
performing streaming should probably be okay.
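
A sketch of how that could look in test_decoding's pg_decode_startup,
assuming a hypothetical "stream-changes" option (parse_bool, DefElem,
and ereport are existing core APIs):

static void
parse_streaming_option(List *options, bool *enable_streaming)
{
    ListCell   *lc;

    foreach(lc, options)
    {
        DefElem    *elem = (DefElem *) lfirst(lc);

        if (strcmp(elem->defname, "stream-changes") != 0)
            continue;

        if (elem->arg == NULL)
            *enable_streaming = true;    /* bare option means "on" */
        else if (!parse_bool(strVal(elem->arg), enable_streaming))
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("could not parse value \"%s\" for parameter \"%s\"",
                            strVal(elem->arg), elem->defname)));
    }
}

pg_decode_startup() could then do "ctx->streaming &= enable_streaming;"
so that streaming happens only when the plugin supports it and the
caller asked for it.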

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#432Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#431)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 13, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think you can refer to the commit message as well for that: "We
however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We don't
need to replicate the changes accumulated during this phase, and
moreover, we don't have a replication connection open so we don't have
where to send the data anyway.". I don't think this is a good way to
hack the streaming flag, because for the SQL APIs we don't have a good
reason to disable the streaming in this way. I guess if we had a
condition related to reaching a CONSISTENT snapshot during streaming,
then we wouldn't need to hack the streaming flag in this way. Once we
reach the CONSISTENT snapshot state, we come out of the replication
slot creation phase (see how we use DecodingContextReady to achieve
that). So, I feel we should remove the setting of ctx->streaming to
false and add a CONSISTENT snapshot check during streaming, unless
you have a reason for not doing so.

I was worried about the point that streaming on/off is sent by the
subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
we keep streaming on during create then it may not be right.

Then, how is that used on the publisher side? AFAICS, the streaming
is enabled based on whether streaming callbacks are provided, and we
do that in the 0003-Extend-the-logical-decoding-output-plugin-API-wi
patch.

Basically, we first enable it based on whether we have the callbacks
or not, but later, once we get the START REPLICATION command from the
subscriber, we set it to false if streaming is not enabled on the
subscriber side. You can refer to the below code in patch 0007.

pgoutput_startup
{
parse_output_parameters(ctx->output_plugin_options,
&data->protocol_version,
- &data->publication_names);
+ &data->publication_names,
+ &enable_streaming);
/* Check if we support requested protocol */
if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("publication_names parameter missing")));
+ /*
+ * Decide whether to enable streaming. It is disabled by default, in
+ * which case we just update the flag in decoding context. Otherwise
+ * we only allow it with sufficient version of the protocol, and when
+ * the output plugin supports it.
+ */
+ if (!enable_streaming)
+ ctx->streaming = false;
}

Okay, in that case, we can both enable and disable streaming in this
function itself rather than allow the caller to modify it later. I
suggest we can similarly enable/disable it for the SQL API in
pg_decode_startup via output_plugin_options. This way it will look
consistent for both the SQL APIs and command-based replication. If we
can do so, then adding an Assert for a CONSISTENT snapshot while
performing streaming should probably be okay.

Sounds good to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#433Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#432)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 13, 2020 at 4:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, in that case, we can both enable and disable streaming in this
function itself rather than allow the caller to modify it later. I
suggest we can similarly enable/disable it for the SQL API in
pg_decode_startup via output_plugin_options. This way it will look
consistent for both the SQL APIs and command-based replication. If we
can do so, then adding an Assert for a CONSISTENT snapshot while
performing streaming should probably be okay.

Sounds good to me.

Please find the latest patches. I have made changes only in the
subscriber-side patches (0007 and 0008 as per the current patch set).
The main changes are:
1. As discussed above, removed the send_feedback call from
apply_handle_stream_commit.
2. In SharedFileSetInit, ensured the callback is registered only once
(see the sketch below).
3. In stream_open_file, slightly changed the handling around
MemoryContexts.
4. Merged the subscriber-side patches.
5. Added/edited comments in 0007 and 0008.
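
For item 2, a minimal sketch of the register-once pattern (the static
flag and the SharedFileSetDeleteOnProcExit callback name are
assumptions for illustration, not necessarily what the patch does):

/* Register the process-exit cleanup callback only once per backend. */
static bool cleanup_registered = false;

void
SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
{
    /* ... existing initialization of the fileset ... */

    if (!cleanup_registered)
    {
        on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
        cleanup_registered = true;
    }
}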

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v32.tar (application/x-tar)
v32-0001-Immediately-WAL-log-subtransaction-and-top-level.patch

From 437994738124a4fbf07a8148c6d89f34fc2edc05 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v32 01/12] Immediately WAL-log subtransaction and top-level
 XID association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead) only when wal_level=logical.
We cannot remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that
is required to avoid overflow in the hot standby snapshot.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 44 ++++++++++++++--------------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b3ee7fa..bd4c3cf 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5120,6 +5122,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6022,3 +6025,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL log the top-level XID for
+ * operation in a subtransaction.  We require that for logical decoding, see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f..c526bb1 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4..a757bac 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1197,6 +1197,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1235,6 +1236,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..0c0c371 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index db19187..aef8555 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5b14334..d8391aa 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b0f2a6e..b976882 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -304,6 +306,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

v32-0002-WAL-Log-invalidations-at-command-end-with-wal_le.patch

From f72d77f52beb1335d3e245391ef85665d950be65 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v32 02/12] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end uses a new xlog record type
XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in the top-level transaction, and then
executed during replay.  This obviates the need to decode the
invalidations as part of a commit record.

LogStandbyInvalidations accumulated all the invalidations in memory
and wrote them only once at commit time, which may have reduced the
performance impact by amortizing the overhead and deduplicating the
invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 10 +++++
 src/backend/access/transam/xact.c               | 14 ++++++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 52 ++++++++++++++++++----
 src/backend/utils/cache/inval.c                 | 54 +++++++++++++++++++++++
 src/include/access/xact.h                       | 13 +++++-
 src/include/replication/reorderbuffer.h         |  3 ++
 src/include/utils/inval.h                       |  2 +
 8 files changed, 173 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..68aa994 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd4c3cf..cd24359 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1224,6 +1224,13 @@ RecordTransactionCommit(void)
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
 
+	/*
+	 * Log any pending invalidations which are added between the last
+	 * command counter increment and the commit.
+	 */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
 	nchildren = xactGetCommittedChildren(&children);
@@ -6022,6 +6029,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..7153eba 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions,
+				 * otherwise, accumulate them so that they can be processed at
+				 * the commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5251932..1661190 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -856,6 +856,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2201,7 +2204,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases where we skip processing the
+ * transaction (see ReorderBufferForget), we need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2212,17 +2219,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2250,6 +2275,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark top-level transaction as having catalog changes too if one of its
+	 * children has so that the ReorderBufferBuildTupleCidHash can
+	 * conveniently check just top-level transaction and decide whether to
+	 * build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..e3fa723 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *      CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -1094,6 +1098,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1510,48 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+void
+LogLogicalInvalidations()
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..ac3f5e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 019bd38..1055e99 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index bc5081c..463888c 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -61,4 +61,6 @@ extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
+
+extern void LogLogicalInvalidations(void);
 #endif							/* INVAL_H */
-- 
1.8.3.1

v32-0003-Extend-the-logical-decoding-output-plugin-API-wi.patch

From f98fdd969536a0b2dde27cd6b2a8207a2c17f932 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v32 03/12] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, adding support for
streaming changes for large transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 123 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 +++++++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  59 +++++
 6 files changed, 825 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..4a71826 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,97 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93c..18116c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     callbacks are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may exceed the memory limit before the
+    complete tuple has been decoded (for example, we may have decoded the
+    TOAST table insert but not yet the corresponding main table insert).
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
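
To make the documented contract concrete, here is a minimal sketch of how an
output plugin would opt into streaming. This is illustrative only and not part
of the patch: the my_stream_* names are hypothetical, and their bodies are
assumed to be defined elsewhere in the module.

/* sketch: registering the five required streaming callbacks */
#include "postgres.h"
#include "fmgr.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

/* callback bodies assumed to be defined elsewhere in the module */
extern void my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn);
extern void my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn);
extern void my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							XLogRecPtr abort_lsn);
extern void my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							 XLogRecPtr commit_lsn);
extern void my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							 Relation relation, ReorderBufferChange *change);

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* regular callbacks (begin_cb, change_cb, commit_cb) would be set here too */

	/* setting any stream_*_cb enables streaming; these five are required */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;

	/* stream_message_cb and stream_truncate_cb are optional; left NULL here */
}

Note that, per the logic in StartupDecodingContext below, defining only a
subset of the required callbacks still sets ctx->streaming, and the missing
method is then reported with an ERROR from the corresponding wrapper.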
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be..6ee59bd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. We however enable streaming when at least one
+	 * of the methods is enabled, so that we can easily identify missing
+	 * methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475..deef318 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..2d9aa11 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from an in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to the remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to the remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1055e99..42bc817 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -387,6 +435,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

v32-0004-Gracefully-handle-concurrent-aborts-of-transacti.patch

From f00e48115c804b5c798021ef53d46a42a3bf7129 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:49:40 +0530
Subject: [PATCH v32 04/12] Gracefully handle concurrent aborts of transactions
 being decoded.

When decoding committed transactions this is not an issue, and we never
decode transactions that abort before the decoding starts.

But for an upcoming patch that allows decoding of in-progress
transactions, this may cause failures when the output plugin consults
catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. On receipt of such an
sqlerrcode, the decoding logic aborts the ongoing decoding and returns
gracefully.

Author: Dilip Kumar, Nikhil Sontakke, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/logicaldecoding.sgml         |  9 +++--
 src/backend/access/heap/heapam.c          | 10 ++++++
 src/backend/access/index/genam.c          | 53 +++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c        |  8 +++++
 src/backend/access/transam/xact.c         | 19 +++++++++++
 src/backend/replication/logical/logical.c | 10 ++++++
 src/include/access/tableam.h              | 55 +++++++++++++++++++++++++++++++
 src/include/access/xact.h                 |  4 +++
 src/include/replication/logical.h         |  1 +
 9 files changed, 166 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 18116c8..98b47b0 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7bd4570..b53f99a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at tableam
+	 * level API but this is called from many places so we need to ensure it
+	 * here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted. We can't directly use
+ * TransactionIdDidAbort as after crash such transaction might not have been
+ * marked as aborted.  See detailed comments in xact.c where the variable
+ * is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29..a61e279 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -235,6 +235,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd24359..f8cf3bf 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing in
+ * which case we skip decoding that particular transaction.  To ensure this,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs because we are checking for the
+ * concurrent aborts only in systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2677,6 +2690,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4979,6 +4995,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6ee59bd..8deff89 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b3d2a6d..acb6c38 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1712,6 +1737,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1729,6 +1762,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1747,6 +1788,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1763,6 +1811,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index ac3f5e3..5f767eb 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
-- 
1.8.3.1
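
For context on how the new sqlerrcode is meant to be consumed: the streaming
code added in the next patch wraps its change-processing loop so that this
specific error is swallowed and everything else is re-thrown. A condensed
sketch of that pattern (simplified, with a hypothetical StreamTXNBody helper;
the real code in reorderbuffer.c also restores resource owner and other
state):

/* sketch: detecting a concurrent abort while streaming a transaction */
#include "postgres.h"
#include "replication/reorderbuffer.h"

/* hypothetical helper doing the actual per-change streaming */
extern void StreamTXNBody(ReorderBuffer *rb, ReorderBufferTXN *txn);

static void
StreamTXNGuarded(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	MemoryContext oldcxt = CurrentMemoryContext;

	PG_TRY();
	{
		/* catalog scans in here may raise ERRCODE_TRANSACTION_ROLLBACK */
		StreamTXNBody(rb, txn);
	}
	PG_CATCH();
	{
		ErrorData  *errdata;

		/* CopyErrorData() must not run in ErrorContext */
		MemoryContextSwitchTo(oldcxt);
		errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/*
			 * The transaction being streamed aborted concurrently: discard
			 * the error and stop; decoding the abort record will later send
			 * stream_abort downstream.
			 */
			FlushErrorState();
			FreeErrorData(errdata);
		}
		else
			PG_RE_THROW();
	}
	PG_END_TRY();
}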

v32-0005-Implement-streaming-mode-in-ReorderBuffer.patch

From a3d70dd34805535d8f93f208fafe68d27f02c133 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 08:40:51 +0530
Subject: [PATCH v32 05/12] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast or speculative insert, we spill to disk because we cannot
generate the complete tuple to stream.  As soon as we get the complete
tuple, we stream the transaction, including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

We have ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/replication/logical/reorderbuffer.c | 765 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  26 +
 3 files changed, 756 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1661190..fcdc91f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +378,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -764,6 +778,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1064,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1081,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1310,6 +1362,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1396,84 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is,
+	 * when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1624,171 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set the xid used to detect concurrent aborts.
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  If the (sub)transaction
+ * made catalog changes, we might then decode tuples using the wrong catalog
+ * version.  To detect a concurrent abort, we set CheckXidAlive to the xid of
+ * the (sub)transaction that the current change belongs to.  During catalog
+ * scans we check the status of that xid, and if it has aborted we report a
+ * specific error so that we can stop streaming the current transaction and
+ * discard the changes streamed so far.  We might already have streamed some
+ * of the aborted (sub)transaction's changes, but that is fine: when we decode
+ * the abort we send a stream-abort message to truncate those changes on the
+ * subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
-	}
 
-	snapshot_now = txn->base_snapshot;
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid aborted; that will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
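
/*
 * For illustration only (not part of this patch): the check the catalog
 * access code is expected to perform when CheckXidAlive is set might look
 * roughly like the sketch below.  The function name is made up; only the
 * error code matches the one ReorderBufferProcessTXN's PG_CATCH block
 * tests for.
 */
static inline void
SketchCheckConcurrentAbort(void)
{
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}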
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying a change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying a truncate.
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying a message.
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Store the command id and snapshot at the end of the current stream, so
+ * that we can reuse them when sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN such that it
+ * can be used to stream the remaining data of the transaction being
+ * processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+	ReorderBufferToastReset(rb, txn);
+	if (specinsert != NULL)
+		ReorderBufferReturnChange(rb, specinsert);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1811,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1827,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream_start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +1929,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1689,7 +1970,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1747,7 +2028,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2040,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2071,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1845,14 +2125,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, send the last message for this set of
+		 * changes depending upon streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2171,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2210,105 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not yet finished.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
 
-		PG_RE_THROW();
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
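
/*
 * In short (restating the code above, nothing new): the commit path now
 * branches three ways.
 *
 *   rbtxn_is_streamed(txn)      -> stream the remainder, send stream_commit,
 *                                  then clean up
 *   txn->base_snapshot == NULL  -> nothing was decoded, just clean up
 *   otherwise                   -> ReorderBufferProcessTXN() replays the
 *                                  changes via begin / apply_change / commit
 */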
 
 /*
@@ -1931,6 +2335,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could load
+		 * the cache as per the current transaction's view (consider DDL
+		 * statements that happened in this transaction).  We don't want the
+		 * decoding of future transactions to use those cache entries, so
+		 * execute the invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2420,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2135,8 +2559,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - one in the reorder buffer, and one in the
+ * transaction containing the change.  The reorder buffer counter allows us
+ * to quickly decide if we've reached the memory limit; the transaction
+ * counter allows us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2577,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2155,19 +2589,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2388,6 +2832,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming is supported, so their size
+ * is always 0), but here we can simply iterate over the limited number of
+ * toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +2895,38 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStream(rb))
+		{
+			/*
+			 * Pick the largest toplevel transaction and evict it from memory
+			 * by streaming the already decoded part.
+			 */
+			txn = ReorderBufferLargestTopTXN(rb);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2713,6 +3216,113 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all subtransactions to the
+	 * snapshot's xip array via SnapBuildCommittedTxn, we cannot do that here;
+	 * instead we add them to the subxip array via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in subtransactions decoded so far
+	 * to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using a snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because after the last
+		 * streaming run we might have gotten some new subtransactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
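
/*
 * For reference (my reading of the code above, not a quote from the patch):
 * the callback sequence an output plugin observes for a transaction that is
 * streamed twice and then committed is
 *
 *   stream_start(txn)        <- first run, snapshot built from base_snapshot
 *   stream_change(txn, ...)  <- repeated for each decoded change
 *   stream_stop(txn)
 *   stream_start(txn)        <- later run, txn->snapshot_now/command_id reused
 *   stream_change(txn, ...)
 *   stream_stop(txn)
 *   stream_commit(txn, lsn)  <- or stream_abort(txn, lsn) on abort
 */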
+
 /*
  * Size of a change in memory.
  */
@@ -3812,6 +4422,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples whose CID we
+	 * have not decoded yet.  Think e.g. about an INSERT followed by a
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases we assume the CID is from a future command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..ae1759f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +182,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +268,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
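
To make the streaming callbacks concrete, here is a toy stream_start
implementation in the style of test_decoding. This is only a sketch under
the assumption that the plugin-facing callbacks follow the usual
(LogicalDecodingContext *, ReorderBufferTXN *) shape; it is not taken from
the patch:

static void
toy_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "opening a streamed block for transaction %u",
					 txn->xid);
	OutputPluginWrite(ctx, true);
}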

v32-0006-Bugfix-handling-of-incomplete-toast-spec-insert.patch

From 5b99bddd8cf643a69aa92c4b7ea4aeaab12f9da1 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 08:41:47 +0530
Subject: [PATCH v32 06/12] Bugfix handling of incomplete toast/spec insert.

---
 src/backend/access/heap/heapam.c                |   3 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/reorderbuffer.c | 441 ++++++++++++++++++------
 src/include/access/heapam_xlog.h                |   1 +
 src/include/replication/reorderbuffer.h         |  50 ++-
 5 files changed, 395 insertions(+), 117 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b53f99a..e09c810 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7153eba..2010d5a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fcdc91f..4a94766 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -237,7 +252,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool partial_truncate);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -432,62 +448,71 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 /*
  * Free an ReorderBufferChange.
  */
-void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+static void
+ReorderBufferFreeChange(ReorderBuffer *rb, ReorderBufferChange *change)
 {
-	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
-
 	/* free contained data */
 	switch (change->action)
 	{
-		case REORDER_BUFFER_CHANGE_INSERT:
-		case REORDER_BUFFER_CHANGE_UPDATE:
-		case REORDER_BUFFER_CHANGE_DELETE:
-		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
-			if (change->data.tp.newtuple)
-			{
-				ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
-				change->data.tp.newtuple = NULL;
-			}
+	case REORDER_BUFFER_CHANGE_INSERT:
+	case REORDER_BUFFER_CHANGE_UPDATE:
+	case REORDER_BUFFER_CHANGE_DELETE:
+	case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+		if (change->data.tp.newtuple)
+		{
+			ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
+			change->data.tp.newtuple = NULL;
+		}
 
-			if (change->data.tp.oldtuple)
-			{
-				ReorderBufferReturnTupleBuf(rb, change->data.tp.oldtuple);
-				change->data.tp.oldtuple = NULL;
-			}
-			break;
-		case REORDER_BUFFER_CHANGE_MESSAGE:
-			if (change->data.msg.prefix != NULL)
-				pfree(change->data.msg.prefix);
-			change->data.msg.prefix = NULL;
-			if (change->data.msg.message != NULL)
-				pfree(change->data.msg.message);
-			change->data.msg.message = NULL;
-			break;
-		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
-			if (change->data.snapshot)
-			{
-				ReorderBufferFreeSnap(rb, change->data.snapshot);
-				change->data.snapshot = NULL;
-			}
-			break;
-			/* no data in addition to the struct itself */
-		case REORDER_BUFFER_CHANGE_TRUNCATE:
-			if (change->data.truncate.relids != NULL)
-			{
-				ReorderBufferReturnRelids(rb, change->data.truncate.relids);
-				change->data.truncate.relids = NULL;
-			}
-			break;
-		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
-		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
-		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-			break;
+		if (change->data.tp.oldtuple)
+		{
+			ReorderBufferReturnTupleBuf(rb, change->data.tp.oldtuple);
+			change->data.tp.oldtuple = NULL;
+		}
+		break;
+	case REORDER_BUFFER_CHANGE_MESSAGE:
+		if (change->data.msg.prefix != NULL)
+			pfree(change->data.msg.prefix);
+		change->data.msg.prefix = NULL;
+		if (change->data.msg.message != NULL)
+			pfree(change->data.msg.message);
+		change->data.msg.message = NULL;
+		break;
+	case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+		if (change->data.snapshot)
+		{
+			ReorderBufferFreeSnap(rb, change->data.snapshot);
+			change->data.snapshot = NULL;
+		}
+		break;
+		/* no data in addition to the struct itself */
+	case REORDER_BUFFER_CHANGE_TRUNCATE:
+		if (change->data.truncate.relids != NULL)
+		{
+			ReorderBufferReturnRelids(rb, change->data.truncate.relids);
+			change->data.truncate.relids = NULL;
+		}
+		break;
+	case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+	case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+	case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		break;
 	}
 
 	pfree(change);
 }
+
+/*
+ * Free a ReorderBufferChange and update memory accounting.
+ */
+void
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+{
+	/* update memory accounting info */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
+	/* free contained data */
+	ReorderBufferFreeChange(rb, change);
+}
 
 /*
  * Get a fresh ReorderBufferTupleBuf fitting at least a tuple of size
@@ -638,16 +663,104 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
+ * Handle incomplete tuples during streaming.  If streaming is enabled, we
+ * might need to stream an in-progress transaction, but sometimes we get
+ * incomplete changes that we cannot stream until the complete change
+ * arrives, e.g. a toast-table insert without the corresponding main-table
+ * insert.  So this function remembers the LSN of the last complete change,
+ * and the size of the changes up to that LSN, so that if we need to stream
+ * we can stream only up to the last complete LSN.
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert, Size total_size)
+{
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * If this is the first incomplete change, remember the size of the
+	 * complete changes accumulated so far.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		(toast_insert || IsSpecInsert(change->action)))
+		toptxn->complete_size = total_size;
+
+	/*
+	 * If this is a toast insert, set the corresponding bit.  Both inserts and
+	 * updates on the main table perform inserts into the toast table, and as
+	 * explained in the function header we cannot stream toast-only changes.
+	 * So we set the flag on a toast insert and clear it again on the next
+	 * insert or update on the main table.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec-insert bit whenever we get a speculative insert, to
+	 * indicate a partial tuple, and clear it again on the speculative
+	 * confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If we don't have any incomplete change after this one, remember this
+	 * LSN as the last complete LSN.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)))
+	{
+		toptxn->last_complete_lsn = change->lsn;
+
+		/*
+		 * If the transaction is serialized and its changes are now complete
+		 * at the top level, stream it immediately.  We don't wait for the
+		 * memory limit here: if the transaction got serialized in streaming
+		 * mode, we had already reached the memory limit but could not stream
+		 * at that point because of an incomplete tuple, so we stream as soon
+		 * as the tuple is complete.  Also, if we didn't stream the serialized
+		 * changes and then received further incomplete changes, we would have
+		 * no way to partially truncate the serialized changes.
+		 */
+		if (rbtxn_is_serialized(txn))
+			ReorderBufferStreamTXN(rb, toptxn);
+	}
+}
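
/*
 * Worked example (illustrative, not from the patch): an INSERT that toasts
 * one column arrives as
 *
 *   change 1: INSERT into pg_toast.pg_toast_<oid>  (toast_insert = true)
 *   change 2: INSERT into pg_toast.pg_toast_<oid>  (toast_insert = true)
 *   change 3: INSERT into the main table           (clears the flag)
 *
 * After change 1, RBTXN_HAS_TOAST_INSERT is set and complete_size still
 * reflects the total size before change 1; after change 3 the flag is
 * cleared and last_complete_lsn advances to change 3's LSN.  A stream
 * triggered between changes 1 and 3 therefore sends nothing past the last
 * complete change, keeping toast chunks and their owning row together.
 */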
+
+/*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
+	Size	total_size = 0;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes we detected that the transaction
+	 * was aborted, so there is no point in collecting further changes for
+	 * it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		ReorderBufferFreeChange(rb, change);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -656,9 +769,28 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * Get the total size of the top transaction before updating the size for
+	 * the current change, so that if this change is incomplete we know the
+	 * size prior to it.  That is used to update the size of the complete
+	 * changes in the top transaction for streaming.
+	 */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			total_size = txn->toptxn->total_size;
+		else
+			total_size = txn->total_size;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* Handle the incomplete tuple, if streaming is enabled. */
+	if (ReorderBufferCanStream(rb))
+		ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert,
+										   total_size);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -688,7 +820,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1399,11 +1531,46 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * Discard changes from a transaction (and subtransactions), after streaming
  * them. Keep the remaining info - transactions, tuplecids, invalidations and
  * snapshots.
+ *
+ * If partial_truncate is false, we truncate the transaction completely;
+ * otherwise we truncate only up to last_complete_lsn.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 bool partial_truncate)
 {
 	dlist_mutable_iter iter;
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * A serialized transaction should never be partially truncated, because
+	 * if it is serialized we stream it as soon as its changes become
+	 * complete.
+	 */
+	Assert(!(rbtxn_is_serialized(txn) && partial_truncate));
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1420,7 +1587,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, partial_truncate);
 	}
 
 	/* cleanup changes in the toplevel txn */
@@ -1433,6 +1600,14 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
+		/* We have truncated up to the last complete LSN, so stop. */
+		if (partial_truncate && (change->lsn > toptxn->last_complete_lsn))
+		{
+			/* The transaction must have incomplete changes. */
+			Assert(rbtxn_has_incomplete_tuple(toptxn));
+			break;
+		}
+
 		/* remove the change from its containing list */
 		dlist_delete(&change->node);
 
@@ -1440,24 +1615,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The toplevel transaction, identified by (toptxn==NULL), is marked as
-	 * streamed always, even if it does not contain any changes (that is, when
-	 * all the changes are in subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
-	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
 	 * values, but this seems simpler and good enough for now.
@@ -1468,9 +1625,39 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
-	/* also reset the number of entries in the transaction */
-	txn->nentries_mem = 0;
-	txn->nentries = 0;
+	/*
+	 * Adjust nentries/nentries_mem based on the changes processed.  See
+	 * comments where nprocessed is declared.
+	 */
+	if (partial_truncate)
+	{
+		txn->nentries -= txn->nprocessed;
+		txn->nentries_mem -= txn->nprocessed;
+	}
+	else
+	{
+		txn->nentries = 0;
+		txn->nentries_mem = 0;
+	}
+	txn->nprocessed = 0;
+
+	/*
+	 * If this is a top transaction, we can reset last_complete_lsn and
+	 * complete_size, because by now we will have streamed all the changes
+	 * up to last_complete_lsn.
+	 */
+	if (partial_truncate && (txn->toptxn == NULL))
+	{
+		toptxn->last_complete_lsn = InvalidXLogRecPtr;
+		toptxn->complete_size = 0;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
 }
 
 /*
@@ -1754,7 +1941,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed. */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Stop the stream. */
 	rb->stream_stop(rb, txn, last_lsn);
@@ -1789,6 +1976,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
 	ReorderBufferChange *volatile specinsert = NULL;
 	volatile bool stream_started = false;
+	volatile bool	partial_truncate = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1851,7 +2040,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			/* Set the xid for concurrent abort check. */
 			if (streaming)
-				SetupCheckXidLive(change->txn->xid);
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
 
 			switch (change->action)
 			{
@@ -2108,6 +2300,27 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
 			}
+
+			if (streaming)
+			{
+				/*
+				 * Increment the nprocessed count.  See the detailed comment
+				 * for usage of this in ReorderBufferTXN structure.
+				 */
+				curtxn->nprocessed++;
+
+				/*
+				 * If the transaction contains an incomplete tuple and this is
+				 * the last complete change, stop further processing of the
+				 * transaction and set the partial-truncate flag.
+				 */
+				if (rbtxn_has_incomplete_tuple(txn) &&
+					prev_lsn == txn->last_complete_lsn)
+				{
+					partial_truncate = true;
+					break;
+				}
+			}
 		}
 
 		/*
@@ -2179,7 +2392,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, partial_truncate);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2228,6 +2441,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
+			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2506,7 +2720,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2555,7 +2769,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2578,6 +2792,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2592,8 +2807,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2601,12 +2821,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2848,18 +3076,28 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+	Size		largest_size = 0;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
+		Size		size = 0;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/*
+		 * If this transaction has some incomplete changes, consider only the
+		 * size up to the last complete LSN.
+		 */
+		if (rbtxn_has_incomplete_tuple(txn))
+			size = txn->complete_size;
+		else
+			size = txn->total_size;
+
+		/* If the current transaction is larger, remember it. */
+		if ((largest == NULL || size > largest_size) && size > 0)
+		{
 			largest = txn;
+			largest_size = size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2897,18 +3135,13 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		 * Pick the largest transaction (or subtransaction) and evict it from
 		 * memory by streaming, if supported. Otherwise, spill to disk.
 		 */
-		if (ReorderBufferCanStream(rb))
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
 		{
-			/*
-			 * Pick the largest toplevel transaction and evict it from memory
-			 * by streaming the already decoded part.
-			 */
-			txn = ReorderBufferLargestTopTXN(rb);
-
 			/* we know there has to be one, because the size is not zero */
 			Assert(txn && !txn->toptxn);
-			Assert(txn->size > 0);
-			Assert(rb->size >= txn->size);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
 			ReorderBufferStreamTXN(rb, txn);
 		}
@@ -2926,14 +3159,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(rb->size >= txn->size);
 
 			ReorderBufferSerializeTXN(rb, txn);
-		}
 
-		/*
-		 * After eviction, the transaction should have no entries in memory,
-		 * and should use 0 bytes for changes.
-		 */
-		Assert(txn->size == 0);
-		Assert(txn->nentries_mem == 0);
+			/*
+			 * After eviction, the transaction should have no entries in
+			 * memory, and should use 0 bytes for changes.
+			 */
+			Assert(txn->size == 0);
+			Assert(txn->nentries_mem == 0);
+		}
 	}
 
 	/* We must be under the memory limit now. */
@@ -3317,10 +3550,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/* Process and send the changes to output plugin. */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
-
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
 }
 
 /*
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ae1759f..cbed1e8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -163,6 +163,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -182,6 +184,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert without the main-table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -190,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -339,6 +357,26 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of the top transaction, including subtransactions. */
+	Size		total_size;
+
+	/* Size of the complete changes. */
+	Size		complete_size;
+
+	/* LSN of the last complete change. */
+	XLogRecPtr	last_complete_lsn;
+
+	/*
+	 * Number of changes processed.  This is used to keep track of changes
+	 * that remain to be streamed.  As of now, this can happen either due to
+	 * toast tuples or speculative insertions, where we need to wait for
+	 * multiple changes before we can send them.
+	 */
+	uint64		nprocessed;
+
+	/* If we have detected concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -515,7 +553,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

v32-0007-Extend-the-BufFile-interface-for-the-streaming-o.patch

From 3ac666268832074dcdff594d4a2a938450ffbb86 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v32 07/12] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement the interface for BufFileTruncate interface to allow files to be
truncated up to a particular offset.  Extend BufFileSeek API to support
SEEK_END case.  Add an option to provide a mode while opening the shared
BufFiles instead of always opening in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2..6c97f68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349..c08ff4f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as a
+ * member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY);
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the file size of the last file to get the last offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files, down to the fileno at which we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files other than the one at fileno can be deleted directly.  The
+		 * file at fileno itself can be deleted too if the offset is 0,
+		 * unless it is the first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by a single backend when temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but need to be opened and closed multiple times, and the
+ * underlying files need to survive across transactions.  For such cases, the
+ * dsm segment 'seg' should be passed as NULL.  Such files are removed on
+ * proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering the
+			 * fileset cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on process exit.  It processes the
+ * list of all registered sharedfilesets and deletes the underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is using dsm-based cleanup then we don't maintain the
+	 * filesetlist, so there is nothing to do.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

v32-0008-Add-support-for-streaming-to-built-in-replicatio.patch

From ca5b5e5c12e59bd6d964f4d352dc3a84dd4c39a5 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 08:57:16 +0530
Subject: [PATCH v32 08/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication slot
creation, even if the plugin supports it.  We don't need to replicate
the changes accumulated during this phase, and moreover we don't have
a replication connection open, so we have nowhere to send the data
anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |   4 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  45 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   3 +
 src/backend/replication/logical/proto.c            | 140 +++-
 src/backend/replication/logical/worker.c           | 917 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 318 ++++++-
 src/backend/replication/slotfuncs.c                |   6 +
 src/backend/replication/walsender.c                |   6 +
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  42 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 20 files changed, 1923 insertions(+), 41 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index c24ace1..d8de56c 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 5bbc165..c25b7c5 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731..f28482f 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026..9065a1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming_given)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68..479e3ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..83d0642 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f90a896..ea5874c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions also
+ * requires handling aborts of both the toplevel transaction and
+ * subtransactions.  This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of normal temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size limit,
+ * (b) it provides automatic cleanup on error, and (c) it allows the files to
+ * survive across local transactions and be opened and closed at each stream
+ * start and stop.  We use the SharedFileSet infrastructure because without it
+ * the files would be deleted as soon as they are closed, while keeping the
+ * stream files open across stream start/stop would consume a lot of memory
+ * (more than 8kB per file).  Moreover, without SharedFileSet we would need to
+ * invent a new way to pass filenames to the BufFile APIs so that the desired
+ * file can be reopened across multiple stream open calls for the same
+ * transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, and along with it create the streaming file and store the
+ * fileset handle.  The subxact file is created iff there is any subxact info
+ * for this xid.  This entry is used on subsequent streams for the xid to look
+ * up the corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with shared file
+ * set for streaming and subxact files.  On every stream start we need to open
+ * the xid's files, and for that we need the shared file set handle.  Storing
+ * it in the xid hash makes that lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -553,6 +693,305 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; it will be committed on stream
+	 * stop.  We need the transaction for handling the BufFile, used for
+	 * serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
+	 * just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -565,6 +1004,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1022,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1061,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1179,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1324,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1697,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1838,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1493,6 +1966,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode
+	 * is enabled.  The context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1597,7 +2078,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1941,6 +2422,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for the top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so we allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info.  There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context.  We
+	 * need this information for the whole stream, so that we can add new
+	 * subtransaction info to it.  On stream stop we flush the information
+	 * to the subxact file and reset the logical streaming context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the subxact array.  We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
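
The (fileno, offset) pair recorded above is what makes subtransaction aborts
cheap on the apply side: on ROLLBACK TO SAVEPOINT the worker can discard the
aborted subxact's changes (and those of all later subxacts) by truncating the
changes file back to the position where the subxact started. A simplified
sketch of that logic, assuming the BufFileTruncateShared() helper added
earlier in this patch series:

    /* find the aborted subxact, discard its changes and all later ones */
    for (i = nsubxacts; i > 0; i--)
    {
        if (subxacts[i - 1].xid == subxid)
        {
            BufFileTruncateShared(fd,
                                  subxacts[i - 1].fileno,
                                  subxacts[i - 1].offset);
            nsubxacts = i - 1;
            break;
        }
    }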
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context, so that
+	 * we keep those files open until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
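
For reference, the xidhash entries used above look roughly like this (the
StreamXidHash struct is defined earlier in the patch; this sketch is inferred
from its usage here):

    typedef struct StreamXidHash
    {
        TransactionId  xid;             /* XID of the streamed transaction */
        SharedFileSet *stream_fileset;  /* fileset for the changes file */
        SharedFileSet *subxact_fileset; /* fileset for the subxact file,
                                         * NULL until subxact info is written */
    } StreamXidHash;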
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with a length (not counting
+ * the length field itself), an action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX Maybe we should include a CRC32C of the contents here, similar to
+ * other serialized state, but doing so is not as straightforward, because
+ * we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
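
For symmetry, this is roughly how the apply side replays the file at
stream-commit time (a simplified sketch: the error message is abbreviated,
and apply_one_change() is a stand-in for the actual dispatch on the action
byte):

    char   *buffer = palloc(BLCKSZ);
    int     len;

    while (BufFileRead(fd, &len, sizeof(len)) == sizeof(len))
    {
        /* len covers the action byte and payload, not the length itself */
        buffer = repalloc(buffer, len);

        if (BufFileRead(fd, buffer, len) != len)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not read from streaming transaction's changes file")));

        /* buffer[0] is the action, the rest is the message payload */
        apply_one_change(buffer[0], buffer + 1, len - 1);
    }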
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3020,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3..1509f9b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order in which the transactions are sent.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this,
+ * we maintain a list of XIDs (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +373,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't
+	 * care whether it's the top-level transaction or not (we have already
+	 * sent that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may not be applied until later (and the
+	 * regular transactions won't see their effects until then), and they
+	 * may be applied in an order that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +423,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +465,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +486,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +518,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +562,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +582,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +607,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +639,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -588,6 +720,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
+ * Notify downstream that a block of streamed changes for this transaction
+ * follows.
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * Notify downstream that the current block of streamed changes is complete.
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
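
Putting the four callbacks together, the messages emitted for one large
transaction look like this (each start/stop pair brackets one chunk of
decoded changes, and the sequence ends with either a commit or an abort):

    stream_start(xid)        /* first chunk: first_segment = true */
      stream_change ...      /* each message carries its (sub)xact XID */
    stream_stop()
    stream_start(xid)        /* later chunks: first_segment = false */
      stream_change ...
    stream_stop()
    ...
    stream_commit(xid)       /* or stream_abort(xid, subxid) */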
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -624,6 +841,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema was already sent for the given streamed
+ * transaction. We expect a relatively small number of streamed
+ * transactions, so a simple linear search of the list is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record the given xid in the relation sync entry, marking that we have
+ * already sent the schema of this relation in that streamed transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -753,12 +1002,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9fe147b..d93312c 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -158,6 +158,12 @@ create_logical_replication_slot(char *name, char *plugin,
 									NULL, NULL, NULL);
 
 	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
+	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5e2210d..bc36c78 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
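
These new wait events follow the standard reporting pattern around the apply
worker's changes/subxact file I/O. A sketch of the usage (the call sites are
assumed here; the actual wrapping may live elsewhere in the series):

    pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
    BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
    pgstat_report_wait_end();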
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561..89158ed 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
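
The new stream messages follow the existing wire conventions: a single
message-type byte followed by the fields. For illustration,
logicalrep_write_stream_start can be implemented along these lines (a sketch
consistent with the prototypes above, assuming the 'S' type byte used by the
patch body):

    void
    logicalrep_write_stream_start(StringInfo out, TransactionId xid,
                                  bool first_segment)
    {
        pq_sendbyte(out, 'S');  /* message type: stream start */

        Assert(TransactionIdIsValid(xid));

        /* transaction ID, then a flag marking the first segment */
        pq_sendint32(out, xid);
        pq_sendbyte(out, first_segment ? 1 : 0);
    }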
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c75dceb..56517a9 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check replicated data after mid-transaction DDL');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check streamed transaction was applied after subxact rollbacks');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check streamed transaction was applied after DDL and subxact rollbacks');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v32-0009-Enable-streaming-for-all-subscription-TAP-tests.patch

From 4471ce7a90c9fb5124a2b37dd69d02f78ffef1e6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v32 09/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f8..4ba8086 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

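A note on the test changes above: they only flip streaming = on for the
existing subscriptions; streaming itself still kicks in only once a
transaction exceeds logical_decoding_work_mem on the publisher. A minimal
sketch of a workload that would actually be streamed rather than sent in
one go at commit, assuming logical_decoding_work_mem = '64kB' in the
publisher's postgresql.conf (table name as in the tests above):

BEGIN;
-- bulk insert large enough to cross the 64kB decoding limit, so the
-- changes start flowing to the subscriber before COMMIT
INSERT INTO tab_rep SELECT generate_series(1, 100000);
COMMIT;
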
v32-0010-Add-TAP-test-for-streaming-vs.-DDL.patch

From 7ca75ca8d67e0661d31fa3e9f17498e8be434c14 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v32 10/12] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

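The interesting part of this test is interleaving DDL with streamed DML:
the apply side has to pick up the changed relation descriptor between
streamed blocks. Condensed to its essence (the full script above
additionally exercises savepoints and rollbacks):

BEGIN;
INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
ALTER TABLE test_tab ADD COLUMN c INT;
-- subsequent streamed changes must be decoded with the new column in place
INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
COMMIT;
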
v32-0011-Provide-new-api-to-get-the-streaming-changes.patch

From 35afa74035d625921ed4a4049ff09af4131ad250 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v32 11/12] Provide new api to get the streaming changes

---
 .gitignore                                     |  1 +
 doc/src/sgml/test-decoding.sgml                | 22 ++++++++++++++++++++++
 src/backend/catalog/system_views.sql           |  8 ++++++++
 src/backend/replication/logical/logicalfuncs.c | 23 ++++++++++++++++++-----
 src/include/catalog/pg_proc.dat                |  9 +++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b..6083744 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..eed6e9d 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b6d35c2..eed7b7f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1237,6 +1237,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e..70c28ff 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes then disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 95604e9..6eebfbb 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10136,6 +10136,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
1.8.3.1

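For a quick manual check of the new function, the session from the
test-decoding documentation above can be reproduced along these lines (a
sketch: the table is illustrative, and logical_decoding_work_mem on the
server needs to be small enough for the still-open transaction to be
streamed):

SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', 'test_decoding');
CREATE TABLE stream_test(data text);
BEGIN;
INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 1000) g(i);
-- still inside the open transaction: the streamed blocks are already visible
SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
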
v32-0012-Add-streaming-option-in-pg_dump.patch

From 33d939b9a97c7cf16e5363271791d503a5499333 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v32 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e758b5c..ff2ae37 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb..af64270 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char	   *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
1.8.3.1

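The effect on dump output: for a subscription with substream set, pg_dump
now includes the streaming option, so the emitted command would look
roughly like this (illustrative only; the rest of the WITH list comes
from the pre-existing subscription dump logic):

CREATE SUBSCRIPTION sub1 CONNECTION 'host=... dbname=postgres' PUBLICATION pub1 WITH (connect = false, slot_name = 'sub1', streaming = on);
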
v32.tar.gz

v32-0001-Immediately-WAL-log-subtransaction-and-top-level.patch

From 437994738124a4fbf07a8148c6d89f34fc2edc05 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v32 01/12] Immediately WAL-log subtransaction and top-level
 XID association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So, when wal_level=logical, we also write the assignment info into WAL
immediately, as part of the next WAL record (to minimize overhead). We
cannot remove the existing XLOG_XACT_ASSIGNMENT record, as it is still
required to avoid overflow in the hot standby snapshot.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 44 ++++++++++++++--------------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b3ee7fa..bd4c3cf 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5120,6 +5122,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6022,3 +6025,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL log the top-level XID for
+ * an operation in a subtransaction.  We require that for logical decoding, see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f..c526bb1 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4..a757bac 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1197,6 +1197,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1235,6 +1236,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..0c0c371 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index db19187..aef8555 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5b14334..d8391aa 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b0f2a6e..b976882 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -304,6 +306,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

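To make the behaviour concrete: with wal_level=logical, the first WAL
record written by a subtransaction now carries the top-level XID, so the
decoder can associate the two immediately rather than waiting for
XLOG_XACT_ASSIGNMENT at commit. A sketch of the case this targets (table
name illustrative):

BEGIN;                      -- top-level transaction
SAVEPOINT s1;               -- subtransaction
INSERT INTO t VALUES (1);   -- the subxact's first WAL record also carries the
                            -- top-level XID, per XLogRecordAssemble above
COMMIT;
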
v32-0002-WAL-Log-invalidations-at-command-end-with-wal_le.patch

From f72d77f52beb1335d3e245391ef85665d950be65 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v32 02/12] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end use a new xlog record type
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in the top-level
transaction, and then executed during replay.  This obviates the need to
decode the invalidations as part of a commit record.

LogStandbyInvalidations accumulates all the invalidations in memory and
writes them out only once, at commit time, which reduces the performance
impact by amortizing the overhead and deduplicating the invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 10 +++++
 src/backend/access/transam/xact.c               | 14 ++++++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 52 ++++++++++++++++++----
 src/backend/utils/cache/inval.c                 | 54 +++++++++++++++++++++++
 src/include/access/xact.h                       | 13 +++++-
 src/include/replication/reorderbuffer.h         |  3 ++
 src/include/utils/inval.h                       |  2 +
 8 files changed, 173 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..68aa994 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd4c3cf..cd24359 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1224,6 +1224,13 @@ RecordTransactionCommit(void)
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
 
+	/*
+	 * Log any pending invalidations which were added between the last
+	 * command counter increment and the commit.
+	 */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
 	nchildren = xactGetCommittedChildren(&children);
@@ -6022,6 +6029,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..7153eba 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions;
+				 * otherwise, accumulate them so that they can be processed at
+				 * commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5251932..1661190 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -856,6 +856,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2201,7 +2204,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases where we skip processing the
+ * transaction (see ReorderBufferForget), we need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2212,17 +2219,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2250,6 +2275,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark top-level transaction as having catalog changes too if one of its
+	 * children has so that the ReorderBufferBuildTupleCidHash can
+	 * conveniently check just top-level transaction and decide whether to
+	 * build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..e3fa723 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of in-progress transactions.  See
+ *	CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -1094,6 +1098,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1510,48 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..ac3f5e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 019bd38..1055e99 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index bc5081c..463888c 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -61,4 +61,6 @@ extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
+
+extern void LogLogicalInvalidations(void);
 #endif							/* INVAL_H */
-- 
1.8.3.1

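In terms of observable WAL, every command that queues invalidations
inside a transaction now also emits an XLOG_XACT_INVALIDATIONS record at
command end, for example (a sketch, table name illustrative):

BEGIN;
ALTER TABLE t ADD COLUMN c int;  -- invalidations WAL-logged at command end
INSERT INTO t VALUES (1, 2);     -- decodable mid-transaction with the updated
                                 -- relation descriptor
COMMIT;                          -- invalidations are still written with the
                                 -- commit record as well, as before
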
v32-0003-Extend-the-logical-decoding-output-plugin-API-wi.patch

From f98fdd969536a0b2dde27cd6b2a8207a2c17f932 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v32 03/12] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, adding support for
streaming changes for large transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk of
changes streamed for a particular toplevel transaction.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 123 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 +++++++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  59 +++++
 6 files changed, 825 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..4a71826 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,97 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93c..18116c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
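+
+   <para>
+    For example, a streamed transaction that is rolled back might instead
+    end with a <function>stream_abort_cb</function> call rather than
+    <function>stream_commit_cb</function>:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of a block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of the block of changes
+
+stream_abort_cb(...);   &lt;-- abort of the streamed transaction
+</programlisting>
+   </para>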
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may exceed the memory limit before having
+    decoded a complete tuple that can be streamed (e.g. having decoded the
+    TOAST table insert, but not yet the insert on the main table it belongs
+    to).
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be..6ee59bd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. However, we consider streaming enabled when at
+	 * least one of the methods is defined, so that we can easily detect and
+	 * report missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475..deef318 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..2d9aa11 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if the transaction is streamed in
+ * multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from an in-progress
+ * transaction to a remote node (may be called repeatedly, if the transaction
+ * is streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1055e99..42bc817 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -387,6 +435,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
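
To tie the above together: a minimal sketch (not part of the patch) of how
an output plugin's init function might wire up the new streaming callbacks,
assuming the pg_decode_stream_* functions shown earlier; the pre-existing
required callbacks (begin_cb, change_cb, commit_cb, ...) are elided:

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* ... assign the regular callbacks (begin_cb, change_cb, ...) ... */

	/* streaming callbacks required for streaming mode */
	cb->stream_start_cb = pg_decode_stream_start;
	cb->stream_stop_cb = pg_decode_stream_stop;
	cb->stream_abort_cb = pg_decode_stream_abort;
	cb->stream_commit_cb = pg_decode_stream_commit;
	cb->stream_change_cb = pg_decode_stream_change;

	/* optional streaming callbacks */
	cb->stream_message_cb = pg_decode_stream_message;
	cb->stream_truncate_cb = pg_decode_stream_truncate;
}

Defining all of the required stream_* callbacks makes ctx->streaming true in
StartupDecodingContext(); defining only some of them would be reported as an
error by the wrappers in logical.c.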

v32-0004-Gracefully-handle-concurrent-aborts-of-transacti.patch

From f00e48115c804b5c798021ef53d46a42a3bf7129 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:49:40 +0530
Subject: [PATCH v32 04/12] Gracefully handle concurrent aborts of transactions
 being decoded.

When decoding committed transactions this is not an issue, and we never
decode transactions that abort before the decoding starts.

But for an upcoming patch that allows decoding of in-progress
transactions, this may cause failures when the output plugin consults
catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. The decoding logic on the
receipt of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.

Author: Dilip Kumar, Nikhil Sontakke, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/logicaldecoding.sgml         |  9 +++--
 src/backend/access/heap/heapam.c          | 10 ++++++
 src/backend/access/index/genam.c          | 53 +++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c        |  8 +++++
 src/backend/access/transam/xact.c         | 19 +++++++++++
 src/backend/replication/logical/logical.c | 10 ++++++
 src/include/access/tableam.h              | 55 +++++++++++++++++++++++++++++++
 src/include/access/xact.h                 |  4 +++
 src/include/replication/logical.h         |  1 +
 9 files changed, 166 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 18116c8..98b47b0 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7bd4570..b53f99a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at tableam
+	 * level API but this is called from many places so we need to ensure it
+	 * here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that a system
+	 * table scan is in progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle a concurrent abort of CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted. We can't directly use
+ * TransactionIdDidAbort because, after a crash, such a transaction might
+ * not have been marked as aborted.  See detailed comments in xact.c where
+ * the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29..a61e279 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -235,6 +235,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd24359..f8cf3bf 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching a tuple from the
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs because we are checking for the
+ * concurrent aborts only in systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2677,6 +2690,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4979,6 +4995,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6ee59bd..8deff89 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b3d2a6d..acb6c38 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1712,6 +1737,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1729,6 +1762,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1747,6 +1788,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1763,6 +1811,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index ac3f5e3..5f767eb 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
-- 
1.8.3.1
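
For reference, a rough sketch (not the patch's exact code; the
DecodeAndStreamChanges() helper is hypothetical) of how the decoding side
can consume the ERRCODE_TRANSACTION_ROLLBACK raised by the systable_* APIs
and stop streaming gracefully:

static void
StreamTXNSketch(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	MemoryContext ccxt = CurrentMemoryContext;

	PG_TRY();
	{
		/* arm concurrent-abort detection for this (sub)transaction */
		CheckXidAlive = txn->xid;

		DecodeAndStreamChanges(rb, txn);	/* hypothetical helper */

		CheckXidAlive = InvalidTransactionId;
	}
	PG_CATCH();
	{
		ErrorData  *errdata;

		/* switch back so CopyErrorData() doesn't run in ErrorContext */
		MemoryContextSwitchTo(ccxt);
		errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/*
			 * Concurrent abort detected: swallow the error and stop
			 * streaming.  The abort record will be decoded later and
			 * stream_abort will discard the changes on the subscriber.
			 */
			CheckXidAlive = InvalidTransactionId;
			FlushErrorState();
			FreeErrorData(errdata);
		}
		else
			PG_RE_THROW();
	}
	PG_END_TRY();
}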

v32-0005-Implement-streaming-mode-in-ReorderBuffer.patch

From a3d70dd34805535d8f93f208fafe68d27f02c133 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 08:40:51 +0530
Subject: [PATCH v32 05/12] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete TOAST or speculative insert, we spill to disk, because we
cannot yet assemble the complete tuple to stream.  As soon as we get the
complete tuple, we stream the transaction, including the serialized
changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with their toplevel xact) in WAL right away, and
thanks to logging the invalidation messages.

Each ReorderBufferChange carries a ReorderBufferTXN pointer, by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/replication/logical/reorderbuffer.c | 765 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  26 +
 3 files changed, 756 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1661190..fcdc91f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,15 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +378,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -764,6 +778,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1064,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1081,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1310,6 +1362,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1396,84 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is,
+	 * when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1624,171 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has catalog updates, we might decode the tuple using the
+ * wrong catalog version.  So to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction to which this change
+ * belongs.  During a catalog scan we can then check the status of that xid,
+ * and if it is aborted we report a specific error, so that we can stop
+ * streaming the current transaction and discard the changes streamed so far.
+ * We might have already streamed some changes of the aborted
+ * (sub)transaction, but that is fine: when we decode the abort, we will send
+ * an abort message that truncates the changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
-	}
 
-	snapshot_now = txn->base_snapshot;
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid has aborted; that will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Store the command id and snapshot at the end of the current stream, so
+ * that we can reuse them when sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.  This resets the TXN so that it can
+ * be used to stream the remaining data of the transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+
+	ReorderBufferToastReset(rb, txn);
+	if (specinsert != NULL)
+		ReorderBufferReturnChange(rb, specinsert);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true, the data will be sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1811,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1827,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream_start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +1929,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1689,7 +1970,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1747,7 +2028,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2040,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2071,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1845,14 +2125,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes; send the final message for this set
+		 * of changes depending on the streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2171,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2210,105 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming in not finished yet.
+			 * streaming mode and the streaming is not finished yet.
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
 
-		PG_RE_THROW();
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
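
For illustration, a hedged sketch of how an output plugin opts into this
streamed path; the reorderbuffer itself only consults ctx->streaming (see
ReorderBufferCanStream further down), and the startup-callback wiring shown
here is an assumption:

static void
my_startup_cb(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
			  bool is_init)
{
	/*
	 * A plugin that fills in the stream_* callbacks advertises support;
	 * otherwise large transactions are serialized to disk instead.
	 */
	ctx->streaming = true;
}
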
 
 /*
@@ -1931,6 +2335,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could have
+		 * loaded the caches as per this transaction's view (consider DDLs
+		 * that happened in it).  We don't want the decoding of future
+		 * transactions to use those cache entries, so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2420,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2135,8 +2559,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, while the transaction counter
+ * allows us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2577,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2155,19 +2589,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
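
To make the accounting rule concrete, a short worked example (numbers
invented for illustration):

/*
 * With streaming supported, queueing a 100-byte change into subxact S
 * of toplevel transaction T gives:
 *
 *    T->size  += 100      (accounting redirected from S to T)
 *    rb->size += 100
 *    S->size unchanged    (stays 0)
 *
 * so picking an eviction victim only needs to scan toplevel TXNs.
 */
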
 
 /*
@@ -2196,6 +2639,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2388,6 +2832,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming is supported, so their size
+ * is always 0), but here we can simply iterate over the limited number of
+ * toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +2895,38 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStream(rb))
+		{
+			/*
+			 * Pick the largest toplevel transaction and evict it from memory
+			 * by streaming the already decoded part.
+			 */
+			txn = ReorderBufferLargestTopTXN(rb);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2713,6 +3216,113 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all subtransactions to the
+	 * snapshot's xip array via SnapBuildCommittedTxn, we can't do that here;
+	 * instead we add them to the subxip array via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in subtransactions decoded so far
+	 * to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because we might have
+		 * acquired new sub-transactions after the last streaming run, and we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
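
Putting the pieces together, an illustrative call sequence for one large
transaction (the driver loop shown is an assumption):

/*
 *   ReorderBufferStreamTXN(rb, txn);     first run: snapshot built from
 *                                        base_snapshot plus subxacts
 *   ... more WAL decoded and queued ...
 *   ReorderBufferStreamTXN(rb, txn);     later runs: reuse and re-copy
 *                                        txn->snapshot_now / command_id
 *   ReorderBufferStreamCommit(rb, txn);  commit record read: stream the
 *                                        tail, send stream_commit, clean up
 */
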
+
 /*
  * Size of a change in memory.
  */
@@ -3812,6 +4422,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
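
A hedged sketch of the caller's side of that early return (the surrounding
visibility logic is assumed, not shown in this hunk):

	CommandId	cmin,
				cmax;

	if (!ResolveCminCmaxDuringDecoding(tuplecid_data, snapshot, htup,
									   buffer, &cmin, &cmax))
	{
		/*
		 * No mapping yet: while streaming, the change that assigns this
		 * CID may not be decoded yet, so treat it as a future command,
		 * i.e. not visible to snapshot_now.
		 */
		return false;
	}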
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..ae1759f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +182,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +268,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1

v32-0006-Bugfix-handling-of-incomplete-toast-spec-insert.patch

From 5b99bddd8cf643a69aa92c4b7ea4aeaab12f9da1 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 08:41:47 +0530
Subject: [PATCH v32 06/12] Bugfix handling of incomplete toast/spec insert.

---
 src/backend/access/heap/heapam.c                |   3 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/reorderbuffer.c | 441 ++++++++++++++++++------
 src/include/access/heapam_xlog.h                |   1 +
 src/include/replication/reorderbuffer.h         |  50 ++-
 5 files changed, 395 insertions(+), 117 deletions(-)
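
To make the problem this patch fixes concrete, consider an illustrative
(invented) WAL ordering for a single INSERT of a toasted value:

	xl_heap_insert  (toast relation, chunk 0)      <- incomplete from here
	xl_heap_insert  (toast relation, chunk 1)
	xl_heap_insert  (main table, tuple w/ pointer) <- complete again

A stream boundary between the toast chunks and the main-table insert would
ship changes the subscriber cannot reassemble, so streaming is only allowed
up to the last complete change.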

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b53f99a..e09c810 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7153eba..2010d5a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fcdc91f..4a94766 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -237,7 +252,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool partial_truncate);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -432,62 +448,71 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 /*
  * Free an ReorderBufferChange.
  */
-void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+static void
+ReorderBufferFreeChange(ReorderBuffer *rb, ReorderBufferChange *change)
 {
-	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
-
 	/* free contained data */
 	switch (change->action)
 	{
-		case REORDER_BUFFER_CHANGE_INSERT:
-		case REORDER_BUFFER_CHANGE_UPDATE:
-		case REORDER_BUFFER_CHANGE_DELETE:
-		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
-			if (change->data.tp.newtuple)
-			{
-				ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
-				change->data.tp.newtuple = NULL;
-			}
+	case REORDER_BUFFER_CHANGE_INSERT:
+	case REORDER_BUFFER_CHANGE_UPDATE:
+	case REORDER_BUFFER_CHANGE_DELETE:
+	case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+		if (change->data.tp.newtuple)
+		{
+			ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
+			change->data.tp.newtuple = NULL;
+		}
 
-			if (change->data.tp.oldtuple)
-			{
-				ReorderBufferReturnTupleBuf(rb, change->data.tp.oldtuple);
-				change->data.tp.oldtuple = NULL;
-			}
-			break;
-		case REORDER_BUFFER_CHANGE_MESSAGE:
-			if (change->data.msg.prefix != NULL)
-				pfree(change->data.msg.prefix);
-			change->data.msg.prefix = NULL;
-			if (change->data.msg.message != NULL)
-				pfree(change->data.msg.message);
-			change->data.msg.message = NULL;
-			break;
-		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
-			if (change->data.snapshot)
-			{
-				ReorderBufferFreeSnap(rb, change->data.snapshot);
-				change->data.snapshot = NULL;
-			}
-			break;
-			/* no data in addition to the struct itself */
-		case REORDER_BUFFER_CHANGE_TRUNCATE:
-			if (change->data.truncate.relids != NULL)
-			{
-				ReorderBufferReturnRelids(rb, change->data.truncate.relids);
-				change->data.truncate.relids = NULL;
-			}
-			break;
-		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
-		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
-		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-			break;
+		if (change->data.tp.oldtuple)
+		{
+			ReorderBufferReturnTupleBuf(rb, change->data.tp.oldtuple);
+			change->data.tp.oldtuple = NULL;
+		}
+		break;
+	case REORDER_BUFFER_CHANGE_MESSAGE:
+		if (change->data.msg.prefix != NULL)
+			pfree(change->data.msg.prefix);
+		change->data.msg.prefix = NULL;
+		if (change->data.msg.message != NULL)
+			pfree(change->data.msg.message);
+		change->data.msg.message = NULL;
+		break;
+	case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+		if (change->data.snapshot)
+		{
+			ReorderBufferFreeSnap(rb, change->data.snapshot);
+			change->data.snapshot = NULL;
+		}
+		break;
+		/* no data in addition to the struct itself */
+	case REORDER_BUFFER_CHANGE_TRUNCATE:
+		if (change->data.truncate.relids != NULL)
+		{
+			ReorderBufferReturnRelids(rb, change->data.truncate.relids);
+			change->data.truncate.relids = NULL;
+		}
+		break;
+	case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+	case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+	case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		break;
 	}
 
 	pfree(change);
 }
+
+/*
+ * Free an ReorderBufferChange and update memory accounting.
+ */
+void
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+{
+	/* update memory accounting info */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
+	/* free contained data */
+	ReorderBufferFreeChange(rb, change);
+}
 
 /*
  * Get a fresh ReorderBufferTupleBuf fitting at least a tuple of size
@@ -638,16 +663,104 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
+ * Handle incomplete tuple during streaming.  If streaming is enabled then we
+ * might need to stream the in-progress transaction.  So the problem is that
+ * sometime we might get some incomplete changes which we can not stream
+ * until we get the complete change. e.g.  toast table insert without the main
+ * table insert.  So this function remember the lsn of the last complete change
+ * and the complete size upto last complete lsn so that if we need to stream
+ * we can only stream upto last complete lsn.
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert, Size total_size)
+{
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * If this is the first incomplete change, remember the size of the
+	 * changes that were complete before it.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		(toast_insert || IsSpecInsert(change->action)))
+		toptxn->complete_size = total_size;
+
+	/*
+	 * If this is a toast insert, set the corresponding bit.  Both inserts
+	 * and updates perform inserts into the toast table, and as explained in
+	 * the function header we cannot stream toast-only changes.  So whenever
+	 * we get a toast insert we set the flag, and we clear it again on the
+	 * next insert or update on the main table.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get a speculative insert, to
+	 * indicate a partial tuple, and clear it again on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If we have no incomplete change left after this one, record this LSN
+	 * as the last complete lsn.
+	 */
+	if (!(rbtxn_has_incomplete_tuple(toptxn)))
+	{
+		toptxn->last_complete_lsn = change->lsn;
+
+		/*
+		 * If the transaction is serialized and the changes in the top-level
+		 * transaction are now complete, stream it immediately.  We don't wait
+		 * for the memory limit to be reached again because, if the
+		 * transaction was serialized, we had already reached the limit but
+		 * could not stream at that time due to the incomplete tuple, so we
+		 * stream as soon as the tuple is complete.  Also, if we didn't stream
+		 * the serialized changes and got more incomplete changes in this
+		 * transaction, we would have no way to partly truncate the
+		 * serialized changes.
+		 */
+		if (rbtxn_is_serialized(txn))
+			ReorderBufferStreamTXN(rb, toptxn);
+	}
+}
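
An illustrative walk-through of the flag transitions in the function above
(the sequence is an invented example):

/*
 *   toast insert       -> sets RBTXN_HAS_TOAST_INSERT and freezes
 *                         complete_size at the pre-change total
 *   toast insert       -> RBTXN_HAS_TOAST_INSERT stays set
 *   main-table insert  -> clears RBTXN_HAS_TOAST_INSERT and advances
 *                         last_complete_lsn to this change's lsn
 */
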
+
+/*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
+	Size	total_size = 0;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes we detected that the transaction
+	 * was aborted, so there is no point in collecting further changes for
+	 * it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		ReorderBufferFreeChange(rb, change);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -656,9 +769,28 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries++;
 	txn->nentries_mem++;
 
+	/*
+	 * Get the total size of the top transaction before accounting for the
+	 * current change, so that if this change is incomplete we know the size
+	 * prior to it.  That is used to track the size of the complete changes
+	 * in the top transaction for streaming.
+	 */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			total_size = txn->toptxn->total_size;
+		else
+			total_size = txn->total_size;
+	}
+
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* Handle the incomplete tuple, if streaming is enabled. */
+	if (ReorderBufferCanStream(rb))
+		ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert,
+										   total_size);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -688,7 +820,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1399,11 +1531,46 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * Discard changes from a transaction (and subtransactions), after streaming
  * them. Keep the remaining info - transactions, tuplecids, invalidations and
  * snapshots.
+ *
+ * If partial_truncate is false, we completely truncate the transaction;
+ * otherwise we truncate up to last_complete_lsn.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 bool partial_truncate)
 {
 	dlist_mutable_iter iter;
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * A serialized transaction should never be partly truncated, because if
+	 * it is serialized we stream it as soon as its changes are complete.
+	 */
+	Assert(!(rbtxn_is_serialized(txn) && partial_truncate));
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked
+	 * as streamed always, even if it does not contain any changes (that
+	 * is, when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts
+	 * for XIDs the downstream is not aware of. And of course, it always
+	 * knows about the toplevel xact (we send the XID in all messages),
+	 * but we never stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1420,7 +1587,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, partial_truncate);
 	}
 
 	/* cleanup changes in the toplevel txn */
@@ -1433,6 +1600,14 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
+		/* We have truncated up to the last complete lsn, so stop. */
+		if (partial_truncate && (change->lsn > toptxn->last_complete_lsn))
+		{
+			/* The transaction must have incomplete changes. */
+			Assert(rbtxn_has_incomplete_tuple(toptxn));
+			break;
+		}
+
 		/* remove the change from its containing list */
 		dlist_delete(&change->node);
 
@@ -1440,24 +1615,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The toplevel transaction, identified by (toptxn==NULL), is marked as
-	 * streamed always, even if it does not contain any changes (that is, when
-	 * all the changes are in subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
-	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
 	 * values, but this seems simpler and good enough for now.
@@ -1468,9 +1625,39 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
-	/* also reset the number of entries in the transaction */
-	txn->nentries_mem = 0;
-	txn->nentries = 0;
+	/*
+	 * Adjust nentries/nentries_mem based on the changes processed.  See
+	 * comments where nprocessed is declared.
+	 */
+	if (partial_truncate)
+	{
+		txn->nentries -= txn->nprocessed;
+		txn->nentries_mem -= txn->nprocessed;
+	}
+	else
+	{
+		txn->nentries = 0;
+		txn->nentries_mem = 0;
+	}
+	txn->nprocessed = 0;
+
+	/*
+	 * If this is a top transaction, we can reset last_complete_lsn and
+	 * complete_size, because by now we will have streamed all the changes
+	 * up to last_complete_lsn.
+	 */
+	if (partial_truncate && (txn->toptxn == NULL))
+	{
+		toptxn->last_complete_lsn = InvalidXLogRecPtr;
+		toptxn->complete_size = 0;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
 }
 
 /*
@@ -1754,7 +1941,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed. */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Stop the stream. */
 	rb->stream_stop(rb, txn, last_lsn);
@@ -1789,6 +1976,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
 	ReorderBufferChange *volatile specinsert = NULL;
 	volatile bool stream_started = false;
+	volatile bool	partial_truncate = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1851,7 +2040,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			/* Set the xid for concurrent abort check. */
 			if (streaming)
-				SetupCheckXidLive(change->txn->xid);
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
 
 			switch (change->action)
 			{
@@ -2108,6 +2300,27 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					elog(ERROR, "tuplecid value in changequeue");
 					break;
 			}
+
+			if (streaming)
+			{
+				/*
+				 * Increment the nprocessed count.  See the detailed comment
+				 * for usage of this in ReorderBufferTXN structure.
+				 */
+				curtxn->nprocessed++;
+
+				/*
+				 * If the transaction contains an incomplete tuple and this is
+				 * the last complete change, stop further processing of the
+				 * transaction and set the partial truncate flag to true.
+				 */
+				if (rbtxn_has_incomplete_tuple(txn) &&
+					prev_lsn == txn->last_complete_lsn)
+				{
+					partial_truncate = true;
+					break;
+				}
+			}
 		}
 
 		/*
@@ -2179,7 +2392,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, partial_truncate);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2228,6 +2441,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
+			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2506,7 +2720,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2555,7 +2769,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2578,6 +2792,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2592,8 +2807,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	txn = change->txn;
 
 	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2601,12 +2821,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2848,18 +3076,28 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+	Size	largest_size = 0;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
+		Size	size = 0;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/*
+		 * If this transaction has incomplete changes, only consider the
+		 * size up to the last complete lsn.
+		 */
+		if (rbtxn_has_incomplete_tuple(txn))
+			size = txn->complete_size;
+		else
+			size = txn->total_size;
+
+		/* If the current transaction is larger, remember it. */
+		if ((largest == NULL || size > largest_size) && size > 0)
+		{
 			largest = txn;
+			largest_size = size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2897,18 +3135,13 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		 * Pick the largest transaction (or subtransaction) and evict it from
 		 * memory by streaming, if supported. Otherwise, spill to disk.
 		 */
-		if (ReorderBufferCanStream(rb))
+		if (ReorderBufferCanStream(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
 		{
-			/*
-			 * Pick the largest toplevel transaction and evict it from memory
-			 * by streaming the already decoded part.
-			 */
-			txn = ReorderBufferLargestTopTXN(rb);
-
 			/* we know there has to be one, because the size is not zero */
 			Assert(txn && !txn->toptxn);
-			Assert(txn->size > 0);
-			Assert(rb->size >= txn->size);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
 			ReorderBufferStreamTXN(rb, txn);
 		}
@@ -2926,14 +3159,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(rb->size >= txn->size);
 
 			ReorderBufferSerializeTXN(rb, txn);
-		}
 
-		/*
-		 * After eviction, the transaction should have no entries in memory,
-		 * and should use 0 bytes for changes.
-		 */
-		Assert(txn->size == 0);
-		Assert(txn->nentries_mem == 0);
+			/*
+			 * After eviction, the transaction should have no entries in memory, and
+			 * should use 0 bytes for changes.
+			 */
+			Assert(txn->size == 0);
+			Assert(txn->nentries_mem == 0);
+		}
 	}
 
 	/* We must be under the memory limit now. */
@@ -3317,10 +3550,6 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/* Process and send the changes to output plugin. */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
-
-	Assert(dlist_is_empty(&txn->changes));
-	Assert(txn->nentries == 0);
-	Assert(txn->nentries_mem == 0);
 }
 
 /*
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ae1759f..cbed1e8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -163,6 +163,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -182,6 +184,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -190,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -339,6 +357,26 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* Size of the complete changes. */
+	Size		complete_size;
+
+	/* LSN of the last complete change. */
+	XLogRecPtr	last_complete_lsn;
+
+	/*
+	 * Number of changes processed.  This is used to keep track of changes
+	 * that remain to be streamed.  As of now, this can happen either due to
+	 * toast tuples or speculative insertions, where we need to wait for
+	 * multiple changes before we can send them.
+	 */
+	uint64		nprocessed;
+
+	/* If we have detected a concurrent abort, ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -515,7 +553,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool incomplete_data);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

v32-0007-Extend-the-BufFile-interface-for-the-streaming-o.patch

From 3ac666268832074dcdff594d4a2a938450ffbb86 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v32 07/12] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up
to a particular offset.  Extend the BufFileSeek API to support the
SEEK_END case.  Add an option to provide a mode while opening the shared
BufFiles, instead of always opening in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2..6c97f68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349..c08ff4f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as
+ * a member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY);
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,21 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the file size of the last file to get the last offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files, down to the fileno at which we truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * We can directly delete the files beyond the fileno.  The fileno
+		 * file itself can also be deleted if the offset is 0, unless it is
+		 * the first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
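
To make the delete-vs-truncate rule above concrete, here is a worked example
(my reading of the code, assuming a BufFile with four segment files 0..3):

    /*
     * BufFileTruncateShared(file, 1, 100):
     *   - segments 3 and 2 are deleted outright (they lie past fileno);
     *   - segment 1 is truncated to offset 100 (offset != 0);
     *   - segment 0 is untouched (the loop stops at fileno).
     * Afterwards numFiles = 2 and curOffset = 100.
     *
     * With offset = 0 instead, segment 1 is deleted as well (it is not the
     * first segment), leaving numFiles = 1 and curOffset at
     * MAX_PHYSICAL_FILESIZE, i.e. the logical end of segment 0.
     */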
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but the files need to be opened and closed multiple times
+ * and the underlying files need to survive across transactions.  For such
+ * cases, the dsm segment 'seg' should be passed as NULL.  We remove such
+ * files on proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering this
+			 * cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on process exit.  It walks the
+ * list of all the registered sharedfilesets and deletes the underlying
+ * files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is following the DSM-based cleanup then we don't
+	 * maintain the filesetlist, so there is nothing to do.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
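
For illustration, the intended single-backend lifecycle looks roughly like
this (a sketch of mine; "ApplyContext" stands in for whatever long-lived
memory context the caller uses, and the real user is the apply worker in a
later patch):

    SharedFileSet *fileset;
    BufFile    *file;

    /* long-lived allocation, so the fileset survives the transaction */
    fileset = MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));

    /* seg == NULL: no DSM segment, cleanup is registered for proc exit */
    SharedFileSetInit(fileset, NULL);

    /* transaction 1: create and write a file */
    file = BufFileCreateShared(fileset, "spool");
    BufFileClose(file);

    /* transaction 2: the file is still there, reopen it read-write */
    file = BufFileOpenShared(fileset, "spool", O_RDWR);
    BufFileClose(file);

    /* done: deleting the last file also unregisters the fileset */
    BufFileDeleteShared(fileset, "spool");
    pfree(fileset);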
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

v32-0008-Add-support-for-streaming-to-built-in-replicatio.patch

From ca5b5e5c12e59bd6d964f4d352dc3a84dd4c39a5 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 08:57:16 +0530
Subject: [PATCH v32 08/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions, by spilling the data to disk and then
replaying it on commit.

We however must explicitly disable streaming during replication slot
creation, even if the plugin supports it.  We don't need to replicate
the changes accumulated during this phase, and moreover we don't have a
replication connection open, so we don't have anywhere to send the data
anyway.
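
For example, with this patch a user would request streaming per subscription
with CREATE SUBSCRIPTION ... WITH (streaming = on), and could later toggle it
with ALTER SUBSCRIPTION ... SET (streaming = off); when the option is off
(the default), the publisher keeps the current behavior of decoding the
whole transaction at commit time.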
---
 doc/src/sgml/ref/alter_subscription.sgml           |   4 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  45 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   3 +
 src/backend/replication/logical/proto.c            | 140 +++-
 src/backend/replication/logical/worker.c           | 917 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 318 ++++++-
 src/backend/replication/slotfuncs.c                |   6 +
 src/backend/replication/walsender.c                |   6 +
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  42 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 20 files changed, 1923 insertions(+), 41 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index c24ace1..d8de56c 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 5bbc165..c25b7c5 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731..f28482f 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026..9065a1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68..479e3ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..83d0642 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
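
To illustrate how these helpers are meant to pair up on the wire, a rough
sketch (mine, not actual pgoutput code; ctx and txn stand for the usual
output plugin callback arguments):

    /* publisher: frame one streamed chunk of an in-progress transaction */
    logicalrep_write_stream_start(ctx->out, txn->xid, first_segment);
    /* then one logicalrep_write_insert/update/delete per change, each
       prefixed with the (sub)transaction's XID */
    logicalrep_write_stream_stop(ctx->out);

    /* eventually, exactly one of these ends the transaction */
    logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
    logicalrep_write_stream_abort(ctx->out, txn->xid, subxid);

The subscriber side decodes each message with the matching
logicalrep_read_stream_*() function, as the apply worker below does.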
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f90a896..ea5874c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires handling aborts of both the toplevel transaction and
+ * subtransactions.  This is achieved by tracking offsets for
+ * subtransactions, which are then used to truncate the file with the
+ * serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of normal temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides automatic cleanup on error, and (c) the files can
+ * survive across local transactions, so they can be opened and closed at
+ * each stream start and stop.  We use the SharedFileSet infrastructure
+ * because a plain BufFile is deleted as soon as it is closed, while keeping
+ * the stream files open across start/stop would consume a lot of memory
+ * (more than 8K per file).  Moreover, without a SharedFileSet we would also
+ * need to invent a new way to pass filenames to the BufFile APIs so that
+ * the desired file could be reopened across multiple stream-open calls for
+ * the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, and along with it create the streaming file and store the
+ * fileset handle.  The subxact file is created only if there is any subxact
+ * info under this xid.  This entry is used on the subsequent streams for
+ * the xid to get the corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with the
+ * shared file sets for the streaming and subxact files.  On every stream
+ * start we need to open the xid's files, and for that we need the shared
+ * file set handles, so storing them in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -553,6 +693,305 @@ apply_handle_origin(StringInfo s)
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the BufFiles,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/*
+	 * Make sure the handle apply_dispatch methods are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -565,6 +1004,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1022,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1061,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1179,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1324,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1697,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1838,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1493,6 +1966,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  This context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1597,7 +2078,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1941,6 +2422,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there is no subtransaction then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it is not already created, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info.  There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context.  We need
+	 * this information for the whole stream so that we can add new
+	 * subtransaction info to it.  On stream stop we flush this information to
+	 * the subxact file and reset the logical streaming context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as the offset
+	 * of this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
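+
+/*
+ * A usage sketch (assumed here, not shown in this hunk): on ROLLBACK TO
+ * SAVEPOINT the apply side can look up the aborted XID in subxacts[] and
+ * truncate the changes file back to the remembered (fileno, offset),
+ * discarding only that subxact's changes.
+ */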
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
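+
+/*
+ * For example, with subscription OID 16384 and toplevel XID 512, these
+ * produce "16384-512.subxacts" and "16384-512.changes" respectively (the
+ * OID and XID values here are illustrative).
+ */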
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context so that
+	 * they stay open until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length field itself), action code (identifying the message type) and
+ * message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
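+ *
+ * Given the writes below, the on-disk record layout is:
+ *
+ *   int    len      size of action + data, excluding the length field itself
+ *   char   action   message type character
+ *          data     message contents, without the subxact XID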
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3020,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3..1509f9b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order in which the transactions are sent.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this,
+ * we maintain a list of xids (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
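+
+		/*
+		 * For illustration (slot and publication names made up), a
+		 * subscriber requests streaming through the replication protocol
+		 * option list:
+		 *
+		 *   START_REPLICATION SLOT "sub" LOGICAL 0/0
+		 *       (proto_version '2', publication_names '"pub"', streaming 'on')
+		 */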
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -290,9 +373,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change.  We don't
+	 * care whether it's a top-level transaction or not (we have already sent
+	 * that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may not be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +423,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +465,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +486,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +518,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +562,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +582,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +607,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +639,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -588,6 +720,91 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside the streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside the streaming block, even for
+	 * streamed transactions. The transaction has to be marked as streamed,
+	 * though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
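+
+/*
+ * Taken together, the callbacks above produce a message sequence like
+ * (a sketch):
+ *
+ *   stream_start(first)  change ...  stream_stop
+ *   stream_start         change ...  stream_stop
+ *   ...
+ *   stream_commit | stream_abort
+ *
+ * with the final commit/abort arriving outside any start/stop block, as
+ * the asserts above require.
+ */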
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -624,6 +841,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions, so a
+ * linear search of the list is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Remember in the relation sync entry that we have already sent the schema
+ * of the relation in the given streamed transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -753,12 +1002,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -793,7 +1075,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9fe147b..d93312c 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -158,6 +158,12 @@ create_logical_replication_slot(char *name, char *plugin,
 									NULL, NULL, NULL);
 
 	/*
+	 * Make sure streaming is disabled here - we may have the methods,
+	 * but we don't have anywhere to send the data yet.
+	 */
+	ctx->streaming = false;
+
+	/*
 	 * If caller needs us to determine the decoding start point, do so now.
 	 * This might take a while.
 	 */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5e2210d..bc36c78 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										WalSndUpdateProgress);
 
 		/*
+		 * Make sure streaming is disabled here - we may have the methods,
+		 * but we don't have anywhere to send the data yet.
+		 */
+		ctx->streaming = false;
+
+		/*
 		 * Signal that we don't need the timeout mechanism. We're just
 		 * creating the replication slot and don't yet accept feedback
 		 * messages or send keepalives. As we possibly need to wait for
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561..89158ed 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c75dceb..56517a9 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
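+# Use a low limit so that even this test's modest transactions exceed it
+# and get streamed to the subscriber instead of being kept in memory.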
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
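+# 5000 rows exist in total and the DELETE removes the 1666 with a % 3 = 0,
+# so 3334 remain, all carrying the local defaults for c and d.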
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
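+# 2500 rows were inserted overall and each batch's DELETE removed the
+# multiples of 3, leaving 1667 rows with the local defaults for c and d.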
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of a large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
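+# Expected counts: 2002 rows in total; c is populated from row 4 onwards
+# (1999 rows), d from row 1001 onwards (1002 rows), and e only on row
+# 2002 (1 row).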
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of a large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with multiple subtransactions and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
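+# ROLLBACK TO s1 discards rows 501..2500, so only the 2 initial rows plus
+# 3..500 and 2501..3000 survive: 1000 rows, none with column c set.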
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
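+# ROLLBACK TO s1 keeps rows 3..500 (inserted before column c existed) and
+# the re-inserted rows 501..1000 (with c = i), plus the 2 initial rows:
+# 1000 rows, 500 of them with c set.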
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v32-0009-Enable-streaming-for-all-subscription-TAP-tests.patch

From 4471ce7a90c9fb5124a2b37dd69d02f78ffef1e6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v32 09/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f8..4ba8086 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v32-0010-Add-TAP-test-for-streaming-vs.-DDL.patch
From 7ca75ca8d67e0661d31fa3e9f17498e8be434c14 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v32 10/12] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of a large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v32-0011-Provide-new-api-to-get-the-streaming-changes.patch
From 35afa74035d625921ed4a4049ff09af4131ad250 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v32 11/12] Provide new api to get the streaming changes

---
 .gitignore                                     |  1 +
 doc/src/sgml/test-decoding.sgml                | 22 ++++++++++++++++++++++
 src/backend/catalog/system_views.sql           |  8 ++++++++
 src/backend/replication/logical/logicalfuncs.c | 23 ++++++++++++++++++-----
 src/include/catalog/pg_proc.dat                |  9 +++++++++
 5 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index 794e35b..6083744 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..eed6e9d 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_streaming_changes('test_slot', NULL, NULL);
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b6d35c2..eed7b7f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1237,6 +1237,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';
 
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_peek_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index b99c94e..70c28ff 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -108,7 +108,8 @@ check_permissions(void)
  * Helper function for the various SQL callable logical decoding functions.
  */
 static Datum
-pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
+pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm,
+								 bool binary, bool streaming)
 {
 	Name		name;
 	XLogRecPtr	upto_lsn;
@@ -252,6 +253,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 							NameStr(*name)),
 					 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
 
+		/* If the caller has not asked for streaming changes then disable it. */
+		ctx->streaming &= streaming;
+
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
@@ -362,7 +366,16 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 Datum
 pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, false);
+}
+
+/*
+ * SQL function to get the streaming changes as text, consuming the data.
+ */
+Datum
+pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
+{
+	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
 }
 
 /*
@@ -371,7 +384,7 @@ pg_logical_slot_get_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, false);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, false, false);
 }
 
 /*
@@ -380,7 +393,7 @@ pg_logical_slot_peek_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, true, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, true, true, false);
 }
 
 /*
@@ -389,7 +402,7 @@ pg_logical_slot_get_binary_changes(PG_FUNCTION_ARGS)
 Datum
 pg_logical_slot_peek_binary_changes(PG_FUNCTION_ARGS)
 {
-	return pg_logical_slot_get_changes_guts(fcinfo, false, true);
+	return pg_logical_slot_get_changes_guts(fcinfo, false, true, false);
 }
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 95604e9..6eebfbb 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10136,6 +10136,15 @@
   proargmodes => '{i,i,i,v,o,o,o}',
   proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
   prosrc => 'pg_logical_slot_get_binary_changes' },
+{ oid => '6150', descr => 'get streaming changes from replication slot',
+  proname => 'pg_logical_slot_get_streaming_changes', procost => '1000',
+  prorows => '1000', provariadic => 'text', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'name pg_lsn int4 _text',
+  proallargtypes => '{name,pg_lsn,int4,_text,pg_lsn,xid,text}',
+  proargmodes => '{i,i,i,v,o,o,o}',
+  proargnames => '{slot_name,upto_lsn,upto_nchanges,options,lsn,xid,data}',
+  prosrc => 'pg_logical_slot_get_streaming_changes' },
 { oid => '3784', descr => 'peek at changes from replication slot',
   proname => 'pg_logical_slot_peek_changes', procost => '1000',
   prorows => '1000', provariadic => 'text', proisstrict => 'f',
-- 
1.8.3.1

v32-0012-Add-streaming-option-in-pg_dump.patch
From 33d939b9a97c7cf16e5363271791d503a5499333 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v32 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e758b5c..ff2ae37 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb..af64270 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
1.8.3.1

#434Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#433)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jul 14, 2020 at 5:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 13, 2020 at 4:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, in that case, we can do both enable and disable streaming in
this function itself rather than allow the caller to later modify it.
I suggest similarly we can enable/disable it for SQL API in
pg_decode_startup via output_plugin_options. This way it will look
consistent for both SQL APIs and for command-based replication. If we
can do so, then probably adding an Assert for Consistent Snapshot
while performing streaming should be okay.

Sounds good to me.
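
For reference, once streaming is controlled through an output plugin
option in pg_decode_startup, the SQL API usage would look roughly like
the sketch below (based on the 'streaming-changes' option added to
test_decoding in this patch set; the slot name is illustrative):

    -- create a slot using the test_decoding output plugin
    SELECT 'init' FROM pg_create_logical_replication_slot('my_slot', 'test_decoding');

    -- ask the plugin to stream in-progress transactions; without the
    -- 'streaming-changes' option the SQL API keeps the old behavior
    SELECT data FROM pg_logical_slot_get_changes('my_slot', NULL, NULL,
        'include-xids', '0', 'streaming-changes', '1');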

Please find the latest patches. I have made changes only in the
subscriber-side patches (0007 and 0008 as per the current patch-set).
The main changes are:
1. As discussed above, remove SendFeedback call from apply_handle_stream_commit
2. In SharedFilesetInit, ensure to register callback once
3. In stream_open_file, change slight handling around MemoryContexts
4. Merged the subscriber-side patches.
5. Added/Edited comments in 0007 and 0008.

I have reviewed your changes and they look good to me; please find
the latest version of the patch set. The major changes:
- Fixed a couple of review comments suggested upthread in 0003 and 0005.
- Handle the case of stopping streaming until we reach the
start_decoding_at LSN in 0005.
- Simplified 0006 by avoiding sending transactions with incomplete
changes, and added a comment atop ReorderBufferLargestTopTXN.
- Moved 0010 to 0007 and handled the pending comments in the same.
- In 0009, fixed a couple of defects mentioned above, plus one
additional defect: ALTER SUBSCRIPTION ... SET (streaming = off/on)
was not working.
- In 0009, send the origin id.
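
For completeness, the subscriber-side toggle exercised by that fix is
the subscription option itself; a minimal sketch (connection string and
object names are illustrative):

    -- enable streaming of large in-progress transactions
    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION mypub WITH (streaming = on);

    -- later, fall back to spilling such transactions to disk
    ALTER SUBSCRIPTION mysub SET (streaming = off);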

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v33.tar (application/x-tar)
v33/v33-0007-Provide-a-new-option-to-get-the-streaming-change.patch
From de1c87f722cb1561e89e4e4e4a740c81ac466e17 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Sat, 2 May 2020 11:41:59 +0530
Subject: [PATCH v33 07/12] Provide a new option to get the streaming changes

---
 .gitignore                                  |  1 +
 contrib/test_decoding/Makefile              |  2 +-
 contrib/test_decoding/expected/stream.out   | 40 +++++++++++++++++++++
 contrib/test_decoding/expected/truncate.out |  6 ++++
 contrib/test_decoding/sql/stream.sql        | 21 +++++++++++
 contrib/test_decoding/sql/truncate.sql      |  1 +
 contrib/test_decoding/test_decoding.c       | 13 +++++++
 doc/src/sgml/test-decoding.sgml             | 22 ++++++++++++
 8 files changed, 105 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/.gitignore b/.gitignore
index 794e35b73c..6083744c07 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c582a5..ed9a3d6c0e 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000000..7a78c5b43c
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,40 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'streaming-changes', '1');
+ count 
+-------
+   157
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae835c..e64d377214 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000000..838824bbd6
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'streaming-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0881..5633854e0d 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 4a718263c2..4616df038c 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "streaming-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..c9b090f004 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'streaming-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
-- 
2.23.0

v33/v33-0001-Immediately-WAL-log-subtransaction-and-top-level.patch
From 340bfa7ea83be902a2783e7a3fd95b40185d3860 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v33 01/12] Immediately WAL-log subtransaction and top-level
 XID association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead) only when wal_level=logical.
We cannot remove the existing XLOG_XACT_ASSIGNMENT WAL record as that is
required for avoiding overflow in the hot standby snapshot.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 ++++++++++-
 src/backend/access/transam/xlogreader.c  |  5 +++
 src/backend/replication/logical/decode.c | 44 +++++++++++----------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b3ee7fa7ea..bd4c3cf325 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5120,6 +5122,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6022,3 +6025,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL log the top-level XID for
+ * operation in a subtransaction.  We require that for logical decoding, see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09e..c526bb1928 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4f46..a757baccfc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1197,6 +1197,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1235,6 +1236,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3abf8..0c0c371739 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index db191879b9..aef8555367 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5b14334887..d8391aa378 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b0f2a6ed43..b976882229 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -304,6 +306,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0194..2f0c8bf589 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
2.23.0

v33/v33-0005-Implement-streaming-mode-in-ReorderBuffer.patch
From 8f8e19afc06fa151883fa9d2ed95c79e6762ccd6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 9 Jul 2020 14:13:02 +0530
Subject: [PATCH v33 05/12] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast chunk or a speculative insert, we spill to disk because
we cannot generate the complete tuple to stream.  As soon as we get the
complete tuple, we stream the transaction including the serialized
changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

We have ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/heap/heapam_visibility.c   |  42 +-
 .../replication/logical/reorderbuffer.c       | 796 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  26 +
 3 files changed, 788 insertions(+), 76 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..c77128087c 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 72e5dd1fd4..14b258a026 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -371,6 +383,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -767,6 +782,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -1022,6 +1069,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1036,6 +1086,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1313,6 +1366,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1338,6 +1400,84 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked as
+	 * streamed always, even if it does not contain any changes (that is, when
+	 * all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1489,57 +1629,177 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such case if the
+ * (sub)transaction has catalog update then we might decode the tuple using
+ * wrong catalog version.  So for detecting the concurrent abort we set
+ * CheckXidAlive to the current (sub)transaction's xid for which this change
+ * belongs to.  And, during catalog scan we can check the status of the xid and
+ * if it is aborted we will report a specific error so that we can stop
+ * streaming current transaction and discard the already streamed changes on
+ * such an error.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine because when we decode the abort
+ * we will stream abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the transaction is not committed yet.  We don't
+	 * check whether the xid aborted; that will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse the same while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN so that it can
+ * be used to stream the remaining data of the transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1562,14 +1822,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1577,6 +1838,33 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream_start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1653,7 +1941,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1693,7 +1982,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1751,7 +2040,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1760,10 +2052,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1794,7 +2083,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1849,14 +2137,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes; send the final message for this set
+		 * of changes depending on the streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1874,14 +2183,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1900,15 +2222,105 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then send the stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1935,6 +2347,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could have
+		 * loaded the caches as per the current transaction's view (consider
+		 * DDLs executed in this transaction). We don't want the decoding of
+		 * future transactions to use those cache entries, so execute
+		 * invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2004,6 +2432,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2139,8 +2571,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we update the toplevel transaction's counters
+ * instead - we can't stream subtransactions individually anyway, and we
+ * only pick toplevel transactions for eviction, so only their size matters.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2148,6 +2589,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2159,19 +2601,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2200,6 +2651,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2391,6 +2843,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming, so their size is always
+ * 0), but it only has to iterate over the limited number of toplevel
+ * transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
@@ -2423,11 +2907,38 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if supported. Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferStartStreaming(rb))
+		{
+			/*
+			 * Pick the largest toplevel transaction and evict it from memory
+			 * by streaming the already decoded part.
+			 */
+			txn = ReorderBufferLargestTopTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2725,6 +3236,138 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * We can't start streaming immediately even if streaming is enabled,
+	 * because the changes for this transaction might have already been
+	 * decoded.  So as soon as the current decoding LSN is >= the
+	 * start_decoding_at LSN, we can start streaming, because the commit of
+	 * any active transaction will be after that LSN.
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->ReadRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all the subtransactions to
+	 * the snapshot's xip array via SnapBuildCommittedTxn, we can't do that
+	 * here; instead, we add them to the subxip array via
+	 * ReorderBufferCopySnap.  This makes the catalog changes made in the
+	 * subtransactions decoded so far visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database so far, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3824,6 +4467,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples whose CID we
+	 * have not decoded yet.  Think e.g. of an INSERT followed by TRUNCATE,
+	 * where the TRUNCATE may not be decoded yet when applying the INSERT.
+	 * So in such cases, we assume the CID is from a future command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9d60ed8a89..b1d48c4d58 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +182,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -248,6 +267,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
-- 
2.23.0

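A note on the eviction logic in the hunk above: with
ReorderBufferCheckMemoryLimit and ReorderBufferLargestTopTXN, the decision
reduces to "stream the largest toplevel transaction if the plugin supports
streaming, otherwise spill the largest (sub)transaction to disk". Below is a
small stand-alone C sketch of just that selection; TxnModel, largest_toplevel
and the sizes are invented for illustration, this is not the PostgreSQL code
itself.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct TxnModel
{
	unsigned	xid;
	size_t		size;		/* bytes of decoded changes in memory */
	bool		is_subxact;
} TxnModel;

/* pick the largest toplevel transaction, as ReorderBufferLargestTopTXN does */
static TxnModel *
largest_toplevel(TxnModel *txns, int ntxns)
{
	TxnModel   *largest = NULL;

	for (int i = 0; i < ntxns; i++)
	{
		if (txns[i].is_subxact)
			continue;
		if (largest == NULL || txns[i].size > largest->size)
			largest = &txns[i];
	}
	return largest;
}

int
main(void)
{
	/* with streaming, a subxact's size is accounted to its toplevel txn */
	TxnModel	txns[] = {
		{100, 4096, false},
		{101, 16384, false},
		{102, 0, true},
	};
	size_t		total = 4096 + 16384;	/* rb->size analogue */
	size_t		limit = 8192;	/* logical_decoding_work_mem analogue */
	bool		can_stream = true;	/* ReorderBufferCanStream() analogue */

	while (total > limit)
	{
		TxnModel   *victim = largest_toplevel(txns, 3);

		/* stream if supported; otherwise this would spill to disk */
		printf("%s xid %u (%zu bytes)\n",
			   can_stream ? "streaming" : "spilling",
			   victim->xid, victim->size);
		total -= victim->size;
		victim->size = 0;	/* evicted: no longer in memory */
	}
	return 0;
}

After streaming, the victim's in-memory size drops to zero, which mirrors how
ReorderBufferTruncateTXN resets nentries_mem at the end of a streaming run.
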
v33/v33-0003-Extend-the-logical-decoding-output-plugin-API-wi.patch

From e1b1c63515d17d18761799824fd5e1f0e5553eb9 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v33 03/12] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, providing support for
streaming changes of large in-progress transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk of
changes streamed for a particular toplevel transaction.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 123 ++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 ++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 +++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 825 insertions(+)

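For orientation before the diff itself: from a plugin author's point of view,
opting into streaming means filling the new callback slots in
_PG_output_plugin_init. A minimal skeleton follows, with hypothetical
my_stream_* names and stubbed-out bodies; it would be built against the
server headers (e.g. via PGXS), and the regular begin/change/commit callbacks
that every output plugin still needs are omitted here for brevity.

#include "postgres.h"

#include "fmgr.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);

static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* a block of streamed changes for txn->xid is being opened */
}

static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* the current block of streamed changes is being closed */
}

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	/* discard whatever was streamed for this (sub)transaction */
}

static void
my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 XLogRecPtr commit_lsn)
{
	/* apply everything streamed for this transaction */
}

static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)
{
	/* forward one change of an in-progress transaction */
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* the five required stream callbacks; message/truncate are optional */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;
}
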
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..4a718263c2 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,97 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93cf6b..18116c8f3c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting.  At that point the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.  However, in some
+    cases we still have to spill to disk even if streaming is enabled,
+    because we may exceed the memory limit before a complete tuple has been
+    decoded (e.g. only the TOAST table insert has been decoded, but not yet
+    the insert into the main table).
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
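
The wrappers added to logical.c below all follow one shape: push an
error-context frame naming the callback, error out if a required callback is
missing (or silently return for the optional message/truncate ones), invoke
the callback, and pop the frame. Here is a stripped-down stand-alone model of
that shape, with the PostgreSQL error machinery replaced by invented
stand-ins (err_frame, call_stream_cb); it is only meant to show the pattern.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef void (*stream_cb) (unsigned xid);

/* stand-in for PostgreSQL's error_context_stack machinery */
struct err_frame
{
	const char *callback_name;
	struct err_frame *previous;
};

static struct err_frame *error_context_stack = NULL;

static void
call_stream_cb(const char *name, stream_cb cb, bool required, unsigned xid)
{
	struct err_frame frame = {name, error_context_stack};

	if (cb == NULL)
	{
		/* required callbacks error out; optional ones just do nothing */
		if (required)
		{
			fprintf(stderr, "logical streaming requires a %s callback\n",
					name);
			exit(1);
		}
		return;
	}

	error_context_stack = &frame;	/* errors now get attributed to "name" */
	cb(xid);
	error_context_stack = frame.previous;	/* pop the frame */
}

static void
demo_start(unsigned xid)
{
	printf("stream_start for xid %u\n", xid);
}

int
main(void)
{
	call_stream_cb("stream_start", demo_start, true, 100);
	call_stream_cb("stream_message", NULL, false, 100);	/* optional: no-op */
	return 0;
}

In the actual wrappers, stream_message_cb and stream_truncate_cb correspond
to the optional case here, and the other five to the required case.
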
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be3b0..6ee59bd38b 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. However, we enable streaming when at least one
+	 * of the methods is set, so that missing (required) methods can be
+	 * easily identified.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..2d9aa1172a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 74ffe7852f..9d60ed8a89 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -386,6 +434,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0

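To make the shape of the new API concrete, here is a minimal sketch (not
part of the patch series; the my_* names are hypothetical) of how an output
plugin could register the streaming callbacks in _PG_output_plugin_init.
Per the wrappers above, stream_change_cb is the only callback that is
mandatory once streaming is in use; the message and truncate callbacks may
stay NULL:

#include "postgres.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* open a block of streamed changes for txn->xid */
}

static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* close the current block of streamed changes */
}

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	/* discard whatever has been streamed for this (sub)transaction */
}

static void
my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 XLogRecPtr commit_lsn)
{
	/* apply everything streamed for this transaction */
}

static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)
{
	/* emit one streamed change; txn->xid identifies the in-progress xact */
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* ... begin_cb, change_cb, commit_cb etc. elided ... */

	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;
	/* stream_message_cb and stream_truncate_cb are optional */
}
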
v33/v33-0002-WAL-Log-invalidations-at-command-end-with-wal_le.patch

From 460d2de9f9003bc6089bd4ef90b8879bf6bdef89 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v33 02/12] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end uses a new xlog record type
XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in top-transaction, and then
executed during replay.  This obviates the need to decode the
invalidations as part of a commit record.

LogStandbyInvalidations accumulated all the invalidations in memory and
wrote them only once at commit time, which may reduce the performance
impact by amortizing the overhead and deduplicating the invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c        | 10 ++++
 src/backend/access/transam/xact.c             | 16 +++++
 src/backend/replication/logical/decode.c      | 58 +++++++++++--------
 .../replication/logical/reorderbuffer.c       | 52 ++++++++++++++---
 src/backend/utils/cache/inval.c               | 55 ++++++++++++++++++
 src/include/access/xact.h                     | 13 ++++-
 src/include/replication/reorderbuffer.h       |  3 +
 src/include/utils/inval.h                     |  2 +
 8 files changed, 176 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce75565f..68aa994c9e 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd4c3cf325..26e3c4dc4e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1224,6 +1224,15 @@ RecordTransactionCommit(void)
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
 
+	/*
+	 * Log any pending invalidations added between the last
+	 * CommandCounterIncrement and the commit.  Normally for DDL we log these
+	 * at each command end; however, in certain cases where we update a
+	 * system table directly, the invalidations were not logged at command end.
+	 */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
 	nchildren = xactGetCommittedChildren(&children);
@@ -6022,6 +6031,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371739..7153ebaa96 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions;
+				 * otherwise, accumulate them so that they can be processed
+				 * at commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7afa2271bd..72e5dd1fd4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -860,6 +860,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2205,7 +2208,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because, in some cases where we skip processing the
+ * transaction (see ReorderBufferForget), we still need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2216,17 +2223,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2254,6 +2279,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark the top-level transaction as having catalog changes too if one
+	 * of its children has them, so that ReorderBufferBuildTupleCidHash can
+	 * check just the top-level transaction to decide whether to build the
+	 * hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33be6..edd90773eb 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *	CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -1094,6 +1098,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL-log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1510,49 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end or at commit time if any invalidations
+ * are pending.
+ */
+void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555367..ac3f5e3b60 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -197,6 +197,17 @@ typedef struct xl_xact_assignment
 
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
+/*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 626ecf4dc9..74ffe7852f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index bc5081cf72..463888c389 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -61,4 +61,6 @@ extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
+
+extern void LogLogicalInvalidations(void);
 #endif							/* INVAL_H */
-- 
2.23.0

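As a small illustration of the new record type (a sketch, not part of the
patch; walk_xact_invalidations is a hypothetical consumer): the record is
just the fixed header followed by nmsgs SharedInvalidationMessage entries,
so its total length is MinSizeOfXactInvalidations plus
nmsgs * sizeof(SharedInvalidationMessage):

#include "postgres.h"
#include "access/xact.h"
#include "access/xlogreader.h"

static void
walk_xact_invalidations(XLogReaderState *record)
{
	xl_xact_invalidations *xlrec =
		(xl_xact_invalidations *) XLogRecGetData(record);
	int			i;

	for (i = 0; i < xlrec->nmsgs; i++)
	{
		SharedInvalidationMessage *msg = &xlrec->msgs[i];

		/*
		 * Hand each message to the consumer; in decode.c above this is
		 * ReorderBufferAddInvalidations (valid xid) or
		 * ReorderBufferImmediateInvalidation (xid-less).
		 */
		(void) msg;
	}
}
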
v33/v33-0011-Add-TAP-test-for-streaming-vs.-DDL.patch

From aa0f7744d000fe90e4ad314b7616a01306226dae Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v33 11/12] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v33/v33-0006-Bugfix-handling-of-incomplete-toast-spec-insert.patch

From 2d02f5f53b5a09fce43fb69ce9f0c4254fd56eac Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Mon, 13 Jul 2020 08:47:07 +0530
Subject: [PATCH v33 06/12] Bugfix handling of incomplete toast/spec insert

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/replication/logical/decode.c      |  17 +-
 .../replication/logical/reorderbuffer.c       | 271 +++++++++++++-----
 src/include/access/heapam_xlog.h              |   1 +
 src/include/replication/reorderbuffer.h       |  36 ++-
 5 files changed, 247 insertions(+), 81 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b53f99a5c5..e09c8101e7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7153ebaa96..2010d5a786 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 14b258a026..0416b20787 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -437,62 +452,71 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 /*
  * Free an ReorderBufferChange.
  */
-void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+static void
+ReorderBufferFreeChange(ReorderBuffer *rb, ReorderBufferChange *change)
 {
-	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
-
 	/* free contained data */
 	switch (change->action)
 	{
-		case REORDER_BUFFER_CHANGE_INSERT:
-		case REORDER_BUFFER_CHANGE_UPDATE:
-		case REORDER_BUFFER_CHANGE_DELETE:
-		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
-			if (change->data.tp.newtuple)
-			{
-				ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
-				change->data.tp.newtuple = NULL;
-			}
+	case REORDER_BUFFER_CHANGE_INSERT:
+	case REORDER_BUFFER_CHANGE_UPDATE:
+	case REORDER_BUFFER_CHANGE_DELETE:
+	case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+		if (change->data.tp.newtuple)
+		{
+			ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
+			change->data.tp.newtuple = NULL;
+		}
 
-			if (change->data.tp.oldtuple)
-			{
-				ReorderBufferReturnTupleBuf(rb, change->data.tp.oldtuple);
-				change->data.tp.oldtuple = NULL;
-			}
-			break;
-		case REORDER_BUFFER_CHANGE_MESSAGE:
-			if (change->data.msg.prefix != NULL)
-				pfree(change->data.msg.prefix);
-			change->data.msg.prefix = NULL;
-			if (change->data.msg.message != NULL)
-				pfree(change->data.msg.message);
-			change->data.msg.message = NULL;
-			break;
-		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
-			if (change->data.snapshot)
-			{
-				ReorderBufferFreeSnap(rb, change->data.snapshot);
-				change->data.snapshot = NULL;
-			}
-			break;
-			/* no data in addition to the struct itself */
-		case REORDER_BUFFER_CHANGE_TRUNCATE:
-			if (change->data.truncate.relids != NULL)
-			{
-				ReorderBufferReturnRelids(rb, change->data.truncate.relids);
-				change->data.truncate.relids = NULL;
-			}
-			break;
-		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
-		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
-		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-			break;
+		if (change->data.tp.oldtuple)
+		{
+			ReorderBufferReturnTupleBuf(rb, change->data.tp.oldtuple);
+			change->data.tp.oldtuple = NULL;
+		}
+		break;
+	case REORDER_BUFFER_CHANGE_MESSAGE:
+		if (change->data.msg.prefix != NULL)
+			pfree(change->data.msg.prefix);
+		change->data.msg.prefix = NULL;
+		if (change->data.msg.message != NULL)
+			pfree(change->data.msg.message);
+		change->data.msg.message = NULL;
+		break;
+	case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+		if (change->data.snapshot)
+		{
+			ReorderBufferFreeSnap(rb, change->data.snapshot);
+			change->data.snapshot = NULL;
+		}
+		break;
+		/* no data in addition to the struct itself */
+	case REORDER_BUFFER_CHANGE_TRUNCATE:
+		if (change->data.truncate.relids != NULL)
+		{
+			ReorderBufferReturnRelids(rb, change->data.truncate.relids);
+			change->data.truncate.relids = NULL;
+		}
+		break;
+	case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+	case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+	case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+		break;
 	}
 
 	pfree(change);
 }
+/*
+ * Free a ReorderBufferChange and update memory accounting.
+ */
+void
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+{
+	/* update memory accounting info */
+	ReorderBufferChangeMemoryUpdate(rb, change, false);
+
+	/* free contained data */
+	ReorderBufferFreeChange(rb, change);
+}
 
 /*
  * Get a fresh ReorderBufferTupleBuf fitting at least a tuple of size
@@ -642,17 +666,84 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 	return txn;
 }
 
+/*
+ * Handle an incomplete tuple during streaming.  If streaming is enabled, we
+ * may need to stream an in-progress transaction, but sometimes we receive
+ * incomplete changes that cannot be streamed until the complete change
+ * arrives, e.g. a toast table insert without the corresponding main table
+ * insert.
+ */
+static void
+ReorderBufferHandleIncompleteTuple(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								   ReorderBufferChange *change,
+								   bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * If this is a toast insert then set the corresponding bit.  Both insert
+	 * and update operations insert into the toast table, and as explained in
+	 * the function header we cannot stream toast-only changes.  So whenever
+	 * we see a toast insert we set the flag, and we clear it again on the
+	 * next insert or update on the main table.
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial tuple and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+
+	/*
+	 * If the transaction is serialized and the changes are now complete in
+	 * the top-level transaction, stream it immediately.  We don't wait for
+	 * the memory limit to be reached again because, in streaming mode, a
+	 * serialized transaction means we already hit the memory limit once but
+	 * could not stream the transaction at that time because of the
+	 * incomplete tuple.  So stream it now, as soon as the tuple is
+	 * complete.
+	 */
+	if (ReorderBufferStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) && rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
 /*
  * Queue a change into a transaction so it can be replayed upon commit.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * If we have detected a concurrent abort while streaming the previous
+	 * changes, there is no point in collecting further changes for this
+	 * transaction.
+	 */
+	if (txn->concurrent_abort)
+	{
+		ReorderBufferFreeChange(rb, change);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -664,6 +755,10 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* Handle the incomplete tuple, if streaming is enabled */
+	if (ReorderBufferCanStream(rb))
+		ReorderBufferHandleIncompleteTuple(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -693,7 +788,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1473,6 +1568,13 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
 	/* also reset the number of entries in the transaction */
 	txn->nentries_mem = 0;
 	txn->nentries = 0;
@@ -1800,6 +1902,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
 	ReorderBufferChange *volatile specinsert = NULL;
 	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1863,7 +1966,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			/* Set the xid for concurrent abort check. */
 			if (streaming)
-				SetupCheckXidLive(change->txn->xid);
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
 
 			switch (change->action)
 			{
@@ -2240,6 +2346,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
+			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2518,7 +2625,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2567,7 +2674,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2590,6 +2697,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2603,9 +2711,14 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 	txn = change->txn;
 
-	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	/* If streaming is supported, update the total size at the top level too. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2613,12 +2726,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2850,11 +2971,26 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
  * should give us the same transaction (because we don't update memory account
  * for subtransaction with streaming, so it's always 0). But we can simply
  * iterate over the limited number of toplevel transactions.
+ *
+ * XXX There is room for optimization here: instead of simply ignoring
+ * transactions that have any incomplete change, we could pick the largest
+ * transaction by comparing the size of only the complete changes in each
+ * transaction.  But that would add a lot of complexity to the code, and it
+ * might not be worth the benefit.  Basically, if we plan to stream partial
+ * transactions, we need a way to partially stream/truncate a transaction;
+ * moreover, if the transaction is already spilled, we might need a way to
+ * partially truncate the spilled files.  Also, whenever we partially stream
+ * a transaction we need to remember the last streamed LSN, so that next
+ * time we can restore from that segment and offset.  And things become even
+ * more complex because we stream changes from the top-level transaction,
+ * whereas we restore them per subtransaction, so we would also have to
+ * remember the subxact to which the last streamed change belongs.
  */
 static ReorderBufferTXN *
 ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 {
 	dlist_iter	iter;
+	Size		largest_size = 0;
 	ReorderBufferTXN *largest = NULL;
 
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
@@ -2863,15 +2999,15 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		/* If the current transaction is larger, remember it */
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
 			largest = txn;
+			largest_size = txn->total_size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2909,18 +3045,13 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		 * Pick the largest transaction (or subtransaction) and evict it from
 		 * memory by streaming, if supported. Otherwise, spill to disk.
 		 */
-		if (ReorderBufferStartStreaming(rb))
+		if (ReorderBufferStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
 		{
-			/*
-			 * Pick the largest toplevel transaction and evict it from memory
-			 * by streaming the already decoded part.
-			 */
-			txn = ReorderBufferLargestTopTXN(rb);
-
 			/* we know there has to be one, because the size is not zero */
 			Assert(txn && !txn->toptxn);
-			Assert(txn->size > 0);
-			Assert(rb->size >= txn->size);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
 			ReorderBufferStreamTXN(rb, txn);
 		}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b1d48c4d58..1589ffce30 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -163,6 +163,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -182,6 +184,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* Does this transaction have a toast insert without the main table insert? */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * Does this transaction have a speculative insert without the matching
+ * speculative confirm?
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -190,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -339,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -526,7 +550,9 @@ void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0

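To make the flag handling in this patch easier to follow, here is a sketch
(not part of the patch; txn_is_streamable is a hypothetical helper) of how
the incomplete-tuple flags evolve for an UPDATE that rewrites a toasted
column, using the rbtxn_* macros introduced above:

#include "postgres.h"
#include "replication/reorderbuffer.h"

/*
 * Flag transitions while queueing the changes of one UPDATE that rewrites
 * a toasted column (the toast-table inserts always precede the main-table
 * change):
 *
 *   queued change            effect on toptxn->txn_flags       streamable?
 *   INSERT into toast rel    sets RBTXN_HAS_TOAST_INSERT       no
 *   INSERT into toast rel    (flag already set)                no
 *   UPDATE on main rel       clears RBTXN_HAS_TOAST_INSERT     yes
 *
 * Speculative inserts work the same way: a SPEC_INSERT change sets
 * RBTXN_HAS_SPEC_INSERT and the matching SPEC_CONFIRM clears it.
 */
static bool
txn_is_streamable(ReorderBufferTXN *toptxn)
{
	/* streamable only once no toast/spec insert is left dangling */
	return !rbtxn_has_incomplete_tuple(toptxn);
}
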
v33/v33-0009-Add-support-for-streaming-to-built-in-replicatio.patch

From cda98d5f81f020f74ab7e7b6744d631ca6739f82 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 08:57:16 +0530
Subject: [PATCH v33 09/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so there is nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml      |   4 +-
 doc/src/sgml/ref/create_subscription.sgml     |  11 +
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/commands/subscriptioncmds.c       |  45 +-
 src/backend/postmaster/pgstat.c               |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |   3 +
 src/backend/replication/logical/proto.c       | 140 ++-
 src/backend/replication/logical/worker.c      | 946 +++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c   | 345 ++++++-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/pgstat.h                          |   6 +-
 src/include/replication/logicalproto.h        |  42 +-
 src/include/replication/walreceiver.h         |   1 +
 src/test/subscription/t/009_stream_simple.pl  |  86 ++
 src/test/subscription/t/010_stream_subxact.pl | 102 ++
 src/test/subscription/t/011_stream_ddl.pl     |  95 ++
 .../t/012_stream_subxact_abort.pl             |  82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |  84 ++
 18 files changed, 1963 insertions(+), 45 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index c24ace14d1..d8de56c928 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 5bbc165f70..c25b7c5962 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731115..f28482f0f4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026187..9065a1be1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
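
(For reference, the new option is exercised through the usual subscription
DDL, e.g. ALTER SUBSCRIPTION mysub SET (streaming = on) on a hypothetical
subscription "mysub". When the option is not given at CREATE SUBSCRIPTION
time, substream defaults to false, so existing subscriptions keep the
current non-streamed behavior.)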
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68671..479e3cadf9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9bb6..5257ab0394 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
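
To illustrate, with streaming enabled on the subscription the walreceiver
ends up issuing something like this (slot and publication names are
placeholders, and the version assumes this patch bumps
LOGICALREP_PROTO_VERSION_NUM to 2):

  START_REPLICATION SLOT "mysub" LOGICAL 0/0 (proto_version '2', streaming 'on', publication_names '"mypub"')

When streaming is disabled on the subscription, the option is simply
omitted, so the command is unchanged and older publishers are unaffected.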
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd171..83d0642cf3 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
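
To show how these messages fit together on the wire, a large transaction
with toplevel xid 500 might be sent as the following sequence (schematic;
the letters are the action bytes used in the functions above):

  S (xid=500, first_segment=1)  R Y I I U ...  E
  S (xid=500, first_segment=0)  I D ...        E
  c (xid=500, flags=0, commit_lsn, end_lsn, commit_time)

That is, one or more stream start/stop blocks, each carrying ordinary
change messages (which now embed the subxact XID right after the action
byte), terminated by either a stream commit ('c') or a stream abort ('A').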
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f90a896fc3..f0c3278cd9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions also
+ * has to deal with aborts of both the toplevel transaction and its
+ * subtransactions. This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files exceeding the OS file size limit,
+ * (b) it provides automatic cleanup on error, and (c) it allows the files to
+ * survive across local transactions, so they can be opened at stream start
+ * and closed at stream stop.  We use the SharedFileSet infrastructure because
+ * without it the files would be deleted as soon as they are closed, and
+ * keeping the stream files open across start/stop would consume a lot of
+ * memory (more than 8kB per file).  Moreover, without SharedFileSet we would
+ * need to invent a new way of passing filenames to the BufFile APIs, so that
+ * the desired file could be reopened across multiple stream-open calls for
+ * the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, and along with it create the streaming file and store the
+ * fileset handle.  The subxact file is created iff there is any subxact info
+ * under this xid.  This entry is used on subsequent streams for the xid to
+ * get the corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table storing the streaming xid information along with the shared
+ * filesets for the stream and subxact files.  On every stream start we need
+ * to open the xid's files, and for that we need the shared fileset handles,
+ * so storing them in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the currently open streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because apply_handle_stream_commit calls apply_dispatch */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -542,16 +682,322 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * The ORIGIN message can only come inside a remote transaction or a
+	 * streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * at stream stop.  We need the transaction for handling the BufFile, used
+	 * for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, read the existing subxact info */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty subtransaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -565,6 +1011,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1029,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1068,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1186,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1331,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1704,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1845,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1493,6 +1973,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  It is reset at each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1597,7 +2085,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1909,6 +2397,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option has changed. The launcher will start a new
+	 * worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1941,6 +2443,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have the entry for the toplevel transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not exist yet, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * The shared fileset must survive multiple stream start/stop calls,
+		 * so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * that memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
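+/*
+ * For illustration, the subxact file written above has a trivial layout
+ * (fields in native format, exactly as passed to BufFileWrite):
+ *
+ *	uint32		nsubxacts
+ *	SubXactInfo	subxacts[nsubxacts]		{xid, fileno, offset}
+ *
+ * i.e. for each subxact the position in the changes file where its first
+ * change starts, which is the position the abort handling truncates to.
+ */
+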
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If there is no subxact fileset, it means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context.  We need
+	 * this information for the whole stream, so that we can add new
+	 * subtransaction info to it.  At stream stop we flush this information
+	 * to the subxact file and reset the logical streaming context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're adding a change for the same subxact as in the
+	 * previous call, so make it cheap to detect that case and ignore it
+	 * (this change necessarily comes later than the subxact's first change).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER | HASH_FIND,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context, so that
+	 * they stay open until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * The shared fileset must survive multiple stream start/stop calls,
+		 * so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length field itself), action code (identifying the message type) and
+ * message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
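+/*
+ * For illustration, each record in the changes file therefore has the
+ * following layout (as written by stream_write_change above):
+ *
+ *	int		len		size of action byte plus payload
+ *	char	action		message type ('I', 'U', 'D', ...)
+ *	char	data[len - 1]	the message, minus the subxact XID
+ *
+ * apply_handle_stream_commit reads records back in this format and feeds
+ * them to apply_dispatch.
+ */
+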
+/*
+ * Clean up the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3041,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
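
To make the abort handling above concrete: suppose streamed transaction 500
spooled changes for subxacts 501 and 502, so the subxact file holds
(hypothetical offsets, just to show the mechanism):

  subxacts[0] = {xid = 501, fileno = 0, offset = 0}
  subxacts[1] = {xid = 502, fileno = 0, offset = 8192}

A STREAM ABORT for (500, 502) finds 502 by scanning from the tail,
truncates the changes file back to (fileno 0, offset 8192), rewrites the
subxact file with nsubxacts = 1, and commits. An abort with xid == subxid
(i.e. of the toplevel transaction) instead just deletes both files.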
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3118..8785d87c35 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this, we
+ * maintain a list of xids (streamed_txns) for which we have already sent the
+ * schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * only when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -232,6 +315,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -290,9 +378,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order that
+	 * we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +428,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +470,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +491,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +523,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +543,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +567,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +587,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +612,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +644,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -587,6 +724,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of a transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -623,6 +867,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * Check whether the schema was already sent in the given streamed
+ * transaction.  We expect a relatively small number of streamed
+ * transactions, so a simple linear search of the list is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid to the rel sync entry's list of streamed transactions for
+ * which we have already sent the schema of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -753,11 +1029,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -793,7 +1102,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d42d8..617b9094d4 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1edf..0dfbac46b4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561be9..89158ed46f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c75dcebea0..56517a9147 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction containing subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0
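
The new stream messages bracket each chunk of a streamed transaction:
changes are sent between a stream-start and a stream-stop message, and the
transaction is finished with a stream-commit (or discarded with a
stream-abort).  An illustrative sketch of the resulting flow, based on the
logicalrep_write_stream_* signatures above (layout simplified, not the
actual wire format):

    stream_start(xid = 500, first_segment = true)
        ... relation/insert/update/delete messages, each tagged with xid 500 ...
    stream_stop()
    ... more stream_start/stream_stop segments as the memory limit is hit ...
    stream_commit(xid = 500)        -- or stream_abort(xid = 500, subxid)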

v33/v33-0008-Extend-the-BufFile-interface-for-the-streaming-o.patch
From d3f1c960278ab6f3e27c60fd0c4038c8634637f2 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v33 08/12] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Add a BufFileTruncate interface to allow files to be truncated up to a
particular offset.  Extend the BufFileSeek API to support the SEEK_END
case.  Add an option to provide a mode while opening shared BufFiles,
instead of always opening in read-only mode.
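
For illustration, a rough sketch of how a single-backend caller might use
the extended interface.  This is an assumption-laden example, not patch
code: "fileset" is assumed to have been initialized earlier with
SharedFileSetInit(fileset, NULL), and data/len/saved_fileno/saved_offset
are hypothetical variables remembered by the caller.

    BufFile    *f;

    /* Create a named file in the fileset and write some data. */
    f = BufFileCreateShared(fileset, "xid-500-changes");
    BufFileWrite(f, data, len);
    BufFileClose(f);

    /* Reopen the same file read-write in a later transaction. */
    f = BufFileOpenShared(fileset, "xid-500-changes", O_RDWR);

    /* Position at the end of the file to append further changes. */
    if (BufFileSeek(f, 0, 0, SEEK_END) != 0)
        elog(ERROR, "could not seek to end of temporary file");

    /* On subtransaction abort, discard changes after a saved position. */
    BufFileTruncateShared(f, saved_fileno, saved_offset);
    BufFileClose(f);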
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++---
 src/backend/storage/file/fd.c             |  9 +--
 src/backend/storage/file/sharedfileset.c  | 98 +++++++++++++++++++++--
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2da2..6c97f68671 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349b69..c08ff4fd21 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as
+ * a member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY);
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,21 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the file size of the last file to get the last offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files up to the fileno which we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Except the fileno, we can directly delete other files.  If the
+		 * offset is 0 then we can delete the fileno file as well unless it is
+		 * the first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420efb2..f376a97ed6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594756..9a3dc102f5 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but the files need to be opened and closed multiple times
+ * and also the underlying files need to survive across transactions.  For
+ * such cases, dsm segment 'seg' should be passed as NULL.  We remove such
+ * files on proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering the
+			 * fileset clean up.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -222,6 +254,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 		SharedFileSetDeleteAll(fileset);
 }
 
+/*
+ * Callback function that will be invoked on process exit.  This will
+ * process the list of all the registered sharedfilesets and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is following the dsm-based cleanup, then we don't
+	 * maintain the filesetlist, so there is nothing to unregister.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
 /*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59c50..788815cdab 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a4303b..b83fb50dac 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..807a9c1edf 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752bab0d..fc34c49522 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d7df..e209f047e8 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf077e5..d5edb600af 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
2.23.0

v33/v33-0004-Gracefully-handle-concurrent-aborts-of-transacti.patch
From 087c267fb7f64791b670a6df2d60b3257e1f3f21 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:49:40 +0530
Subject: [PATCH v33 04/12] Gracefully handle concurrent aborts of transactions
 being decoded.

Concurrent aborts are not an issue when decoding committed transactions,
and we never decode transactions that abort before the decoding starts.

But for an upcoming patch that allows decoding of in-progress
transactions, this may cause failures when the output plugin consults
catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction.  On receipt of such a
sqlerrcode, the decoding logic aborts the ongoing decoding and returns
gracefully.
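
For callers of the decoding machinery, the concurrent abort thus surfaces
as an ordinary error with that sqlerrcode, which can be intercepted.  A
minimal sketch (not the patch's exact code; "ecxt" is assumed to be the
memory context saved before entering the TRY block):

    PG_TRY();
    {
        /* decode changes; output plugin callbacks may scan catalogs */
    }
    PG_CATCH();
    {
        ErrorData  *errdata;

        MemoryContextSwitchTo(ecxt);
        errdata = CopyErrorData();

        if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
        {
            /* Concurrent abort: stop decoding this transaction cleanly. */
            FlushErrorState();
            FreeErrorData(errdata);
        }
        else
            PG_RE_THROW();
    }
    PG_END_TRY();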

Author: Dilip Kumar, Nikhil Sontakke, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/logicaldecoding.sgml         |  9 ++--
 src/backend/access/heap/heapam.c          | 10 +++++
 src/backend/access/index/genam.c          | 53 ++++++++++++++++++++++
 src/backend/access/table/tableam.c        |  8 ++++
 src/backend/access/transam/xact.c         | 19 ++++++++
 src/backend/replication/logical/logical.c | 10 +++++
 src/include/access/tableam.h              | 55 +++++++++++++++++++++++
 src/include/access/xact.h                 |  4 ++
 src/include/replication/logical.h         |  1 +
 9 files changed, 166 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 18116c8f3c..98b47b011f 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7bd45703aa..b53f99a5c5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at the
+	 * tableam API level, but this function is called from many places, so we
+	 * need to ensure it here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..9d9a70a354 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set, set a flag to indicate that a system table
+	 * scan is in progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive is aborted.  We can't directly use
+ * TransactionIdDidAbort, because after a crash such a transaction might not be
+ * marked as aborted.  See detailed comments in xact.c where the variable
+ * is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29559..a61e279d68 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -234,6 +234,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 26e3c4dc4e..1b92d6f603 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -82,6 +82,19 @@ bool		XactDeferrable;
 
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching a tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs because we are checking for the
+ * concurrent aborts only in systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
 /*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
@@ -2679,6 +2692,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4981,6 +4997,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6ee59bd38b..8deff89a2b 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b3d2a6dd31..acb6c38648 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1712,6 +1737,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1729,6 +1762,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1747,6 +1788,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1763,6 +1811,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index ac3f5e3b60..5f767eb0b9 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef31825d..b0fae9808b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
-- 
2.23.0

v33/v33-0010-Enable-streaming-for-all-subscription-TAP-tests.patch

From c57060c8cc5980bba8f5678b1e98f99f2fed3643 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v33 10/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318fc7c..6f7bedc130 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f871a..4ba80869b9 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

v33/v33-0012-Add-streaming-option-in-pg_dump.patch

From 71887c42aaa54f4ee3fbf90d87a7c6ac26c0333f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v33 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e222e68437..da6c2f8d1e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb3a9..af64270c55 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
2.23.0

#435Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#431)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 13, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think you can refer to the commit message as well for that: "We however
must explicitly disable streaming replication during replication slot
creation, even if the plugin supports it. We don't need to replicate
the changes accumulated during this phase, and moreover, we don't have
a replication connection open so we don't have where to send the data
anyway.". I don't think this is a good way to hack the streaming flag,
because for SQL APIs we don't have a good reason to disable streaming
in this way. I guess if we had a condition related to reaching a
CONSISTENT snapshot during streaming, we wouldn't need to hack the
streaming flag like this. Once we reach the CONSISTENT snapshot state,
we come out of the replication-slot-creation phase (see how we use
DecodingContextReady to achieve that). So, I feel we should remove the
setting of ctx->streaming to false and add a CONSISTENT snapshot check
during streaming, unless you have a reason for not doing so.

I was worried that streaming on/off is sent by the subscriber on START
REPLICATION, not on CREATE REPLICATION SLOT, so if we keep streaming on
during slot creation it may not be right.

Then, how is that used on the publisher side? AFAICS, streaming is
enabled based on whether streaming callbacks are provided, and we do
that in the 0003-Extend-the-logical-decoding-output-plugin-API-wi patch.

Basically, we first enable it based on whether we have the callbacks or
not, but later, once we get the START REPLICATION command from the
subscriber, we set it to false if streaming is not enabled on the
subscriber side. You can refer to the below code in patch 0007.

pgoutput_startup
{
parse_output_parameters(ctx->output_plugin_options,
&data->protocol_version,
- &data->publication_names);
+ &data->publication_names,
+ &enable_streaming);
/* Check if we support requested protocol */
if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("publication_names parameter missing")));
+ /*
+ * Decide whether to enable streaming. It is disabled by default, in
+ * which case we just update the flag in decoding context. Otherwise
+ * we only allow it with sufficient version of the protocol, and when
+ * the output plugin supports it.
+ */
+ if (!enable_streaming)
+ ctx->streaming = false;
}

Okay, in that case, we can do both enabling and disabling of streaming
in this function itself, rather than allowing the caller to modify it
later. I suggest we can similarly enable/disable it for the SQL API in
pg_decode_startup via output_plugin_options. This way it will look
consistent for both the SQL APIs and command-based replication. If we
can do so, then adding an Assert for a consistent snapshot while
performing streaming should probably be okay.

Done this way in the latest patch set.
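
For illustration, a minimal sketch of how the SQL-API side can make the
same decision in pg_decode_startup (the option name "stream-changes" and
the variable names here are assumptions for this sketch, not necessarily
what the patch uses):

/* sketch: in contrib/test_decoding/test_decoding.c; parse_bool() comes
 * from utils/builtins.h */
static void
pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
				  bool is_init)
{
	ListCell   *option;
	bool		enable_streaming = false;

	/* ... the plugin's other startup work is elided here ... */

	foreach(option, ctx->output_plugin_options)
	{
		DefElem    *elem = lfirst(option);

		/* only the streaming-related option is shown in this sketch */
		if (strcmp(elem->defname, "stream-changes") == 0)
		{
			if (elem->arg == NULL)
				enable_streaming = true;
			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
				ereport(ERROR,
						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
								strVal(elem->arg), elem->defname)));
		}
	}

	/* decide here in the startup callback, for SQL APIs and walsender alike */
	ctx->streaming &= enable_streaming;
}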

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#436Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#425)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 13, 2020 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

9.
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
{
..
+ ReorderBufferToastReset(rb, txn);
+ if (specinsert != NULL)
+ ReorderBufferReturnChange(rb, specinsert);
..
}

Why do we need to do these here when we wouldn't have been done for
any exception other than ERRCODE_TRANSACTION_ROLLBACK?

Because we are handling the "ERRCODE_TRANSACTION_ROLLBACK" exception
gracefully and continuing with further decoding, we need to return this
change.

Okay, then I suggest we should do these before calling stream_stop, and
also move ReorderBufferResetTXN to after the stream_stop call, to follow
a pattern similar to the try block, unless there is a reason for not
doing so. Also, it would be good if we reset specinsert to NULL after
returning the change, as we do at other places.

Okay
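
Roughly, the ordering being suggested looks like this (a sketch of the
catch block in the reorderbuffer code; the memory-context handling and
the exact callback invocations are illustrative, not lifted from the
patch):

	MemoryContext ccxt = CurrentMemoryContext;

	PG_TRY();
	{
		/* ... decode and stream the changes; stream_stop on success ... */
	}
	PG_CATCH();
	{
		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
		ErrorData  *errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/* clean up partially-assembled state first ... */
			ReorderBufferToastReset(rb, txn);
			if (specinsert != NULL)
			{
				ReorderBufferReturnChange(rb, specinsert);
				specinsert = NULL;	/* as done at the other call sites */
			}

			/* ... close the open stream block, mirroring the try block ... */
			rb->stream_stop(rb, txn);

			/* ... and only then discard the streamed transaction's state */
			ReorderBufferResetTXN(rb, txn);

			FlushErrorState();
			FreeErrorData(errdata);
		}
		else
		{
			MemoryContextSwitchTo(ecxt);
			PG_RE_THROW();
		}
	}
	PG_END_TRY();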

10. I have got the below failure once. I have not investigated this
in detail as the patch is still under progress. See, if you have any
idea?
# Failed test 'check extra columns contain local defaults'
# at t/013_stream_subxact_ddl_abort.pl line 81.
# got: '2|0'
# expected: '1000|500'
# Looks like you failed 1 test of 2.
make[2]: *** [check] Error 1
make[1]: *** [check-subscription-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

I also got the failure once, and after that it did not reproduce. I
have executed it multiple times, but it did not reproduce again. Are
you able to reproduce it consistently?

No, I am also not able to reproduce it consistently, but I think this
can fail if a subscriber sends the replay_location before actually
replaying the changes. First, I thought that the extra send_feedback we
have in apply_handle_stream_commit might have caused this, but I guess
that can't happen because we need the commit-time location for that,
and we store it at the end of apply_handle_stream_commit after applying
all messages. I am not sure what is going on here. I think we somehow
need to reproduce this, or some variant of this test, consistently to
find the root cause.

And I think it appeared for the first time for me, so it may have been
induced by changes in the last few versions. I have noticed that almost
50% of the time I am able to reproduce it after a clean build, so I can
trace back the version in which it started appearing; that way it will
be easy to narrow down.

I think the reason for the failure is that we are not setting
remote_final_lsn in streaming mode. I added multiple log statements,
and from the logs it appeared that some of the logical WAL did not get
replayed due to the below check in should_apply_changes_for_rel.
return (rel->state == SUBREL_STATE_READY || (rel->state ==
SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));

I still need to do a detailed analysis of why this fails in some cases.
Basically, most of the time rel->state is SUBREL_STATE_READY, so this
check passes, but whenever the state is SUBREL_STATE_SYNCDONE it fails
because we never update remote_final_lsn. I will try setting this value
in apply_handle_stream_commit and see whether it ever fails or not.

I have verified that after setting remote_final_lsn in
apply_handle_stream_commit, I don't see that regression failure in over
70 runs, whereas without that change it failed 6 times in 50 runs.
Apart from this, I have noticed one more thing related to the same
point: in apply_handle_commit we call process_syncing_tables, whereas
we do not call it in apply_handle_stream_commit.

I have set remote_final_lsn as well as called process_syncing_tables in
apply_handle_stream_commit. Please see the latest patch set, v33.
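
For reference, a sketch of the resulting apply_handle_stream_commit
(mirroring apply_handle_commit; the stream-commit reader function name
and the surrounding details are assumptions, so the real v33 code may
differ):

/* sketch: in src/backend/replication/logical/worker.c */
static void
apply_handle_stream_commit(StringInfo s)
{
	TransactionId xid;
	LogicalRepCommitData commit_data;

	xid = logicalrep_read_stream_commit(s, &commit_data);

	/*
	 * Set remote_final_lsn before replaying the spooled changes, so that
	 * should_apply_changes_for_rel() sees the commit-time LSN; otherwise
	 * relations in SUBREL_STATE_SYNCDONE can silently skip the streamed
	 * changes, as observed above.
	 */
	remote_final_lsn = commit_data.commit_lsn;

	/* ... replay the changes spooled for this xid here ... */

	/* keep tablesync workers in step, just as apply_handle_commit does */
	process_syncing_tables(commit_data.end_lsn);
}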

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#437Ajin Cherian
itsajin@gmail.com
In reply to: Dilip Kumar (#436)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 15, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Please see the
latest patch set v33.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

I have a minor comment. You've defined a new function,
ReorderBufferStartStreaming(), but the function doesn't actually start
streaming; it is used to find out whether you can start streaming, and
it returns a boolean. Can't you name it accordingly?
Probably ReorderBufferCanStartStreaming(). I understand that it
internally calls the similar-sounding ReorderBufferCanStream(), but I
think that should not matter.

regards,
Ajin Cherian
Fujitsu Australia

#438Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#437)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 15, 2020 at 4:51 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Wed, Jul 15, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Please see the
latest patch set v33.

I have a minor comment. You've defined a new function ReorderBufferStartStreaming() but the function doesn't actually start streaming but is used to find out if you can start streaming and it returns a boolean. Can't you name it accordingly?
Probably ReorderBufferCanStartStreaming(). I understand that it internally calls ReorderBufferCanStream() which is similar sounding but I think that should not matter.

+1. I am actually editing some of the patches and I have already
named it as you are suggesting.
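
For reference, the renamed helper might look roughly like this (a
sketch only; the body shown here simply delegates, and the actual patch
may fold in more conditions):

static bool
ReorderBufferCanStartStreaming(ReorderBuffer *rb)
{
	/*
	 * A predicate, not an action: report whether we may begin streaming
	 * now.  ReorderBufferCanStream() says whether streaming is possible
	 * at all (callbacks present and enabled); any additional gating,
	 * such as having reached a consistent snapshot, would be layered on
	 * top of it here.
	 */
	return ReorderBufferCanStream(rb);
}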

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#439Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#434)
2 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 15, 2020 at 9:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have reviewed your changes and they look good to me; please find
the latest version of the patch set.

I have done an additional round of review and below are the changes I
made in the attached patch-set.
1. Changed comments in 0002.
2. In 0005, apart from changing a few comments and a function name, I
have changed the below code:
+ if (ReorderBufferCanStream(rb) &&
+ !SnapBuildXactNeedsSkip(builder, ctx->reader->ReadRecPtr))
Here, I think it is better to compare against EndRecPtr (see the sketch
just after this list).  I feel that in the boundary case the next
record could be the same as start_decoding_at, so why avoid streaming
in that case?
3. In 0006, made the below changes:
    a. Removed the function ReorderBufferFreeChange and added a new
parameter in ReorderBufferReturnChange to achieve the same purpose.
    b. Changed quite a few comments and function names, added
additional Asserts, and made a few other cosmetic changes.
4. In 0007, made the below changes:
    a. Removed the unnecessary change in .gitignore.
    b. Changed the newly added option name to "stream-change".
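
For item 2, the revised check would presumably read (sketch):

+ if (ReorderBufferCanStream(rb) &&
+ !SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))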

Apart from the above, I have merged patches 0004, 0005, 0006 and 0007,
as those seem like one piece of functionality to me. For ease of
review, the patch set that contains the merged patches is attached
separately as v34-combined.

Let me know what you think of the changes.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v34.tar
v34-0001-Immediately-WAL-log-subtransaction-and-top-level.patch

From d75f46a21512865d24303a0dd0ffe3d8efbd51d1 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v34 01/12] Immediately WAL-log subtransaction and top-level
 XID association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features that require
incremental decoding.

So we also write the assignment info into WAL immediately, as part of
the next WAL record (to minimize overhead), and only when
wal_level=logical. We cannot remove the existing XLOG_XACT_ASSIGNMENT
WAL record, as that is required to avoid overflow in the hot standby
snapshot.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 44 ++++++++++++++--------------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b3ee7fa..bd4c3cf 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5120,6 +5122,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6022,3 +6025,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL log the top-level XID for
+ * operation in a subtransaction.  We require that for logical decoding, see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f..c526bb1 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4..a757bac 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1197,6 +1197,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1235,6 +1236,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..0c0c371 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index db19187..aef8555 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5b14334..d8391aa 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b0f2a6e..b976882 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -304,6 +306,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

v34-0002-WAL-Log-invalidations-at-command-end-with-wal_le.patch

From 3450ad6fdace6b5d967a451ba471c4f3e3368011 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v34 02/12] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end use a new xlog record type,
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in the top-level
transaction, and then executed during replay.  This obviates the need
to decode the invalidations as part of a commit record.

LogStandbyInvalidations accumulated all the invalidations in memory and
wrote them only once, at commit time, which may reduce the performance
impact by amortizing the overhead and deduplicating the invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 10 +++++
 src/backend/access/transam/xact.c               | 17 ++++++++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 52 ++++++++++++++++++----
 src/backend/utils/cache/inval.c                 | 55 +++++++++++++++++++++++
 src/include/access/xact.h                       | 13 +++++-
 src/include/replication/reorderbuffer.h         |  3 ++
 src/include/utils/inval.h                       |  2 +
 8 files changed, 177 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..68aa994 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd4c3cf..d4f7c29 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1224,6 +1224,16 @@ RecordTransactionCommit(void)
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
 
+	/*
+	 * Log pending invalidations for logical decoding of in-progress
+	 * transactions.  Normally for DDLs, we log this at each command end,
+	 * however, for certain cases where we directly update the system table
+	 * without a transaction block, the invalidations are not logged till this
+	 * time.
+	 */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
 	nchildren = xactGetCommittedChildren(&children);
@@ -6022,6 +6032,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..7153eba 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions,
+				 * otherwise, accumulate them so that they can be processed at
+				 * the commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5251932..1661190 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -856,6 +856,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2201,7 +2204,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases where we skip processing the
+ * transaction (see ReorderBufferForget), we need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2212,17 +2219,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2250,6 +2275,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark top-level transaction as having catalog changes too if one of its
+	 * children has so that the ReorderBufferBuildTupleCidHash can
+	 * conveniently check just top-level transaction and decide whether to
+	 * build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..edd9077 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *      CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -1094,6 +1098,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1510,49 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end or at commit time if any invalidations
+ * are pending.
+ */
+void
+LogLogicalInvalidations()
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..ac3f5e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 019bd38..1055e99 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index bc5081c..463888c 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -61,4 +61,6 @@ extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
+
+extern void LogLogicalInvalidations(void);
 #endif							/* INVAL_H */
-- 
1.8.3.1
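
A sketch of the decode-side handling of the new record type, for reference
(assuming the usual locals of DecodeXactOp in decode.c -- ctx, buf, and
r = buf->record; ReorderBufferAddInvalidations is the existing
reorderbuffer API):

	case XLOG_XACT_INVALIDATIONS:
		{
			TransactionId xid;
			xl_xact_invalidations *invals;

			xid = XLogRecGetXid(r);
			invals = (xl_xact_invalidations *) XLogRecGetData(r);

			/* queue the messages to be executed when replaying changes */
			if (TransactionIdIsValid(xid))
				ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
											  invals->nmsgs, invals->msgs);

			break;
		}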

v34-0003-Extend-the-logical-decoding-output-plugin-API-wi.patch

From 0f114f32bf021759087a77a2fadf3c528dfa8066 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v34 03/12] Extend the logical decoding output plugin API with
 stream methods.

This adds seven callbacks to the output plugin API, adding support for
streaming changes of large in-progress transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk of
changes streamed for a particular toplevel transaction.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
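
As a sketch of what a third-party output plugin needs in order to opt in
(the my_stream_* names are hypothetical functions matching the callback
signatures added to output_plugin.h below):

	void
	_PG_output_plugin_init(OutputPluginCallbacks *cb)
	{
		/* ... assign the existing callbacks (begin_cb, change_cb, ...) ... */

		/* streaming of large in-progress transactions */
		cb->stream_start_cb = my_stream_start;
		cb->stream_stop_cb = my_stream_stop;
		cb->stream_abort_cb = my_stream_abort;
		cb->stream_commit_cb = my_stream_commit;
		cb->stream_change_cb = my_stream_change;

		/* optional */
		cb->stream_message_cb = my_stream_message;
		cb->stream_truncate_cb = my_stream_truncate;
	}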
 contrib/test_decoding/test_decoding.c     | 123 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 +++++++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  59 +++++
 6 files changed, 825 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..4a71826 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,97 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93c..18116c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to the spill-to-disk behavior, streaming is triggered when the
+    total amount of changes decoded from the WAL (for all in-progress
+    transactions) exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting.  At that point, the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.  However, in some cases
+    we still have to spill to disk even if streaming is enabled, because we may
+    cross the memory limit before having a complete tuple to stream (e.g. having
+    decoded the TOAST table insert but not yet the main table insert).
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be..6ee59bd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. We nevertheless treat streaming as enabled when
+	 * at least one of the methods is defined, so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475..deef318 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..2d9aa11 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1055e99..42bc817 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -387,6 +435,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
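
With the test_decoding callbacks above, the streamed output for a large
in-progress transaction looks roughly like this (xids illustrative,
matching the appendStringInfo formats in the patch):

	opening a streamed block for transaction TXN 508
	streaming change for TXN 508
	streaming change for TXN 508
	...
	closing a streamed block for transaction TXN 508
	opening a streamed block for transaction TXN 508
	streaming change for TXN 508
	...
	closing a streamed block for transaction TXN 508
	committing streamed transaction TXN 508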

v34-0004-Gracefully-handle-concurrent-aborts-of-transacti.patch

From 7b5b035136391892157e4378b8093f5976e8b3e6 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:49:40 +0530
Subject: [PATCH v34 04/12] Gracefully handle concurrent aborts of transactions
 being decoded.

When decoding committed transactions this is not an issue, and we never
decode transactions that abort before the decoding starts.

But for an upcoming patch that allows decoding of in-progress
transactions, this may cause failures when the output plugin consults
catalogs (both system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. The decoding logic on the
receipt of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.

Author: Dilip Kumar, Nikhil Sontakke, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
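
A sketch of how the decoding logic is expected to consume this sqlerrcode
(the streaming patch later in this series does this around output plugin
invocations; saved_cxt stands for a memory context saved by the caller):

	PG_TRY();
	{
		/* invoke output plugin callbacks; catalog scans may throw */
	}
	PG_CATCH();
	{
		ErrorData  *errdata;

		/* switch back to a sane memory context before copying the error */
		MemoryContextSwitchTo(saved_cxt);
		errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/* concurrent abort: clean up and stop decoding this xact */
			FlushErrorState();
			FreeErrorData(errdata);
		}
		else
			PG_RE_THROW();
	}
	PG_END_TRY();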
 doc/src/sgml/logicaldecoding.sgml         |  9 +++--
 src/backend/access/heap/heapam.c          | 10 ++++++
 src/backend/access/index/genam.c          | 53 +++++++++++++++++++++++++++++
 src/backend/access/table/tableam.c        |  8 +++++
 src/backend/access/transam/xact.c         | 19 +++++++++++
 src/backend/replication/logical/logical.c | 10 ++++++
 src/include/access/tableam.h              | 55 +++++++++++++++++++++++++++++++
 src/include/access/xact.h                 |  4 +++
 src/include/replication/logical.h         |  1 +
 9 files changed, 166 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 18116c8..98b47b0 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4c..b081fb1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at tableam
+	 * level API but this is called from many places so we need to ensure it
+	 * here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has been aborted.  We can't directly use
+ * TransactionIdDidAbort, because after a crash such a transaction might
+ * not have been marked as aborted.  See detailed comments in xact.c where
+ * the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29..a61e279 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -235,6 +235,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..99722ee 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used only in logical decoding.  Such a
+ * transaction can get aborted while decoding is still in progress, in
+ * which case we skip decoding that particular transaction.  To detect
+ * that, we check whether CheckXidAlive has aborted after fetching each
+ * tuple from system tables.  We also ensure that during logical decoding
+ * we never directly access the tableam or heap APIs, because we check
+ * for concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6ee59bd..8deff89 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0d28f01..b188427 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index ac3f5e3..5f767eb 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
-- 
1.8.3.1

v34-0005-Implement-streaming-mode-in-ReorderBuffer.patch

From 365e2a5be9dcc9c5db7179eeae86cc90f6aabdf0 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@Laptop309pnin.local>
Date: Thu, 9 Jul 2020 14:13:02 +0530
Subject: [PATCH v34 05/12] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete TOAST chunk or speculative insert, we spill to disk because we
cannot generate a complete tuple to stream.  As soon as we get the
complete tuple, we stream the transaction, including the serialized
changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

We have a ReorderBufferTXN pointer in each ReorderBufferChange, by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
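
The shape of the memory-limit decision becomes roughly the following
(a sketch; ReorderBufferLargestTopTXN is assumed here by analogy with the
existing ReorderBufferLargestTXN, while ReorderBufferCanStartStreaming and
ReorderBufferStreamTXN are declared in this patch):

	while (rb->size >= logical_decoding_work_mem * 1024L)
	{
		ReorderBufferTXN *txn;

		if (ReorderBufferCanStartStreaming(rb))
		{
			/* pick the largest toplevel transaction and stream it */
			txn = ReorderBufferLargestTopTXN(rb);
			ReorderBufferStreamTXN(rb, txn);
		}
		else
		{
			/* pick the largest (sub)transaction and spill it to disk */
			txn = ReorderBufferLargestTXN(rb);
			ReorderBufferSerializeTXN(rb, txn);
		}
	}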
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/replication/logical/reorderbuffer.c | 794 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  26 +
 3 files changed, 786 insertions(+), 76 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1661190..8450dc1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -236,6 +237,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +246,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +379,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -764,6 +779,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1065,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1082,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1310,6 +1363,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1397,84 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked as
+	 * streamed always, even if it does not contain any changes (that is, when
+	 * all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1625,177 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that
+ * the (sub)transaction might get aborted concurrently.  In such a case, if
+ * the (sub)transaction has a catalog update, we might decode the tuple using
+ * the wrong catalog version.  So, to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the current (sub)transaction to which the
+ * change belongs.  During a catalog scan we can then check that xid's status,
+ * and if it is aborted we report a specific error, so that we can stop
+ * streaming the current transaction and discard the already streamed changes
+ * on such an error.  We might have already streamed some changes for the
+ * aborted (sub)transaction, but that is fine: when we decode the abort we
+ * will stream an abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid aborted; that will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream, so that we can reuse them while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN such that it
+ * can be used to stream the remaining data of the transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1818,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1834,33 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+				SetupCheckXidLive(change->txn->xid);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +1937,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1689,7 +1978,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1747,7 +2036,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2048,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2079,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1845,14 +2133,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, send the last message for this set of
+		 * changes depending upon streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2179,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2218,105 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1931,6 +2343,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could load
+		 * the cache as per the current transaction's view (consider DDLs that
+		 * happened in this transaction). We don't want the decoding of future
+		 * transactions to use those cache entries, so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2428,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2135,8 +2567,17 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2585,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
 
 	Assert(change->txn);
 
@@ -2155,19 +2597,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* if subxact, and streaming supported, use the toplevel instead */
+	if (txn->toptxn && ReorderBufferCanStream(rb))
+		txn = txn->toptxn;
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2196,6 +2647,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2388,6 +2840,38 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * account for subtransactions when streaming, so it's always 0). But we can
+ * simply iterate over the limited number of toplevel transactions.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	ReorderBufferTXN *largest = NULL;
+
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/* if the current transaction is larger, remember it */
+		if ((!largest) || (txn->size > largest->size))
+			largest = txn;
+	}
+
+	Assert(largest);
+	Assert(largest->size > 0);
+	Assert(largest->size <= rb->size);
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +2903,38 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if allowed. Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb))
+		{
+			/*
+			 * Pick the largest toplevel transaction and evict it from memory
+			 * by streaming the already decoded part.
+			 */
+			txn = ReorderBufferLargestTopTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2713,6 +3224,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * We can't start streaming immediately even if streaming is enabled,
+	 * because we may have previously decoded this transaction and now be
+	 * just restarting.
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all subtransactions to the
+	 * snapshot's xip array via SnapBuildCommittedTxn, we can't do that here;
+	 * instead, we add them to the subxip array via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in subtransactions decoded till
+	 * now to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because we might have
+		 * gotten some new subtransactions after the last streaming run. We
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -3812,6 +4453,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..ae1759f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +182,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ *
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +268,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
-- 
1.8.3.1
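
A rough sketch of the concurrent-abort protocol above (invented names,
not code from the patch): catalog access checks whether the xact being
streamed is still alive, and the streaming loop treats that specific
error as a signal to stop the stream gracefully rather than to fail
decoding, mimicking the CheckXidAlive / ERRCODE_TRANSACTION_ROLLBACK
handshake with setjmp/longjmp standing in for PG_TRY/PG_CATCH.

    #include <setjmp.h>
    #include <stdbool.h>
    #include <stdio.h>

    static jmp_buf catch_buf;           /* plays the role of PG_TRY/PG_CATCH */
    static unsigned check_xid_alive;    /* plays the role of CheckXidAlive */

    static bool
    xid_aborted(unsigned xid)
    {
        return xid == 42;               /* stand-in for a real clog lookup */
    }

    static void
    catalog_access(void)
    {
        /* ~ a catalog scan erroring out on a concurrently aborted xact */
        if (check_xid_alive != 0 && xid_aborted(check_xid_alive))
            longjmp(catch_buf, 1);
    }

    static void
    stream_txn(unsigned xid)
    {
        check_xid_alive = xid;
        if (setjmp(catch_buf) == 0)
        {
            catalog_access();           /* decode one change */
            printf("streamed change of xid %u\n", xid);
        }
        else
        {
            /* concurrent abort: discard streamed changes, stop the stream */
            printf("xid %u aborted concurrently: stream_stop sent\n", xid);
        }
        check_xid_alive = 0;
    }

    int
    main(void)
    {
        stream_txn(7);                  /* alive: change is streamed */
        stream_txn(42);                 /* aborted mid-stream: handled */
        return 0;
    }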

v34-0006-Process-Partial-Changes.patch

From 1d4eb675d1005450c1de082789e7328e4f24940e Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 16:56:22 +0530
Subject: [PATCH v34 06/12] Process Partial Changes.

We can stream only complete changes, so if we have a partial change, like
a toast table insert or a speculative insert, we mark such a 'txn' so that
it can't be streamed.  We also ensure that if the changes in such a 'txn'
exceed the logical_decoding_work_mem threshold, we stream them as soon as
we have a complete change.
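
As a toy model of this partial-change tracking (invented names, not code
from the patch): a toast-table insert or a speculative insert marks the
toplevel txn as incomplete, the matching main-table insert/update or
spec-confirm clears the mark, and only txns with no pending partial
change are candidates for streaming.

    #include <stdio.h>

    #define HAS_TOAST_INSERT 0x01
    #define HAS_SPEC_INSERT  0x02

    typedef enum ChangeKind
    {
        CH_TOAST_INSERT,        /* toast chunk: tuple not complete yet */
        CH_INSERT_OR_UPDATE,    /* main-table change: completes the tuple */
        CH_SPEC_INSERT,         /* speculative insert: awaiting confirm */
        CH_SPEC_CONFIRM         /* confirms the speculative insert */
    } ChangeKind;

    static unsigned txn_flags;  /* flags of the toplevel transaction */

    static void
    process_change(ChangeKind kind)
    {
        switch (kind)
        {
            case CH_TOAST_INSERT:
                txn_flags |= HAS_TOAST_INSERT;
                break;
            case CH_INSERT_OR_UPDATE:
                txn_flags &= ~HAS_TOAST_INSERT;
                break;
            case CH_SPEC_INSERT:
                txn_flags |= HAS_SPEC_INSERT;
                break;
            case CH_SPEC_CONFIRM:
                txn_flags &= ~HAS_SPEC_INSERT;
                break;
        }
        printf("streamable now? %s\n", txn_flags == 0 ? "yes" : "no");
    }

    int
    main(void)
    {
        process_change(CH_TOAST_INSERT);     /* no: toast chunk pending */
        process_change(CH_INSERT_OR_UPDATE); /* yes: main insert completes */
        return 0;
    }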
---
 src/backend/access/heap/heapam.c                |   3 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/reorderbuffer.c | 219 +++++++++++++++++++-----
 src/include/access/heapam_xlog.h                |   1 +
 src/include/replication/reorderbuffer.h         |  38 +++-
 5 files changed, 226 insertions(+), 52 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b081fb1..33a4580 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1955,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7153eba..2010d5a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8450dc1..27b4617 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -179,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -431,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -639,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes, so if we have a partial change, like a
+ * toast table insert or a speculative insert, we mark such a 'txn' so that
+ * it can't be streamed.  We also ensure that if the changes in such a 'txn'
+ * exceed the logical_decoding_work_mem threshold, we stream them as soon as
+ * we have a complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast-insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get an insert or update on the
+	 * main table (both update and insert can insert into the toast table).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * Speculative confirm change must be preceded by speculative insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it is serialized before and the changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for doing the streaming of such a transaction as soon as
+	 * we get the complete change for it is that previously it would have
+	 * reached the memory threshold and wouldn't get streamed because of
+	 * incomplete changes.  Delaying such transactions would increase apply
+	 * lag for them.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed once we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes, we detected that this transaction
+	 * was aborted, so there is no point in collecting further changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -660,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -689,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -1201,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1287,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1333,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1350,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1437,7 +1542,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* remove the change from its containing list */
 		dlist_delete(&change->node);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1469,6 +1574,13 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
 	/* also reset the number of entries in the transaction */
 	txn->nentries_mem = 0;
 	txn->nentries = 0;
@@ -1763,7 +1875,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* Return the spec insert change if it is not NULL */
 	if (specinsert != NULL)
 	{
-		ReorderBufferReturnChange(rb, specinsert);
+		ReorderBufferReturnChange(rb, specinsert, true);
 		specinsert = NULL;
 	}
 
@@ -1796,6 +1908,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
 	ReorderBufferChange *volatile specinsert = NULL;
 	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1859,7 +1972,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			/* Set the xid for concurrent abort check. */
 			if (streaming)
-				SetupCheckXidLive(change->txn->xid);
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
 
 			switch (change->action)
 			{
@@ -1974,7 +2090,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -2003,7 +2119,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -2125,7 +2241,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -2236,6 +2352,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
+			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2514,7 +2631,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2563,7 +2680,7 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2586,6 +2703,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 {
 	Size		sz;
 	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2599,9 +2717,14 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 	txn = change->txn;
 
-	/* if subxact, and streaming supported, use the toplevel instead */
-	if (txn->toptxn && ReorderBufferCanStream(rb))
-		txn = txn->toptxn;
+	/* If streaming is supported, also update the total size of the top level. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
 
 	sz = ReorderBufferChangeSize(change);
 
@@ -2609,12 +2732,20 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	{
 		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
 		Assert((rb->size >= sz) && (txn->size >= sz));
 		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
 
 	Assert(txn->size <= rb->size);
@@ -2846,28 +2977,41 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
  * should give us the same transaction (because we don't update memory account
  * for subtransaction with streaming, so it's always 0). But we can simply
  * iterate over the limited number of toplevel transactions.
+ *
+ * Note that we skip transactions that contain incomplete changes.  There is
+ * scope for optimization here: we could pick the largest transaction even if
+ * it has incomplete changes and stream the part that is complete.  But that
+ * would make the code and design quite complex, and might not be worth the
+ * benefit.  If we plan to stream transactions that contain incomplete
+ * changes, we need a way to partially stream/truncate the transaction
+ * changes in memory, and a mechanism to partially truncate the spilled
+ * files.  Additionally, whenever we partially stream a transaction, we need
+ * to remember the last streamed LSN, and next time restore from that segment
+ * and offset in the WAL.  As we stream the changes from the top transaction
+ * and restore them subtransaction-wise, we would even need to remember the
+ * subxact from which we streamed the last change.
  */
 static ReorderBufferTXN *
 ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 {
 	dlist_iter	iter;
+	Size largest_size = 0;
 	ReorderBufferTXN *largest = NULL;
 
+	/* Find the largest top-level transaction. */
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *txn;
 
 		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
 			largest = txn;
+			largest_size = txn->total_size;
+		}
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
-
 	return largest;
 }
 
@@ -2903,20 +3047,15 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by streaming, if allowed. Otherwise, spill to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		if (ReorderBufferCanStartStreaming(rb))
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
 		{
-			/*
-			 * Pick the largest toplevel transaction and evict it from memory
-			 * by streaming the already decoded part.
-			 */
-			txn = ReorderBufferLargestTopTXN(rb);
-
 			/* we know there has to be one, because the size is not zero */
 			Assert(txn && !txn->toptxn);
-			Assert(txn->size > 0);
-			Assert(rb->size >= txn->size);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
 			ReorderBufferStreamTXN(rb, txn);
 		}
@@ -3012,7 +3151,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -3454,7 +3593,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -4163,7 +4302,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ae1759f..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -163,6 +163,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
 #define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -182,6 +184,26 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* Does this transaction have a toast insert without the main table insert? */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * Does this transaction have a speculative insert without the speculative
+ * confirm?
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
 /*
  * Has this transaction been streamed to downstream?
  *
@@ -190,10 +212,6 @@ typedef struct ReorderBufferChange
  * which case we'd have nentries==0 for the toplevel one, which would say
  * nothing about the streaming. So we maintain this flag, but only for the
  * toplevel transaction.)
- *
- * Note: We never do both stream and serialize a transaction (we only spill
- * to disk when streaming is not supported by the plugin), so only one of
- * those two flags may be set at any given time.
  */
 #define rbtxn_is_streamed(txn) \
 ( \
@@ -339,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of the top-level transaction including subtransactions. */
+	Size		total_size;
+
+	/* If we have detected a concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -510,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

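To make the new eviction logic in ReorderBufferCheckMemoryLimit() above easier
to review, here is a simplified sketch of what the loop now does. This is
schematic only: the logical_work_mem GUC comes from part 0001 of the series,
the limit check is abbreviated, and ReorderBufferLargestTXN /
ReorderBufferSerializeTXN are the pre-existing spill-to-disk path.

	/* Schematic version of the eviction loop, not the literal patch code. */
	while (rb->size >= logical_work_mem * 1024L)
	{
		ReorderBufferTXN *txn;

		/*
		 * Prefer streaming: evict the largest toplevel transaction by
		 * streaming its already decoded part, if the plugin allows it and
		 * such a transaction exists.
		 */
		if (ReorderBufferCanStartStreaming(rb) &&
			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
			ReorderBufferStreamTXN(rb, txn);
		else
		{
			/* Otherwise, spill the largest (sub)transaction to disk. */
			txn = ReorderBufferLargestTXN(rb);
			ReorderBufferSerializeTXN(rb, txn);
		}
	}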
v34-0007-Provide-a-new-option-to-get-the-streaming-change.patch

From 74143fd193ba4f86aa070001a87e7fc9c2652b16 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 17:25:47 +0530
Subject: [PATCH v34 07/12] Provide a new option to get the streaming changes.

---
 contrib/test_decoding/Makefile              |  2 +-
 contrib/test_decoding/expected/stream.out   | 40 +++++++++++++++++++++++++++++
 contrib/test_decoding/expected/truncate.out |  6 +++++
 contrib/test_decoding/sql/stream.sql        | 21 +++++++++++++++
 contrib/test_decoding/sql/truncate.sql      |  1 +
 contrib/test_decoding/test_decoding.c       | 13 ++++++++++
 doc/src/sgml/test-decoding.sgml             | 22 ++++++++++++++++
 7 files changed, 104 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..bdf7002
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,40 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+ count 
+-------
+   157
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..24d41b1
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 4a71826..bb3d9f3 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool		enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
-- 
1.8.3.1

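For output plugin authors, the gist of the test_decoding change above
generalizes to the following sketch. The my_decode_startup name and the option
parsing are hypothetical; the sketch assumes (as arranged earlier in the
series) that ctx->streaming has already been initialized to reflect whether
the plugin implements the stream callbacks.

/* Hypothetical plugin startup callback gating streaming on an option. */
static void
my_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
				  bool is_init)
{
	bool		enable_streaming = false;

	/* ... parse plugin options, possibly setting enable_streaming ... */

	/* stream only if both the core code and the user allow it */
	ctx->streaming &= enable_streaming;
}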
v34-0008-Extend-the-BufFile-interface-for-the-streaming-o.patch

From b80ce05f8e7460df09d6aca0f65df55c4d4bdd26 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v34 08/12] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up
to a particular offset.  Extend the BufFileSeek API to support the
SEEK_END case.  Add an option to provide a mode while opening the shared
BufFiles, instead of always opening them in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2..6c97f68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349..c08ff4f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as a
+ * member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the size of the last file to determine the end offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
+			break;
-			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files up to the fileno that we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files other than the fileno can be deleted directly.  If the
+		 * offset is 0 then the fileno file can be deleted as well, unless it
+		 * is the first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * This interface can also be used if the temporary files are used only by a
+ * single backend but need to be opened and closed multiple times and the
+ * underlying files need to survive across transactions.  In such cases, the
+ * dsm segment 'seg' should be passed as NULL.  We remove such files on proc
+ * exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering the
+			 * fileset clean up.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on the process exit.  This will
+ * process the list of all the registered sharedfilesets and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell   *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach(l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is using the dsm-based cleanup, we don't maintain the
+	 * filesetlist, so there is nothing to unregister.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

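Taken together, the intended usage pattern for the extended BufFile interface
is roughly the sketch below. This is a sketch only: spool_example and the
file name are made up, the SharedFileSet is assumed to be allocated in a
long-lived memory context by the caller, and error handling is omitted.

/* Sketch of single-backend, transaction-surviving BufFile usage. */
static void
spool_example(SharedFileSet *fileset, const void *data, size_t len,
			  int fileno, off_t offset)
{
	BufFile    *file;

	SharedFileSetInit(fileset, NULL);	/* NULL seg: cleanup at proc exit */

	file = BufFileCreateShared(fileset, "xid-513-changes");
	BufFileWrite(file, (void *) data, len);
	BufFileClose(file);					/* underlying files survive */

	/* later, possibly in another local transaction */
	file = BufFileOpenShared(fileset, "xid-513-changes", O_RDWR);
	BufFileSeek(file, 0, 0, SEEK_END);	/* append, via new SEEK_END support */
	BufFileWrite(file, (void *) data, len);

	/* discard everything after a remembered (fileno, offset) position */
	BufFileTruncateShared(file, fileno, offset);
	BufFileClose(file);

	/* once the transaction is fully processed */
	BufFileDeleteShared(fileset, "xid-513-changes");
}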
v34-0009-Add-support-for-streaming-to-built-in-replicatio.patch

From 049ea4896b6094cf3159e5094f742bffaac5487b Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 08:57:16 +0530
Subject: [PATCH v34 09/12] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information
(e.g. the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions, by spilling the data to disk and then
replaying it on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it.  We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere to
send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |   4 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  45 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   3 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 946 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 345 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  42 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 18 files changed, 1963 insertions(+), 45 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

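One note on the spool-file format before the diffs: the read loop in
apply_handle_stream_commit() below expects each spooled record to be an int32
length followed by the raw logical replication message, action byte included.
The write side, stream_write_change(), is only declared in the hunks shown
here and defined later in this patch; a sketch consistent with that read loop
looks like this (error checks omitted):

/* Sketch of the spool-record write side, mirroring the read loop below. */
static void
stream_write_change(char action, StringInfo s)
{
	int			len;

	Assert(in_streamed_transaction);
	Assert(stream_fd != NULL);

	/* total on-disk record length: action byte + rest of the message */
	len = (s->len - s->cursor) + 1;

	BufFileWrite(stream_fd, &len, sizeof(len));
	BufFileWrite(stream_fd, &action, sizeof(action));
	BufFileWrite(stream_fd, &s->data[s->cursor], s->len - s->cursor);
}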
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index c24ace1..d8de56c 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 5bbc165..c25b7c5 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731..f28482f 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026..9065a1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68..479e3ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..83d0642 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f90a896..f0c3278 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions also
+ * requires handling aborts of both the toplevel transaction and of individual
+ * subtransactions.  This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size limit,
+ * (b) it provides automatic cleanup on error, and (c) it allows the files to
+ * survive across local transactions and to be opened and closed at stream
+ * start and stop.  We use the SharedFileSet infrastructure because without it
+ * the files would be deleted when closed, and keeping the stream files open
+ * across start/stop of streaming would consume a lot of memory (more than 8K
+ * per file).  Moreover, without SharedFileSet we would need to invent a new
+ * way to pass filenames to the BufFile APIs, so that we could reopen the
+ * desired file across multiple stream-open calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, and along with it create the streaming file and store the
+ * fileset handle.  The subxact file is created iff there is any subxact info
+ * under this xid.  This entry is used on subsequent streams for the xid to
+ * get the corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with the shared
+ * file sets for the streaming and subxact files.  On every stream start we
+ * need to open the xid's files, and for that we need the shared file set
+ * handles; storing them in the xid hash makes that lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the message to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -542,17 +682,323 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside a remote transaction or inside a
+	 * streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the BufFile,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
+	 * just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty subtransaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure the changes are applied in our per-message memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
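+
+/*
+ * Illustrative message flow for a streamed transaction (the letters are the
+ * action codes dispatched in apply_dispatch; a sketch, not an exhaustive
+ * protocol description):
+ *
+ *   S (first_segment) I I U ... E    -- first chunk, spooled to a file
+ *   S                 D I ...   E    -- later chunks, appended to the file
+ *   c                                -- STREAM COMMIT: replay the file
+ *
+ * A rolled-back subtransaction instead produces 'A' (STREAM ABORT), which
+ * truncates the spool file at the subxact's starting offset.
+ */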
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -565,6 +1011,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1029,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1068,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1186,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1331,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1704,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1845,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1493,6 +1973,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode
+	 * is enabled.  The context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1597,7 +2085,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1909,6 +2397,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option has changed. The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because subscription's streaming option were changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1941,6 +2443,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main
+ * changes file. The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for the top-level transaction by now */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions, there is nothing to do, but if a
+	 * subxact file already exists, delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it has not been created yet, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * The shared fileset must survive across multiple stream start/stop
+		 * calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
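+
+/*
+ * Illustrative layout of the subxact file written above (a sketch; per the
+ * BufFileTell() call in subxact_info_add(), SubXactInfo is assumed to hold
+ * the subxact XID and the fileno/offset of its first change):
+ *
+ *   <nsubxacts>                        -- number of entries
+ *   SubXactInfo subxacts[nsubxacts];   -- {xid, fileno, offset}
+ */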
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If there is no subxact fileset, it means we don't have any subxact
+	 * info to restore.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need it for the whole stream so that we can keep adding subtransaction
+	 * entries to it.  On stream stop we flush the information to the subxact
+	 * file and reset the logical streaming context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
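+
+/*
+ * Example: for subscription OID 16394 and toplevel XID 501 (illustrative
+ * values), the two functions above yield "16394-501.subxacts" and
+ * "16394-501.changes".
+ */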
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the buffile under the logical streaming context so that it
+	 * stays around until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * The shared fileset must survive across multiple stream start/stop
+		 * calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (not including the
+ * length field itself), an action code (identifying the message type) and
+ * the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
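+
+/*
+ * Sketch of the resulting on-disk framing of each change, matching the
+ * writes above (and what apply_handle_stream_commit reads back):
+ *
+ *   int  len;            -- sizeof(char) + body length; excludes len itself
+ *   char action;         -- e.g. 'I', 'U', 'D' or 'T'
+ *   char body[len - 1];  -- message contents, minus the subxact XID
+ */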
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3041,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3..8785d87 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order the transactions are sent in.  Also, the (sub)transactions might get
+ * aborted, so we need to send the schema for each (sub)transaction so that
+ * we don't lose the schema information on abort.  To handle this, we
+ * maintain a list of XIDs (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
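+
+	/*
+	 * Example (illustrative): a client requests streaming by passing the
+	 * option in the START_REPLICATION command, e.g.
+	 *
+	 *   START_REPLICATION SLOT "sub" LOGICAL 0/0
+	 *       (proto_version '2', publication_names '"mypub"', streaming 'on')
+	 */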
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficiently recent protocol
+		 * version, and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -232,6 +315,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -290,9 +378,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may not be applied later (and the regular
+	 * transactions won't see their effects until then), and may be applied
+	 * in an order we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +428,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +470,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +491,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +523,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +543,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +567,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +587,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +612,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +644,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -588,6 +725,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify the downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify the downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
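+/*
+ * Send the start of a streaming block ("stream start" message) for the
+ * given toplevel transaction, including the replication origin on the
+ * first stream of the transaction.
+ */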
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
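+/*
+ * Send the end of the current streaming block ("stream stop" message).
+ */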
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -624,6 +868,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema of the relation has already been sent in the
+ * given streamed transaction.  We expect a relatively small number of
+ * streamed transactions, so a linear search of the list is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record that we have already sent the relation's schema in the given
+ * streamed (toplevel) transaction, by adding its XID to the rel sync entry.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -753,12 +1029,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -793,7 +1102,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
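+
+	/*
+	 * Illustrative usage of the streaming flag (SQL syntax assumed from the
+	 * grammar changes elsewhere in this patch):
+	 *
+	 *   CREATE SUBSCRIPTION mysub CONNECTION '...' PUBLICATION mypub
+	 *       WITH (streaming = on);
+	 */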
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561..89158ed 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c75dceb..56517a9 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction containing DDL, subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check streamed transaction with DDL and subxact rollbacks was applied');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1
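For reference, the expected counts in the two abort tests above can be
checked by hand -- only inserts not undone by a ROLLBACK TO survive.
A sketch of the arithmetic (not part of the patch):

    -- 012_stream_subxact_abort.pl: ROLLBACK TO s2 discards 1001..2000,
    -- ROLLBACK TO s1 discards 501..1000 and 2001..2500; survivors are
    -- 3..500 (498 rows) and 2501..3000 (500 rows), plus the 2
    -- preexisting rows.  Column c never exists on the publisher, so
    -- count(c) stays 0 on the subscriber.
    SELECT 498 + 500 + 2 AS expected_rows;            -- 1000

    -- 013_stream_subxact_ddl_abort.pl: ROLLBACK TO s1 keeps rows 3..500
    -- and the ADD COLUMN c (both precede the savepoint); the re-insert
    -- of 501..1000 then supplies a value for c.
    SELECT 498 + 500 + 2 AS expected_rows,            -- 1000
           500 AS expected_count_c;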

v34-0010-Enable-streaming-for-all-subscription-TAP-tests.patch

From b29739726dc451787a458acafec3ea08dc38de19 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v34 10/12] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f8..4ba8086 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1
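The net effect is that every subscription in the TAP suite now exercises
the streaming code path.  The option itself is plain subscription syntax,
e.g. (connection string elided):

    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=... dbname=postgres'
        PUBLICATION tap_pub
        WITH (streaming = on);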

v34-0011-Add-TAP-test-for-streaming-vs.-DDL.patch

From 749db0d4cc20e6996f9965f414cfe44219a4f5d8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v34 11/12] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1
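A note on the expected counts in this test: DDL is not replicated, so the
subscriber keeps all five columns throughout, and a replicated row only
fills the columns the publisher had at insert time.  A sketch of the
arithmetic (the 4001..5000 batch is rolled back):

    --   rows:  998 + 6 * 1000 inserts + 2 preexisting = 7000
    --   c set for 1001..4000 and 5001..8000           = 6000
    --   d set for 2001..4000, 5001..6000, 7001..8000  = 4000
    --   e set for 3001..4000 and 5001..8000           = 4000
    SELECT 998 + 6 * 1000 + 2 AS expected_rows;        -- 7000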

v34-0012-Add-streaming-option-in-pg_dump.patch

From 3d3e1a5442cb4016bc84b22c4bee10af285b93d4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v34 12/12] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e758b5c..ff2ae37 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBufferStr(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb..af64270 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char	   *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
1.8.3.1
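With this, a subscription created with streaming enabled survives a
dump/restore cycle.  The dumped definition should look roughly like the
following (identifiers and connection string are placeholders; connect =
false and slot_name follow pg_dump's usual conventions for subscriptions):

    CREATE SUBSCRIPTION sub1
        CONNECTION 'host=... dbname=postgres'
        PUBLICATION pub1
        WITH (connect = false, slot_name = 'sub1', streaming = on);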

v34-combined.tar (application/x-tar)
v34-0001-Immediately-WAL-log-subtransaction-and-top-level.patch

From bc3f0b3658df3f141505aaa37d973653de5b20fa Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v34 1/9] Immediately WAL-log subtransaction and top-level XID
 association.

The logical decoding infrastructure needs to know which top-level
transaction a subxact belongs to, in order to decode all the changes.
Until now that association might be delayed until commit, due to the
subxid caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features that
require incremental decoding.

So, when wal_level=logical, we now also write the assignment info into
WAL immediately, as part of the next WAL record (to minimize overhead).
We cannot remove the existing XLOG_XACT_ASSIGNMENT WAL record, as it is
still required to avoid overflow in the hot standby snapshot.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 44 ++++++++++++++--------------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b3ee7fa..bd4c3cf 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5120,6 +5122,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6022,3 +6025,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL log the top-level XID for
+ * operation in a subtransaction.  We require that for logical decoding, see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f..c526bb1 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4..a757bac 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1197,6 +1197,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1235,6 +1236,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..0c0c371 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index db19187..aef8555 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5b14334..d8391aa 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b0f2a6e..b976882 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -304,6 +306,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1
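To illustrate when this kicks in: with wal_level=logical, the first WAL
record generated by a subtransaction now also carries the top-level XID.
A sketch (table t is hypothetical):

    BEGIN;
    SAVEPOINT s1;              -- subxact exists, but no XID / no WAL yet
    INSERT INTO t VALUES (1);  -- the subxact's first WAL record embeds the
                               -- top-level XID (XLR_BLOCK_ID_TOPLEVEL_XID)
    INSERT INTO t VALUES (2);  -- later records omit it; the assignment is
                               -- already known to the decoder
    COMMIT;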

v34-0002-WAL-Log-invalidations-at-command-end-with-wal_le.patch

From ecd37570d595cfa69ba5aa77aa0012d129a06344 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v34 2/9] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end use a new xlog record type
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in the top-level
transaction, and then executed during replay.  This obviates the need to
decode the invalidations as part of a commit record.

LogStandbyInvalidations accumulates all the invalidations in memory and
writes them out only once, at commit time, which may reduce the
performance impact by amortizing the overhead and deduplicating the
invalidations.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 10 +++++
 src/backend/access/transam/xact.c               | 17 ++++++++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 52 ++++++++++++++++++----
 src/backend/utils/cache/inval.c                 | 55 +++++++++++++++++++++++
 src/include/access/xact.h                       | 13 +++++-
 src/include/replication/reorderbuffer.h         |  3 ++
 src/include/utils/inval.h                       |  2 +
 8 files changed, 177 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..68aa994 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd4c3cf..d4f7c29 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1224,6 +1224,16 @@ RecordTransactionCommit(void)
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
 
+	/*
+	 * Log pending invalidations for logical decoding of in-progress
+	 * transactions.  Normally for DDLs, we log this at each command end,
+	 * however, for certain cases where we directly update the system table
+	 * without a transaction block, the invalidations are not logged till this
+	 * time.
+	 */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
 	nchildren = xactGetCommittedChildren(&children);
@@ -6022,6 +6032,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..7153eba 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions,
+				 * otherwise, accumulate them so that they can be processed at
+				 * the commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5251932..1661190 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -856,6 +856,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2201,7 +2204,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases where we skip processing the
+ * transaction (see ReorderBufferForget), we need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2212,17 +2219,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2250,6 +2275,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark the top-level transaction as having catalog changes too, if one
+	 * of its children has, so that ReorderBufferBuildTupleCidHash can
+	 * conveniently check just the top-level transaction and decide whether
+	 * to build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..edd9077 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *	CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -1094,6 +1098,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1510,49 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end or at commit time if any invalidations
+ * are pending.
+ */
+void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..ac3f5e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 019bd38..1055e99 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index bc5081c..463888c 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -61,4 +61,6 @@ extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
+
+extern void LogLogicalInvalidations(void);
 #endif							/* INVAL_H */
-- 
1.8.3.1
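To see the effect on the WAL side: in a transaction mixing DDL and DML,
each command end now emits the pending invalidation messages as an
XLOG_XACT_INVALIDATIONS record, so the decoder can refresh its caches
before decoding the next command.  A sketch, assuming wal_level=logical
(table t is hypothetical):

    BEGIN;
    ALTER TABLE t ADD COLUMN c int;  -- invalidations WAL-logged at command
                                     -- end as XLOG_XACT_INVALIDATIONS
    INSERT INTO t VALUES (1, 2);     -- decoded with the new tuple descriptor
    COMMIT;                          -- commit-time logging of the accumulated
                                     -- invalidations is unchanged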

v34-0003-Extend-the-logical-decoding-output-plugin-API-wi.patch

From 7ad450c37f6b9e7d58af8f676d6cae98c676a5ee Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v34 3/9] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, adding support for
streaming changes for large transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk of
changes streamed for a particular toplevel transaction.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 123 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 +++++++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  59 +++++
 6 files changed, 825 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..4a71826 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,97 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93c..18116c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are five required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may cross the memory limit before having
+    decoded the complete tuple (e.g. having decoded only the toast table
+    insert but not the main table insert).
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
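
The documentation above implies the typical consumer design: because a
streamed transaction may still abort, a downstream client generally buffers
the streamed chunks per (sub)transaction XID, applies them only on
stream_commit, and throws them away on stream_abort. A standalone model of
that bookkeeping (plain C with stand-in types, not actual plugin code)
might look like this:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef uint32_t Xid;

    /* one buffered change, kept until the transaction's fate is known */
    typedef struct BufferedChange
    {
        Xid         xid;            /* (sub)transaction that produced it */
        char        payload[64];    /* stand-in for the decoded change */
        struct BufferedChange *next;
    } BufferedChange;

    static BufferedChange *head;

    /* stream_change: remember the change, do not apply it yet */
    static void
    buffer_change(Xid xid, const char *payload)
    {
        BufferedChange *c = malloc(sizeof(BufferedChange));

        c->xid = xid;
        snprintf(c->payload, sizeof(c->payload), "%s", payload);
        c->next = head;
        head = c;               /* a real consumer would keep arrival order */
    }

    /* stream_abort(xid): drop everything that (sub)xact produced */
    static void
    discard_xid(Xid xid)
    {
        BufferedChange **p = &head;

        while (*p)
        {
            if ((*p)->xid == xid)
            {
                BufferedChange *dead = *p;

                *p = dead->next;
                free(dead);
            }
            else
                p = &(*p)->next;
        }
    }

    /* stream_commit: apply (here, just print) the surviving changes */
    static void
    apply_all(void)
    {
        for (BufferedChange *c = head; c != NULL; c = c->next)
            printf("apply: xid=%u %s\n", c->xid, c->payload);
    }

    int
    main(void)
    {
        buffer_change(503, "INSERT ...");   /* toplevel xact */
        buffer_change(504, "INSERT ...");   /* subxact, rolled back below */
        discard_xid(504);                   /* stream_abort for the subxact */
        apply_all();                        /* stream_commit for xid 503 */
        return 0;
    }

This is also why every stream callback receives the ReorderBufferTXN: the
XID is what lets the client discard exactly the changes belonging to an
aborted subtransaction.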
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be..6ee59bd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. However, we enable streaming when at least one
+	 * of the methods is defined, so that we can easily identify missing
+	 * methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475..deef318 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..2d9aa11 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping the streaming of a block of changes from an
+ * in-progress transaction to a remote node (may be called repeatedly, if
+ * it's streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1055e99..42bc817 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -387,6 +435,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

v34-0004-Implement-streaming-mode-in-ReorderBuffer.patch

From 2d20baff9567c83ad96d032ceb5af4614ee4d0a8 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v34 4/9] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast or speculative insert, we spill to disk because we cannot
generate a complete tuple to stream.  As soon as we get the complete
tuple, we stream the transaction, including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. On receipt of such an
sqlerrcode, the decoding logic aborts the ongoing decoding and returns
gracefully.

We have a ReorderBufferTXN pointer in each ReorderBufferChange, by which
we know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/stream.out       |  40 +
 contrib/test_decoding/expected/truncate.out     |   6 +
 contrib/test_decoding/sql/stream.sql            |  21 +
 contrib/test_decoding/sql/truncate.sql          |   1 +
 contrib/test_decoding/test_decoding.c           |  13 +
 doc/src/sgml/logicaldecoding.sgml               |   9 +-
 doc/src/sgml/test-decoding.sgml                 |  22 +
 src/backend/access/heap/heapam.c                |  13 +
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/access/index/genam.c                |  53 ++
 src/backend/access/table/tableam.c              |   8 +
 src/backend/access/transam/xact.c               |  19 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/logical.c       |  10 +
 src/backend/replication/logical/reorderbuffer.c | 969 +++++++++++++++++++++---
 src/include/access/heapam_xlog.h                |   1 +
 src/include/access/tableam.h                    |  55 ++
 src/include/access/xact.h                       |   4 +
 src/include/replication/logical.h               |   1 +
 src/include/replication/reorderbuffer.h         |  56 +-
 21 files changed, 1256 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql
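
The heart of this patch is the eviction decision: when the reorder buffer's
memory accounting crosses logical_decoding_work_mem, the largest toplevel
transaction is picked and either streamed (when the plugin supports it and
the transaction has no incomplete tuple) or serialized to disk as before. A
simplified standalone sketch of that decision, with illustrative names and
a plain array standing in for the reorder buffer's transaction list (the
real logic lives in reorderbuffer.c below):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct Txn
    {
        uint32_t    xid;
        size_t      size;           /* memory used by decoded changes */
        int         partial_change; /* e.g. toast insert without main insert */
    } Txn;

    static void
    stream_txn(Txn *t)
    {
        printf("stream xid %u (%zu bytes)\n", t->xid, t->size);
    }

    static void
    serialize_txn(Txn *t)
    {
        printf("spill xid %u (%zu bytes) to disk\n", t->xid, t->size);
    }

    /* pick the transaction using the most memory */
    static Txn *
    largest_txn(Txn *txns, int ntxns)
    {
        Txn        *largest = NULL;

        for (int i = 0; i < ntxns; i++)
            if (largest == NULL || txns[i].size > largest->size)
                largest = &txns[i];
        return largest;
    }

    /* called once the memory limit is exceeded */
    static void
    evict_one(Txn *txns, int ntxns, int streaming_enabled)
    {
        Txn        *victim = largest_txn(txns, ntxns);

        if (victim == NULL)
            return;

        /*
         * Even with streaming enabled we must fall back to serializing a
         * transaction whose last tuple is incomplete (say, only the toast
         * chunks have been decoded so far), since a partial tuple cannot
         * be sent downstream.
         */
        if (streaming_enabled && !victim->partial_change)
            stream_txn(victim);
        else
            serialize_txn(victim);
    }

    int
    main(void)
    {
        Txn txns[] = {{501, 1024, 0}, {502, 9000, 1}, {503, 4096, 0}};

        evict_one(txns, 3, 1);      /* spills 502: largest but incomplete */
        return 0;
    }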

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..bdf7002
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,40 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+ count 
+-------
+   157
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..24d41b1
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 4a71826..bb3d9f3 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool		enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 18116c8..98b47b0 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4c..33a4580 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with a valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at the
+	 * tableam API level, but heap_getnext is called from many places, so we
+	 * need to ensure the check here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1945,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has been aborted. We can't directly use
+ * TransactionIdDidAbort because, after a crash, such a transaction might
+ * not have been marked as aborted.  See detailed comments in xact.c where
+ * the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29..a61e279 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -235,6 +235,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..99722ee 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has been aborted after fetching a tuple
+ * from the system tables.  We also ensure that during logical decoding we
+ * never directly access the tableam or heap APIs, because we check for
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7153eba..2010d5a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6ee59bd..8deff89 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1661190..27b4617 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes, so if we have a partial change like a
+ * toast table insert or a speculative insert then we mark such a 'txn' so that
+ * it can't be streamed.  We also ensure that if the changes in such a 'txn'
+ * exceed the logical_decoding_work_mem threshold then we stream them as soon
+ * as we have a complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get the insert or update on the
+	 * main table (both update and insert will do the insert in the toast
+	 * table).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * Speculative confirm change must be preceded by speculative insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it was serialized before and the changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for streaming such a transaction as soon as we get the
+	 * complete change for it is that it previously reached the memory
+	 * threshold but couldn't be streamed because of its incomplete changes.
+	 * Delaying such transactions would increase their apply lag.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed once we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes, we detected that this transaction
+	 * was aborted concurrently, so there is no point in collecting further
+	 * changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
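
To make the partial-change tracking above concrete, here is a minimal
standalone sketch (plain C; Txn and ChangeKind are simplified stand-ins for
illustration, and only the flag value mirrors the patch):

/* Illustrative reduction of the toast-insert flag lifecycle. */
#include <stdio.h>

#define RBTXN_HAS_TOAST_INSERT 0x0010	/* mirrors the patch's flag bit */

typedef enum { CH_TOAST_INSERT, CH_INSERT, CH_UPDATE, CH_DELETE } ChangeKind;
typedef struct { unsigned txn_flags; } Txn;

static void
process_change(Txn *toptxn, ChangeKind kind)
{
	if (kind == CH_TOAST_INSERT)
		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;	/* partial change */
	else if ((toptxn->txn_flags & RBTXN_HAS_TOAST_INSERT) &&
			 (kind == CH_INSERT || kind == CH_UPDATE))
		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;	/* change complete */
}

int
main(void)
{
	Txn			txn = {0};

	process_change(&txn, CH_TOAST_INSERT);	/* toast chunk queued */
	printf("streamable: %s\n",
		   (txn.txn_flags & RBTXN_HAS_TOAST_INSERT) ? "no" : "yes");
	process_change(&txn, CH_INSERT);	/* main-table insert completes it */
	printf("streamable: %s\n",
		   (txn.txn_flags & RBTXN_HAS_TOAST_INSERT) ? "no" : "yes");
	return 0;
}
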
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -764,6 +884,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1310,6 +1468,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Clean up the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1502,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they were originally happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked as
+	 * streamed always, even if it does not contain any changes (that is, when
+	 * all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1737,178 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, the (sub)transaction might get
+ * aborted concurrently.  In such a case, if the (sub)transaction has made
+ * catalog changes then we might decode tuples using the wrong catalog
+ * version.  So, to detect a concurrent abort, we set CheckXidAlive to the xid
+ * of the (sub)transaction to which the current change belongs.  During a
+ * catalog scan we can then check the status of that xid, and if it has
+ * aborted we report a specific error so that we can stop streaming the
+ * current transaction and discard the already streamed changes.  We might
+ * have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine, because when we decode the abort we
+ * will stream an abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive then there
+	 * is nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid aborted; that will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
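
For context, the check that the systable_* layer performs against
CheckXidAlive after each tuple fetch looks roughly like the following; the
genam.c hunk is not part of this excerpt, so treat this as a hedged sketch
rather than the literal patch:

/*
 * Sketch of the catalog-scan side of concurrent abort detection.  If
 * CheckXidAlive is set and that xid is neither in progress nor committed,
 * it must have aborted, so error out with ERRCODE_TRANSACTION_ROLLBACK,
 * which ReorderBufferProcessTXN absorbs below.
 */
static inline void
HandleConcurrentAbort(void)
{
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}
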
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse the same while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN such that it
+ * can be used to stream the remaining data of the transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1931,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1947,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream_start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2053,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2090,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2119,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2152,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2164,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2195,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2241,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2249,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, send the last message for this set of
+		 * changes depending upon streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2295,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2334,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1931,6 +2460,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could load
+		 * the cache as per the current transaction's view (consider DDLs that
+		 * happened in this transaction).  We don't want the decoding of future
+		 * transactions to use those cache entries, so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2545,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2631,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2680,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2702,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2715,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming supported, update the total size in top level as well. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
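
As a concrete illustration of the accounting above, queueing a change of
size sz in a subtransaction bumps three counters, and it is the toplevel
total that later decides which transaction gets streamed (the structs here
are simplified stand-ins, not the reorderbuffer types):

#include <stddef.h>

typedef struct Txn
{
	struct Txn *toptxn;			/* NULL for a toplevel transaction */
	size_t		size;			/* changes of this (sub)transaction only */
	size_t		total_size;		/* toplevel only: includes subxacts */
} Txn;

typedef struct { size_t size; } Buffer;

static void
account_add(Buffer *rb, Txn *txn, size_t sz)
{
	Txn		   *top = txn->toptxn ? txn->toptxn : txn;

	txn->size += sz;			/* per-(sub)transaction, used for spilling */
	rb->size += sz;				/* whole buffer, checked against the limit */
	top->total_size += sz;		/* toplevel total, used to pick what to stream */
}
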
 
 /*
@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2388,6 +2971,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming is enabled, so their size is
+ * always 0), but we can simply iterate over the limited number of toplevel
+ * transactions.
+ *
+ * Note that we skip transactions that contain incomplete changes.  There is
+ * scope for optimization here, in that we could select the largest
+ * transaction that has only complete changes, but that would make the code
+ * and design quite complex and might not be worth the benefit.  If we plan to
+ * stream transactions that contain incomplete changes then we need a way to
+ * partially stream/truncate the in-memory transaction changes, a mechanism to
+ * partially truncate the spilled files, and, whenever we partially stream a
+ * transaction, we need to maintain the last streamed LSN so that next time we
+ * can restore from that segment and offset in the WAL.  As we stream the
+ * changes from the top transaction and restore them subtransaction-wise, we
+ * even need to remember the subxact from which we streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
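
A standalone reduction of this selection loop (hypothetical array-based
harness, not PostgreSQL code) shows the intended comparison; the first
operand must be "largest == NULL", otherwise any later transaction with a
nonzero size would be picked regardless of how large it is:

#include <stddef.h>

typedef struct
{
	size_t		total_size;
	int			has_incomplete;	/* toast/spec insert without the confirm */
} T;

/* Pick the largest complete-change transaction, or NULL if none qualifies. */
static T *
largest_top_txn(T *txns, int n)
{
	T		   *largest = NULL;
	size_t		largest_size = 0;

	for (int i = 0; i < n; i++)
	{
		if ((largest == NULL || txns[i].total_size > largest_size) &&
			txns[i].total_size > 0 && !txns[i].has_incomplete)
		{
			largest = &txns[i];
			largest_size = txns[i].total_size;
		}
	}
	return largest;
}
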
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +3047,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3151,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3363,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * Even if streaming is enabled, we can't start streaming immediately
+	 * when we have previously decoded this transaction and are now just
+	 * restarting, i.e. while we are still skipping already-decoded records.
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all the subtransactions to
+	 * the snapshot's xip array via SnapBuildCommittedTxn, we can't do that
+	 * here; instead we add them to the subxip array via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in subtransactions decoded till
+	 * now to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run.  We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through the subxacts again).  In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because after the last
+		 * streaming run we might have gotten some new sub-transactions, so we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
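
Putting ReorderBufferStreamTXN and ReorderBufferStreamCommit together, the
downstream observes a callback sequence like the following for a large
transaction that hits the memory limit once and then commits (an
illustrative trace; the names match the reorderbuffer-level callbacks used
above):

/* Illustrative trace only; each entry corresponds to one callback. */
static const char *stream_callback_trace[] = {
	/* first run, triggered by logical_decoding_work_mem */
	"stream_start", "stream_change", "stream_change", "stream_stop",
	/* second run, triggered at commit via ReorderBufferStreamCommit() */
	"stream_start", "stream_change", "stream_stop",
	/* and finally */
	"stream_commit",			/* or stream_abort, on rollback */
};
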
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3593,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4302,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4592,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0d28f01..b188427 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index ac3f5e3..5f767eb 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert, without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected a concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

v34-0005-Extend-the-BufFile-interface-for-the-streaming-o.patch

From 9031c5124feb4bb42b7ce29d509fc96228c56429 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v34 5/9] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up to
a particular offset.  Extend the BufFileSeek API to support the SEEK_END
case.  Add an option to provide a mode while opening the shared BufFiles,
instead of always opening in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)
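
A hedged usage sketch of the extended interface (the third argument of
BufFileOpenShared, the SEEK_END support, and BufFileTruncateShared are as
introduced by this patch; the file name and calling context are made up for
illustration):

/*
 * Sketch: reopen a previously created spill file read-write, append at the
 * end, or truncate back to a remembered position after a subxact abort.
 */
static void
append_or_truncate_changes(SharedFileSet *fileset, int saved_fileno,
						   off_t saved_offset, bool aborted)
{
	BufFile    *file = BufFileOpenShared(fileset, "xid-1234-changes", O_RDWR);

	if (!aborted)
	{
		/* Position at the end of the last segment (SEEK_END now works). */
		if (BufFileSeek(file, 0, 0, SEEK_END) != 0)
			elog(ERROR, "could not seek to end of temporary file");
		/* ... BufFileWrite() further changes here ... */
	}
	else
	{
		/* Discard everything past the remembered (fileno, offset). */
		BufFileTruncateShared(file, saved_fileno, saved_offset);
	}

	BufFileClose(file);
}
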

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2..6c97f68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349..c08ff4f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as a
+ * member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY);
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,21 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the file size of the last file to get the last offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over the files from the last one down to the given fileno. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files after fileno can be deleted outright.  The fileno file itself
+		 * can be deleted too when the offset is 0, unless it is the very
+		 * first file, which we always keep (and merely truncate).
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
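
To illustrate the truncation semantics (commentary only, not patch
code): a BufFile is a chain of segment files of up to
MAX_PHYSICAL_FILESIZE bytes each, so truncating to a (fileno, offset)
position deletes the later segments whole and shortens the target one:

/*
 * Illustration: a BufFile with three segments, truncated at (1, 1000).
 *
 *   seg 0           seg 1           seg 2
 *   [0 .. MAX)      [0 .. MAX)      [0 .. pos)
 *
 * BufFileTruncateShared(file, 1, 1000) deletes segment 2 entirely,
 * truncates segment 1 to 1000 bytes and leaves segment 0 untouched;
 * afterwards numFiles == 2 and curOffset == 1000.
 */
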
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * This interface can also be used when the temporary files are used only by
+ * a single backend but need to be opened and closed multiple times and the
+ * underlying files need to survive across transactions.  For such cases,
+ * the dsm segment 'seg' should be passed as NULL.  We remove such files on
+ * proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * No fileset may have been registered before we register the
+			 * fileset cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on process exit.  It walks the
+ * list of all the registered SharedFileSets and deletes the underlying
+ * files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is using the dsm-based cleanup, we don't maintain the
+	 * filesetlist, so there is nothing to do.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
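
In short, the cleanup strategy is now selected by the 'seg' argument
to SharedFileSetInit(); a caller-side sketch (illustrative only):

/* parallel-query style: files are deleted when the last attached
 * backend detaches from the DSM segment */
SharedFileSetInit(fileset, seg);

/* single-backend style: no DSM segment; the files survive local
 * transactions and are removed at proc exit, or earlier when
 * BufFileDeleteShared() unregisters the set via
 * SharedFileSetUnregister() */
SharedFileSetInit(fileset, NULL);
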
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

v34-0006-Add-support-for-streaming-to-built-in-replicatio.patch

From 083949b44a0ab10b126ce6be75f1b68e1de52891 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 08:57:16 +0530
Subject: [PATCH v34 6/9] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transaction by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it.  We don't need to
replicate the changes accumulated during this phase, and moreover we
don't have a replication connection open, so we have nowhere to send
the data anyway.
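
For reference, an informal summary of the protocol additions
implemented below (the exact encoding lives in proto.c):

/*
 * New logical replication messages (informal layout):
 *
 *   'S' xid:int32 first_segment:uint8              STREAM START
 *   'E'                                            STREAM STOP
 *   'c' xid:int32 flags:uint8 commit_lsn:int64
 *       end_lsn:int64 commit_time:int64            STREAM COMMIT
 *   'A' xid:int32 subxid:int32                     STREAM ABORT
 *
 * While a stream is open, the existing messages ('R', 'Y', 'I', 'U',
 * 'D', 'T') additionally carry the (sub)transaction XID immediately
 * after the action byte.
 */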
---
 doc/src/sgml/ref/alter_subscription.sgml           |   4 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  45 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   3 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 946 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 345 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  42 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 18 files changed, 1963 insertions(+), 45 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index c24ace1..d8de56c 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 5bbc165..c25b7c5 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731..f28482f 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026..9065a1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68..479e3ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..83d0642 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
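
On the apply side (worker.c, below), streamed messages are not applied
immediately; each one is spooled to a per-transaction changes file as a
length-prefixed record, roughly:

/*
 * Spool file record, as written during STREAM START/STOP blocks and
 * read back in apply_handle_stream_commit() (informal):
 *
 *   len:int32     length of the original message, action byte included
 *   message       the original message bytes, starting with the action
 *
 * The companion ".subxacts" file stores a count followed by an array
 * of SubXactInfo { xid, fileno, offset } entries marking where each
 * subtransaction's first change starts, so that a STREAM ABORT can
 * truncate the changes file at exactly that point.
 */
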
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f90a896..f0c3278 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions also
+ * requires handling aborts of both the toplevel transaction and of individual
+ * subtransactions.  This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size limit,
+ * (b) it provides automatic cleanup on error, and (c) it allows the files to
+ * survive across local transactions, so they can be opened and closed at each
+ * stream start/stop.  We build on the SharedFileSet infrastructure because
+ * without it the files would be deleted as soon as they are closed, and
+ * keeping the stream files open across stream start/stop would consume a lot
+ * of memory (more than 8K per file).  Moreover, without SharedFileSet we
+ * would need to invent a new way to pass filenames to the BufFile APIs so
+ * that the desired file can be reopened across multiple stream open calls
+ * for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, create the streaming file, and store the fileset handle.  The
+ * subxact file is created iff there is any subxact info under this xid.  This
+ * entry is used on subsequent streams for the xid to look up the
+ * corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table storing the streaming xid information along with the shared
+ * filesets for the stream and subxact files.  On every stream start we need
+ * to open the xid's files, and for that we need the shared fileset handle,
+ * so storing it in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first field of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -542,17 +682,323 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside a remote transaction or inside a
+	 * streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the BufFile,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * If it's an empty subtransaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -565,6 +1011,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1029,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1068,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1186,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1331,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1704,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1845,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1493,6 +1973,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  The context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1597,7 +2085,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1909,6 +2397,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option has changed.  The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1941,6 +2443,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions there is nothing to do, but if a
+	 * subxact file already exists, delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it is not already created, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * We do, however, free the memory allocated for the subxact info.  There
+	 * might be one exceptional transaction with many subxacts, and we don't
+	 * want to keep the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context.  We need
+	 * this information for the duration of the stream, so that we can add
+	 * new subtransaction info to it.  On stream stop we flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the buffile under the logical streaming context so that
+	 * we keep the file around until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length field itself), action code (identifying the message type) and
+ * message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3041,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3..8785d87 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream side is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent.  Also, the (sub) transactions
+ * might get aborted, so we need to send the schema for each (sub) transaction
+ * so that we don't lose the schema information on abort.  To handle this,
+ * we maintain a list of xids (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
+	TransactionId	xid;		/* transaction that created the record */
 
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -232,6 +315,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -290,9 +378,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for this change. We don't
+	 * care whether it's a top-level transaction or not (we have already
+	 * sent that XID when starting the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied later (and regular
+	 * transactions won't see their effects until then), or not at all if
+	 * aborted, and in an order that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +428,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +470,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +491,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +523,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +543,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +567,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +587,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +612,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +644,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -588,6 +725,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
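+	/* if the toplevel transaction itself is aborting, xid and subxid match */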
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
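+/*
+ * Send the stream start message, optionally including the replication
+ * origin, and note that we are now inside a streamed chunk of changes.
+ */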
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
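+	/*
+	 * The transaction is marked as streamed only after its first chunk, so
+	 * this passes first_segment = true exactly once per transaction.
+	 */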
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
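+/*
+ * Send the stream stop message, closing the current streamed chunk.
+ */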
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -624,6 +868,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema of the relation was already sent for the given
+ * streamed transaction. We expect a relatively small number of streamed
+ * transactions, so a simple linear search of the list is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid to the rel sync entry's list of streamed transactions for
+ * which we have already sent the relation's schema.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -753,12 +1029,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -793,7 +1102,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561..89158ed 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
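+/*
+ * Streamed transactions are transferred as a series of chunks, each
+ * demarcated by stream_start/stream_stop messages, and terminated by
+ * either a stream_commit or a stream_abort message.
+ */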
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c75dceb..56517a9 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
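+
+# Of the 5000 rows (a = 1..5000), the 1666 multiples of 3 are deleted,
+# leaving 3334 rows.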
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
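+
+# Rows 1..2500 are inserted across the subtransactions; deleting the 833
+# multiples of 3 leaves 1667 rows.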
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
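+
+# Expect 2002 rows in total; c is set from row 4 onwards (1999 rows),
+# d from row 1001 onwards (1002 rows), and e only for row 2002.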
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
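+
+# Only rows 1..500 and 2501..3000 survive the rollbacks, giving 1000 rows
+# with no c values.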
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction containing DDL, subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
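+
+# After ROLLBACK TO s1 only rows 1..1000 remain, and just the re-inserted
+# rows 501..1000 have c set (500 rows).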
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v34-0007-Enable-streaming-for-all-subscription-TAP-tests.patch

From 89db37c184b1a66ce421c3dd6f6d1cdd916cdd30 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v34 7/9] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f8..4ba8086 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v34-0008-Add-TAP-test-for-streaming-vs.-DDL.patch

From 5d36313c51337ab237203874c26d7397cb02ff66 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v34 8/9] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v34-0009-Add-streaming-option-in-pg_dump.patch

From 37b4939e2f2b373da0cbd3f5c2b69a86324c8971 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v34 9/9] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e758b5c..ff2ae37 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb..af64270 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char       *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
1.8.3.1

#440Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#439)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 15, 2020 at 9:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have reviewed your changes and they look good to me; please find
the latest version of the patch set.

I have done an additional round of review and below are the changes I
made in the attached patch-set.
1. Changed comments in 0002.
2. In 0005, apart from changing a few comments and a function name, I
have changed the code below:
+ if (ReorderBufferCanStream(rb) &&
+ !SnapBuildXactNeedsSkip(builder, ctx->reader->ReadRecPtr))
Here, I think it is better to compare against EndRecPtr.  In the
boundary case the next record could be the same as start_decoding_at,
so why avoid streaming in that case?

Makes sense to me.
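
Concretely, the adjusted check would become roughly the following (a
condensed sketch, assuming rb, ctx and builder are in scope as in the
patch-set):

    /*
     * Allow streaming once the snapshot builder says this position no
     * longer needs to be skipped.  Comparing against EndRecPtr instead
     * of ReadRecPtr also covers the boundary case where the record
     * ends exactly at start_decoding_at.
     */
    if (ReorderBufferCanStream(rb) &&
        !SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
        return true;    /* OK to start streaming */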

3. In 0006, made the following changes:
a. Removed the function ReorderBufferFreeChange and added a new
parameter to ReorderBufferReturnChange to achieve the same purpose
(see the sketch after this list).
b. Changed quite a few comments and function names, added additional
Asserts, and made a few other cosmetic changes.
4. In 0007, made the following changes:
a. Removed the unnecessary change in .gitignore.
b. Changed the newly added option name to "stream-change".
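
To illustrate 3a, the consolidated interface is roughly as follows (a
sketch only; the exact parameter name may differ in the patch):

    /*
     * Sketch: instead of a separate ReorderBufferFreeChange(), the
     * caller of ReorderBufferReturnChange() indicates whether the
     * memory accounting should be updated while releasing the change.
     */
    void
    ReorderBufferReturnChange(ReorderBuffer *rb,
                              ReorderBufferChange *change, bool upd_mem)
    {
        /* update memory accounting info */
        if (upd_mem)
            ReorderBufferChangeMemoryUpdate(rb, change, false);

        /* ... then free the change's data and the change itself ... */
    }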

Apart from the above, I have merged patches 0004, 0005, 0006 and 0007,
as those seem to form one piece of functionality to me. For the sake of
review, the patch-set that contains the merged patches is attached
separately as v34-combined.

Let me know what you think of the changes.

I have reviewed the changes and they look fine to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#441Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#440)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me know what you think of the changes.

I have reviewed the changes and they look fine to me.

Thanks. I am planning to start committing a few of the infrastructure
patches (especially the first two) by early next week, as we have
resolved all the open issues and done an extensive review of the
entire patch-set. In the attached version, there is a slight change in
one of the commit messages as compared to the previous version. I
would like to briefly describe the first two patches for the sake of
convenience. Let me know if you or anyone else sees any problems with
them.

The first patch in the series allows us to WAL-log the subtransaction
and top-level XID association. The logical decoding infrastructure
needs to know which top-level transaction the subxact belongs to, in
order to decode all the changes. Until now that might be delayed until
commit, due to the caching (PGPROC_MAX_CACHED_SUBXIDS), preventing
features requiring incremental decoding. So we also write the
assignment info into WAL immediately, as part of the next WAL record
(to minimize overhead), but only when *wal_level=logical*. We cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record as that is
required for avoiding overflow in the hot standby snapshot.
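
In code terms, the core of the change is small. Condensed from the
attached 0001 patch (not a complete excerpt):

    /* xloginsert.c: piggy-back the top-level XID on the next record */
    if (IsSubTransactionAssignmentPending())
    {
        TransactionId xid = GetTopTransactionIdIfAny();

        /* noted so XLogResetInsertion can mark the subxact assigned */
        XLogSetRecordFlags(XLOG_INCLUDE_XID);

        *(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
        memcpy(scratch, &xid, sizeof(TransactionId));
        scratch += sizeof(TransactionId);
    }

    /* decode.c: assign the subxact to its top-level xact right away */
    txid = XLogRecGetTopXid(record);
    if (TransactionIdIsValid(txid))
        ReorderBufferAssignChild(ctx->reorder, txid,
                                 record->decoded_record->xl_xid,
                                 buf.origptr);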

The second patch writes invalidations into WAL at command end when
wal_level=logical, so that decoding can use this information. This
patch is required to allow the streaming of in-progress transactions
in logical decoding. We still add the invalidations to the cache and
write them to WAL at commit time in RecordTransactionCommit(). This
uses the existing XLOG_INVALIDATIONS xlog record type, from the
RM_STANDBY_ID resource manager (see LogStandbyInvalidations for
details), so existing code relying on those invalidations (e.g. redo)
does not need to be changed. The invalidations written at command end
use a new xlog record type, XLOG_XACT_INVALIDATIONS, from the
RM_XACT_ID resource manager (see LogLogicalInvalidations for details).
These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records. The
invalidations are decoded and accumulated in the top-level
transaction, and then executed during replay. This obviates the need
to decode the invalidations as part of a commit record.
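
The mechanics, condensed from the attached 0002 patch (not a complete
excerpt):

    /* inval.c, CommandEndInvalidationMessages(): WAL-log the
     * per-command invalidation messages for wal_level=logical */
    if (XLogLogicalInfoActive())
        LogLogicalInvalidations();

    /* decode.c, on XLOG_XACT_INVALIDATIONS: accumulate the messages in
     * the transaction, or execute them immediately if there is no xid */
    if (TransactionIdIsValid(xid))
    {
        ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
                                      invals->nmsgs, invals->msgs);
        ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
                                          buf->origptr);
    }
    else
        ReorderBufferImmediateInvalidation(ctx->reorder,
                                           invals->nmsgs, invals->msgs);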

Performance testing has shown no penalty with either of the patches,
but the second patch does generate some additional WAL: in most cases
2-5%, and up to 15% in the worst cases and for some specific DDLs.
However, that happens only at wal_level=logical. We considered an
alternative of blowing away all caches on any DDL in WALSenders, but
that would have both CPU and network overhead. For detailed results
and analysis see [1] and [2].

[1]: /messages/by-id/CAKYtNAqWkPpPFrdEbpPrCan3G_QAcankZarRKKd7cj6vQigM7w@mail.gmail.com
[2]: /messages/by-id/CAA4eK1L3PoiBw6uogB7jD5rmdT-GmEF4kOEccS1AWKuBhSkQkQ@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v35.tar (application/x-tar)
v35-0001-Immediately-WAL-log-subtransaction-and-top-level.patch

From 5471f2bc9d3f4199541e3042b587557147e9dd5c Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 5 Jun 2020 09:03:16 +0530
Subject: [PATCH v35 1/9] Immediately WAL-log subtransaction and top-level XID
 association.

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead) only when wal_level=logical.
We cannot remove the existing XLOG_XACT_ASSIGNMENT WAL as that is
required for avoiding overflow in the hot standby snapshot.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/transam/xact.c        | 50 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c  | 23 +++++++++++++--
 src/backend/access/transam/xlogreader.c  |  5 ++++
 src/backend/replication/logical/decode.c | 44 ++++++++++++++--------------
 src/include/access/xact.h                |  3 ++
 src/include/access/xlog.h                |  1 +
 src/include/access/xlogreader.h          |  3 ++
 src/include/access/xlogrecord.h          |  1 +
 8 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b3ee7fa..bd4c3cf 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -191,6 +191,7 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		chain;			/* start a new block after this one */
+	bool		assigned;		/* assigned to top-level XID */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -223,6 +224,7 @@ typedef struct SerializedTransactionState
 static TransactionStateData TopTransactionStateData = {
 	.state = TRANS_DEFAULT,
 	.blockState = TBLOCK_DEFAULT,
+	.assigned = false,
 };
 
 /*
@@ -5120,6 +5122,7 @@ PushTransaction(void)
 	GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);
 	s->prevXactReadOnly = XactReadOnly;
 	s->parallelModeLevel = 0;
+	s->assigned = false;
 
 	CurrentTransactionState = s;
 
@@ -6022,3 +6025,50 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+/*
+ * IsSubTransactionAssignmentPending
+ *
+ * This is used to decide whether we need to WAL log the top-level XID for
+ * operation in a subtransaction.  We require that for logical decoding, see
+ * LogicalDecodingProcessRecord.
+ *
+ * This returns true if wal_level >= logical and we are inside a valid
+ * subtransaction, for which the assignment was not yet written to any WAL
+ * record.
+ */
+bool
+IsSubTransactionAssignmentPending(void)
+{
+	/* wal_level has to be logical */
+	if (!XLogLogicalInfoActive())
+		return false;
+
+	/* we need to be in a transaction state */
+	if (!IsTransactionState())
+		return false;
+
+	/* it has to be a subtransaction */
+	if (!IsSubTransaction())
+		return false;
+
+	/* the subtransaction has to have a XID assigned */
+	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+		return false;
+
+	/* and it should not be already 'assigned' */
+	return !CurrentTransactionState->assigned;
+}
+
+/*
+ * MarkSubTransactionAssigned
+ *
+ * Mark the subtransaction assignment as completed.
+ */
+void
+MarkSubTransactionAssigned(void)
+{
+	Assert(IsSubTransactionAssignmentPending());
+
+	CurrentTransactionState->assigned = true;
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f..c526bb1 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -89,11 +89,13 @@ static XLogRecData hdr_rdt;
 static char *hdr_scratch = NULL;
 
 #define SizeOfXlogOrigin	(sizeof(RepOriginId) + sizeof(char))
+#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
 	(SizeOfXLogRecord + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
-	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
+	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
+	 SizeOfXLogTransactionId)
 
 /*
  * An array of XLogRecData structs, to hold registered data.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
 	int			i;
 
+	/* reset the subxact assignment flag (if needed) */
+	if (curinsert_flags & XLOG_INCLUDE_XID)
+		MarkSubTransactionAssigned();
+
 	for (i = 0; i < max_registered_block_id; i++)
 		registered_buffers[i].in_use = false;
 
@@ -398,7 +404,7 @@ void
 XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	curinsert_flags = flags;
+	curinsert_flags |= flags;
 }
 
 /*
@@ -748,6 +754,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		scratch += sizeof(replorigin_session_origin);
 	}
 
+	/* followed by toplevel XID, if not already included in previous record */
+	if (IsSubTransactionAssignmentPending())
+	{
+		TransactionId xid = GetTopTransactionIdIfAny();
+
+		/* update the flag (later used by XLogResetInsertion) */
+		XLogSetRecordFlags(XLOG_INCLUDE_XID);
+
+		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
+		memcpy(scratch, &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4..a757bac 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1197,6 +1197,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 	state->decoded_record = record;
 	state->record_origin = InvalidRepOriginId;
+	state->toplevel_xid = InvalidTransactionId;
 
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
@@ -1235,6 +1236,10 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		{
 			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
 		}
+		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
+		{
+			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..0c0c371 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -94,11 +94,27 @@ void
 LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record)
 {
 	XLogRecordBuffer buf;
+	TransactionId txid;
 
 	buf.origptr = ctx->reader->ReadRecPtr;
 	buf.endptr = ctx->reader->EndRecPtr;
 	buf.record = record;
 
+	txid = XLogRecGetTopXid(record);
+
+	/*
+	 * If the top-level xid is valid, we need to assign the subxact to the
+	 * top-level xact. We need to do this for all records, hence we do it
+	 * before the switch.
+	 */
+	if (TransactionIdIsValid(txid))
+	{
+		ReorderBufferAssignChild(ctx->reorder,
+								 txid,
+								 record->decoded_record->xl_xid,
+								 buf.origptr);
+	}
+
 	/* cast so we get a warning when new rmgrs are added */
 	switch ((RmgrId) XLogRecGetRmid(record))
 	{
@@ -216,13 +232,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
 	 * bail out.
-	 *
-	 * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
-	 * when the snapshot is being built: it is possible to get later records
-	 * that require subxids to be properly assigned.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
-		info != XLOG_XACT_ASSIGNMENT)
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
 		return;
 
 	switch (info)
@@ -264,22 +275,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
-			{
-				xl_xact_assignment *xlrec;
-				int			i;
-				TransactionId *sub_xid;
 
-				xlrec = (xl_xact_assignment *) XLogRecGetData(r);
-
-				sub_xid = &xlrec->xsub[0];
-
-				for (i = 0; i < xlrec->nsubxacts; i++)
-				{
-					ReorderBufferAssignChild(reorder, xlrec->xtop,
-											 *(sub_xid++), buf->origptr);
-				}
-				break;
-			}
+			/*
+			 * We assign subxact to the toplevel xact while processing each
+			 * record if required.  So, we don't need to do anything here.
+			 * See LogicalDecodingProcessRecord.
+			 */
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index db19187..aef8555 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -428,6 +428,9 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
+extern bool IsSubTransactionAssignmentPending(void);
+extern void MarkSubTransactionAssigned(void);
+
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5b14334..d8391aa 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b0f2a6e..b976882 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -191,6 +191,8 @@ struct XLogReaderState
 
 	RepOriginId record_origin;
 
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+
 	/* information about blocks referenced by the record. */
 	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
 
@@ -304,6 +306,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
 #define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
 #define XLogRecGetData(decoder) ((decoder)->main_data)
 #define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
 #define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index acd9af0..2f0c8bf 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -223,5 +223,6 @@ typedef struct XLogRecordDataHeaderLong
 #define XLR_BLOCK_ID_DATA_SHORT		255
 #define XLR_BLOCK_ID_DATA_LONG		254
 #define XLR_BLOCK_ID_ORIGIN			253
+#define XLR_BLOCK_ID_TOPLEVEL_XID	252
 
 #endif							/* XLOGRECORD_H */
-- 
1.8.3.1

v35-0002-WAL-Log-invalidations-at-command-end-with-wal_le.patch

From 395c9a56bafa6ac2d5028da1c201ee4f3f8212b5 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v35 2/9] WAL Log invalidations at command end with
 wal_level=logical.

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end uses a new xlog record type
XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in the top-level
transaction, and then
executed during replay.  This obviates the need to decode the
invalidations as part of a commit record.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 10 +++++
 src/backend/access/transam/xact.c               | 17 ++++++++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 52 ++++++++++++++++++----
 src/backend/utils/cache/inval.c                 | 55 +++++++++++++++++++++++
 src/include/access/xact.h                       | 13 +++++-
 src/include/replication/reorderbuffer.h         |  3 ++
 src/include/utils/inval.h                       |  2 +
 8 files changed, 177 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..68aa994 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd4c3cf..d4f7c29 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1224,6 +1224,16 @@ RecordTransactionCommit(void)
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
 
+	/*
+	 * Log pending invalidations for logical decoding of in-progress
+	 * transactions.  Normally for DDLs, we log this at each command end,
+	 * however, for certain cases where we directly update the system table
+	 * without a transaction block, the invalidations are not logged till this
+	 * time.
+	 */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
 	nchildren = xactGetCommittedChildren(&children);
@@ -6022,6 +6032,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..7153eba 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions,
+				 * otherwise, accumulate them so that they can be processed at
+				 * the commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5251932..1661190 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -856,6 +856,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2201,7 +2204,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases where we skip processing the
+ * transaction (see ReorderBufferForget), we need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2212,17 +2219,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2250,6 +2275,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark top-level transaction as having catalog changes too if one of its
+	 * children has so that the ReorderBufferBuildTupleCidHash can
+	 * conveniently check just top-level transaction and decide whether to
+	 * build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..edd9077 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *      CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -1094,6 +1098,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1510,49 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end or at commit time if any invalidations
+ * are pending.
+ */
+void
+LogLogicalInvalidations()
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..ac3f5e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 019bd38..1055e99 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index bc5081c..463888c 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -61,4 +61,6 @@ extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
+
+extern void LogLogicalInvalidations(void);
 #endif							/* INVAL_H */
-- 
1.8.3.1

v35-0003-Extend-the-logical-decoding-output-plugin-API-wi.patch

From 01fe6aaa0ebf7dbfef8d3830e4bbac16c7534149 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v35 3/9] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, adding support for
streaming changes for large transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular top-level transaction.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 123 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 +++++++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  59 +++++
 6 files changed, 825 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..4a71826 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,97 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93c..18116c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+    
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are five required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point, the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may exceed the memory limit while holding
+    an incomplete tuple, e.g. a toast-table insert has been decoded but the
+    corresponding main-table insert has not yet been seen.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
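
As a minimal sketch of how a plugin might use the API above (the
pg_decode_stream_* names are illustrative placeholders for functions the
plugin itself must implement; only the registration is shown):

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	cb->startup_cb = pg_decode_startup;
	cb->begin_cb = pg_decode_begin_txn;
	cb->change_cb = pg_decode_change;
	cb->commit_cb = pg_decode_commit_txn;
	cb->shutdown_cb = pg_decode_shutdown;

	/* streaming of large in-progress transactions */
	cb->stream_start_cb = pg_decode_stream_start;
	cb->stream_stop_cb = pg_decode_stream_stop;
	cb->stream_abort_cb = pg_decode_stream_abort;
	cb->stream_commit_cb = pg_decode_stream_commit;
	cb->stream_change_cb = pg_decode_stream_change;
	/* the optional stream_message_cb and stream_truncate_cb may stay NULL */
}
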
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be..6ee59bd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar
+	 * to regular output plugins. However, we enable streaming when at least
+	 * one of the methods is defined, so that missing required methods can
+	 * be easily identified.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475..deef318 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..2d9aa11 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from an in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
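
A minimal sketch of a streaming change callback, in the spirit of
test_decoding (my_stream_change and the output text are illustrative): the
change contents are deliberately not decoded here, since the transaction may
still abort.

static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
	OutputPluginWrite(ctx, true);
}
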
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1055e99..42bc817 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -387,6 +435,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

v35-0004-Implement-streaming-mode-in-ReorderBuffer.patch

From 77ea9418add32dd857390ba94bf6414a2f389b03 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v35 4/9] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast or speculative insert, we spill to disk because we cannot
generate the complete tuple to stream.  As soon as we get the complete
tuple, we stream the transaction, including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

Now that we can stream in-progress transactions, concurrent aborts may
cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. On receipt of such an
sqlerrcode, the decoding logic aborts the ongoing decoding and returns
gracefully.

We have a ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/stream.out       |  40 +
 contrib/test_decoding/expected/truncate.out     |   6 +
 contrib/test_decoding/sql/stream.sql            |  21 +
 contrib/test_decoding/sql/truncate.sql          |   1 +
 contrib/test_decoding/test_decoding.c           |  13 +
 doc/src/sgml/logicaldecoding.sgml               |   9 +-
 doc/src/sgml/test-decoding.sgml                 |  22 +
 src/backend/access/heap/heapam.c                |  13 +
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/access/index/genam.c                |  53 ++
 src/backend/access/table/tableam.c              |   8 +
 src/backend/access/transam/xact.c               |  19 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/logical.c       |  10 +
 src/backend/replication/logical/reorderbuffer.c | 969 +++++++++++++++++++++---
 src/include/access/heapam_xlog.h                |   1 +
 src/include/access/tableam.h                    |  55 ++
 src/include/access/xact.h                       |   4 +
 src/include/replication/logical.h               |   1 +
 src/include/replication/reorderbuffer.h         |  56 +-
 21 files changed, 1256 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql
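
Since each ReorderBufferChange points back at its ReorderBufferTXN, the
abort callback can distinguish a subtransaction abort from a toplevel one.
A minimal sketch (my_stream_abort and the output text are illustrative;
toptxn is the patch's link from a subxact to its toplevel xact):

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	/* txn may be a subtransaction; then only its changes are discarded */
	bool		is_subxact = (txn->toptxn != NULL);

	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "aborting streamed %s %u",
					 is_subxact ? "subtransaction" : "transaction", txn->xid);
	OutputPluginWrite(ctx, true);
}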

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..bdf7002
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,40 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+ count 
+-------
+   157
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..24d41b1
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 4a71826..bb3d9f3 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 18116c8..98b47b0 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4c..33a4580 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at tableam
+	 * level API but this is called from many places so we need to ensure it
+	 * here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1945,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set, then set a flag to indicate that a system
+	 * table scan is in progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out, if CheckXidAlive is aborted. We can't directly use
+ * TransactionIdDidAbort, as after a crash such a transaction might not
+ * have been marked as aborted.  See detailed comments in xact.c where the
+ * variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29..a61e279 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -235,6 +235,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..99722ee 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive is aborted after fetching the tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
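
Condensed sketch of how the decoding side uses CheckXidAlive (simplified
from the patch's reorderbuffer error handling; memory-context switching and
further cleanup are omitted): the xid is advertised before replaying
changes, and an ERRCODE_TRANSACTION_ROLLBACK raised by a systable_* scan is
treated as a concurrent abort rather than a hard failure.

	PG_TRY();
	{
		CheckXidAlive = txn->xid;	/* systable_* scans now verify this xid */

		/* ... replay / stream the transaction's changes ... */

		CheckXidAlive = InvalidTransactionId;
	}
	PG_CATCH();
	{
		ErrorData  *errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/* concurrent abort: stop decoding this transaction gracefully */
			txn->concurrent_abort = true;
			FlushErrorState();
			FreeErrorData(errdata);
		}
		else
			PG_RE_THROW();
	}
	PG_END_TRY();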
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7153eba..2010d5a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6ee59bd..8deff89 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1661190..27b4617 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes, so if we have a partial change like a toast
+ * table insert or a speculative insert, we mark such a 'txn' so that it can't be
+ * streamed.  We also ensure that if the changes in such a 'txn' exceed the
+ * logical_decoding_work_mem threshold, then we stream them as soon as we have a
+ * complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get the insert or update on the main
+	 * table (both update and insert will do the insert in the toast table).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * Speculative confirm change must be preceded by speculative insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it was serialized before and the changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for streaming such a transaction as soon as we get the
+	 * complete change is that it previously reached the memory threshold
+	 * but couldn't be streamed because of the incomplete changes.  Delaying
+	 * such transactions would increase their apply lag.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed when we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes, we detected that the transaction
+	 * aborted.  So there is no point in collecting further changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -764,6 +884,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1310,6 +1468,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1502,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they originally happened inside another subtxn, so we won't ever
+		 * recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is,
+	 * when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
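
/*
 * In short: ReorderBufferCleanupTXN() is for a finished transaction and
 * frees everything including the ReorderBufferTXN itself, whereas
 * ReorderBufferTruncateTXN() runs between streaming runs and frees only
 * the changes streamed so far, keeping the TXN, its snapshot and its
 * invalidations around for the rest of the transaction.
 */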
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1737,178 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set the xid for the concurrent abort check.
+ *
+ * While streaming an in-progress transaction, the (sub)transaction might get
+ * aborted concurrently.  If the (sub)transaction has made catalog updates,
+ * we might then decode tuples using the wrong catalog version.  To detect
+ * such a concurrent abort, we set CheckXidAlive to the xid of the
+ * (sub)transaction to which the current change belongs.  During each catalog
+ * scan we check the status of that xid, and if it has aborted we report a
+ * specific error so that we can stop streaming the current transaction and
+ * discard its changes.  We might have already streamed some changes of the
+ * aborted (sub)transaction, but that is fine: when we decode the abort we
+ * will stream an abort message that truncates those changes on the
+ * subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the transaction is not committed yet.  We
+	 * don't check whether the xid has aborted; that will happen during
+	 * catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
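
/*
 * The detection side lives in the catalog access paths (systable scans)
 * elsewhere in this patch series; a minimal sketch of the expected check,
 * assuming the error code described above:
 */
static inline void
HandleConcurrentAbort(void)
{
	/* If the xid we depend on has meanwhile aborted, bail out. */
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}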
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying a change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Store the command id and snapshot at the end of the current stream so
+ * that we can reuse them when sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.  This resets the TXN so that it can
+ * be used to stream the remaining data of the transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true, the data will be sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1931,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1947,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't invoke the stream_start callback before processing
+			 * the first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2053,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2090,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2119,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2152,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2164,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2195,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2241,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2249,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes; send the final message for this set
+		 * of changes, depending on the streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2295,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2334,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
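
/*
 * To summarize the commit-time decision above:
 *
 *   rbtxn_is_streamed(txn)  -> stream the remaining changes, then invoke
 *                              stream_commit and clean up
 *   base_snapshot == NULL   -> nothing was decoded, just clean up
 *   otherwise               -> classic begin / apply_change ... / commit
 *                              replay via ReorderBufferProcessTXN()
 */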
 
 /*
@@ -1931,6 +2460,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that loaded
+		 * cache entries built from this transaction's view of the catalogs
+		 * (consider DDL that happened inside it).  We don't want the
+		 * decoding of future transactions to use those entries, so execute
+		 * the invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2545,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2631,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2680,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we additionally track the total size in the
+ * toplevel transaction - we can't stream subtransactions individually, and
+ * we only pick toplevel transactions when evicting by streaming, so that
+ * combined size is what matters there.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2702,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2715,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming is supported, also update the top-level total size. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
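
/*
 * These counters drive the eviction logic; an illustrative helper (the
 * real check is inline in ReorderBufferCheckMemoryLimit, and
 * logical_decoding_work_mem is in kB, hence the multiplication):
 */
static bool
ReorderBufferMemoryExceeded(ReorderBuffer *rb)
{
	return rb->size >= logical_decoding_work_mem * 1024L;
}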
 
 /*
@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2388,6 +2971,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because with streaming we don't update
+ * the memory accounting of subtransactions, so their size is always 0), but
+ * it simply iterates over the limited number of toplevel transactions.
+ *
+ * Note that we skip transactions containing incomplete changes.  There is
+ * room for optimization here: we could select the largest transaction that
+ * has only complete changes.  But that would make the code and design quite
+ * complex, and it might not be worth the benefit.  If we ever wanted to
+ * stream transactions with incomplete changes, we would need a way to
+ * partially stream/truncate the in-memory changes, a mechanism to partially
+ * truncate the spilled files, and - because we stream changes from the top
+ * transaction but restore them per subtransaction - we would also need to
+ * remember the last streamed LSN (segment and offset in WAL) to restore
+ * from, and the subxact from which we streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at a time to evict and spill its changes
  * to disk, until we get below the memory limit.
@@ -2419,11 +3047,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3151,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3363,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * Even if streaming is enabled, don't start streaming while we are
+	 * merely re-decoding records we already processed before a restart
+	 * (i.e. while the snapshot builder says to skip this transaction).
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit which adds xids of all the subtransactions in
+	 * snapshot's xip array via SnapBuildCommittedTxn, we can't do that here
+	 * but we do add them to subxip array instead via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in subtransactions decoded till
+	 * now to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database so far, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run.  We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through the subxacts again).  In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because new
+		 * sub-transactions may have appeared since the last streaming run,
+		 * and we need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
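
/*
 * Putting it together, an output plugin streaming one large transaction
 * in two runs sees roughly this callback sequence (a sketch):
 *
 *   stream_start(txn)            <- first run, memory limit exceeded
 *       stream_change(txn, ...)  <- repeated
 *   stream_stop(txn)
 *   ... more WAL is decoded ...
 *   stream_start(txn)            <- final run, at commit
 *       stream_change(txn, ...)  <- repeated
 *   stream_stop(txn)
 *   stream_commit(txn)           <- or stream_abort() on rollback
 */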
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3593,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4302,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4592,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from a future command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
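
/*
 * The producer side of XLH_INSERT_ON_TOAST_RELATION is in heap_insert();
 * a hedged sketch of the expected placement while building the
 * xl_heap_insert record (exact location assumed):
 *
 *     if (RelationIsLogicallyLogged(relation) && IsToastRelation(relation))
 *         xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 */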
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0d28f01..b188427 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index ac3f5e3..5f767eb 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * This transaction's changes contain a toast insert without the
+ * corresponding main-table insert.
+ */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes contain a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

v35-0005-Extend-the-BufFile-interface-for-the-streaming-o.patch

From fca4404114225a03872a5719c79b5cc385404ffc Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v35 5/9] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and be opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up to
a particular offset.  Extend the BufFileSeek API to support the SEEK_END
case.  Add an option to provide a mode while opening the shared BufFiles
instead of always opening in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2..6c97f68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349..c08ff4f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files used by a single backend when the
+ * corresponding files need to survive across transactions and be opened and
+ * closed multiple times.  Such files need to be created as a
+ * member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the size of the last file, which gives us the ending
+			 * offset in that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files upto the fileno which we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files beyond the fileno can simply be deleted.  If the offset is
+		 * 0, the fileno file itself can be deleted as well, unless it is
+		 * the first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
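
/*
 * Worked example (illustrative): given a shared BufFile with segments
 * 0, 1 and 2, BufFileTruncateShared(file, 1, 100) deletes segment 2 and
 * truncates segment 1 to 100 bytes, while BufFileTruncateShared(file, 1, 0)
 * deletes both segments 2 and 1.  Segment 0 is never deleted, only
 * truncated.
 */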
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface when the temporary files are used only by
+ * a single backend but need to be opened and closed multiple times and the
+ * underlying files need to survive across transactions.  For such cases,
+ * pass NULL for the dsm segment 'seg'; such files are then removed on
+ * process exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * No fileset can have been registered before we register the
+			 * fileset cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function invoked on process exit.  It walks the list of
+ * registered shared filesets and deletes the underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is following the dsm-based cleanup, we don't maintain
+	 * the filesetlist, so there is nothing to do.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

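To illustrate the API extension above: with seg = NULL, a single backend can
now create a fileset that is not tied to a DSM segment, close the file, and
reopen it later from another transaction.  A minimal sketch (the calling
context and the file name are illustrative, not part of the patch):

    SharedFileSet *fileset;
    BufFile    *file;

    /* must outlive the transaction, so allocate in a long-lived context */
    fileset = MemoryContextAlloc(TopMemoryContext, sizeof(SharedFileSet));

    /* seg = NULL: backend-local fileset, cleaned up at proc exit */
    SharedFileSetInit(fileset, NULL);

    file = BufFileCreateShared(fileset, "spool");
    /* ... write some data ... */
    BufFileClose(file);         /* the underlying file survives the close */

    /* possibly in a later transaction: reopen for read/write, truncate */
    file = BufFileOpenShared(fileset, "spool", O_RDWR);
    BufFileTruncateShared(file, 0, 0);
    BufFileClose(file);

    /* done: remove the files and drop the proc-exit registration */
    SharedFileSetDeleteAll(fileset);
    SharedFileSetUnregister(fileset);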
v35-0006-Add-support-for-streaming-to-built-in-replicatio.patch
From 94169c7627db5b32f049ac64df5d6940a0c78f8d Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 08:57:16 +0530
Subject: [PATCH v35 6/9] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so we
have nowhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |   4 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  45 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   3 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 946 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 345 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  42 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 18 files changed, 1963 insertions(+), 45 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index c24ace1..d8de56c 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -163,8 +163,8 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
      <para>
       This clause alters parameters originally set by
       <xref linkend="sql-createsubscription"/>.  See there for more
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal> and <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 5bbc165..c25b7c5 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
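For example, with this patch applied, streaming could be requested when a
subscription is created, e.g. CREATE SUBSCRIPTION mysub CONNECTION '...'
PUBLICATION mypub WITH (streaming = on), and later toggled with ALTER
SUBSCRIPTION mysub SET (streaming = off); the subscription and publication
names here are placeholders.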
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb15731..f28482f 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -65,6 +65,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9ebb026..9065a1b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 						   bool *enabled, bool *create_slot,
 						   bool *slot_name_given, char **slot_name,
 						   bool *copy_data, char **synchronous_commit,
-						   bool *refresh)
+						   bool *refresh, bool *streaming,
+						   bool *streaming_given)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -90,6 +91,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 		*synchronous_commit = NULL;
 	if (refresh)
 		*refresh = true;
+	if (streaming)
+	{
+		*streaming = false;
+		*streaming_given = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -175,6 +178,16 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given,
 			refresh_given = true;
 			*refresh = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -318,6 +331,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -334,7 +349,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	parse_subscription_options(stmt->options, &connect, &enabled_given,
 							   &enabled, &create_slot, &slotname_given,
 							   &slotname, &copy_data, &synchronous_commit,
-							   NULL);
+							   NULL, &streaming, &streaming_given);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -412,6 +427,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -669,10 +691,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *slotname;
 				bool		slotname_given;
 				char	   *synchronous_commit;
+				bool		streaming;
+				bool		streaming_given;
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, &slotname_given, &slotname,
-										   NULL, &synchronous_commit, NULL);
+										   NULL, &synchronous_commit, NULL,
+										   &streaming, &streaming_given);
 
 				if (slotname_given)
 				{
@@ -697,6 +722,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subsynccommit - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -708,7 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL,
 										   &enabled_given, &enabled, NULL,
-										   NULL, NULL, NULL, NULL, NULL);
+										   NULL, NULL, NULL, NULL, NULL,
+										   NULL, NULL);
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -746,7 +779,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, &refresh);
+										   NULL, &refresh, NULL, NULL);
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
@@ -783,7 +816,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 
 				parse_subscription_options(stmt->options, NULL, NULL, NULL,
 										   NULL, NULL, NULL, &copy_data,
-										   NULL, NULL);
+										   NULL, NULL, NULL, NULL);
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68..479e3ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
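(The four new wait events above should correspond to reads and writes of the
per-transaction changes and subxact spool files used by the apply worker; see
the worker.c changes below.)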
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e4fd1f9..5257ab0 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
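(For illustration: with the option enabled, the walreceiver's
START_REPLICATION option list would carry something like

    proto_version '1', streaming 'on', publication_names '"mypub"'

where the publication name is a placeholder.)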
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 3c6d0cd..83d0642 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -139,10 +139,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple)
+logicalrep_write_insert(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -178,8 +183,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -187,6 +192,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -248,7 +257,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+						Relation rel, HeapTuple oldtuple)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -256,6 +266,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -296,6 +310,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -305,6 +320,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -347,12 +366,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -397,7 +420,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -405,6 +428,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -685,3 +712,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
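One subtlety of the reader side is worth spelling out: for the data messages
('I'/'U'/'D'/'T'/'R'/'Y'), the 4-byte XID prefix is present only between
STREAM START and STREAM END, so whether to read it depends on state the
consumer tracks itself.  A sketch of how a consumer might strip it (this
helper is hypothetical; the apply worker below does the equivalent inside
handle_streamed_transaction):

    static TransactionId
    maybe_read_stream_xid(StringInfo s, bool in_stream)
    {
        /* outside a stream there is no xid prefix at all */
        if (!in_stream)
            return InvalidTransactionId;

        /* same 4-byte encoding the write side uses via pq_sendint32() */
        return pq_getmsgint(s, 4);
    }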
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f90a896..f0c3278 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires handling aborts of both the toplevel transaction and of
+ * subtransactions. This is achieved by tracking the offset of each
+ * subtransaction's first change, which is then used to truncate the file
+ * with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of normal temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides automatic cleanup on error, and (c) it allows the
+ * files to survive across local transactions so that they can be opened and
+ * closed at stream start and stop.  We use the SharedFileSet infrastructure
+ * because without it the files would be deleted on close, and keeping the
+ * stream files open across stream start/stop would consume a lot of memory
+ * (more than 8K).  Moreover, without SharedFileSet we would need to invent
+ * a new way to pass filenames to the BufFile APIs so that the same file can
+ * be reopened across multiple stream-open calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash and, along with it, create the streaming file and store the
+ * fileset handle.  The subxact file is created iff there is any subxact info
+ * under this xid.  This entry is used on subsequent streams for the xid to
+ * get the corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with the shared
+ * file sets for the streaming and subxact files.  On every stream start we
+ * need to open the xid's files, and for that we need the shared fileset
+ * handles, so storing them in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -542,17 +682,323 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside a remote transaction or a
+	 * streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; it will be committed on stream
+	 * stop.  We need the transaction for handling the BufFile, used for
+	 * serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting changes for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty subtransaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -565,6 +1011,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -580,6 +1029,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -616,6 +1068,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -731,6 +1186,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -873,6 +1331,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1243,6 +1704,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1381,6 +1845,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1493,6 +1973,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  It is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1597,7 +2085,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1909,6 +2397,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option has changed.  The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because subscription's streaming option were changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1941,6 +2443,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info.  There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need this information for the whole stream so that we can add new
+	 * subtransaction info to it.  On stream stop we will flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER | HASH_FIND,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context so that
+	 * they remain available until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: the length (not including
+ * the length field itself), the action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2106,6 +3041,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.slotname = myslotname;
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 15379e3..8785d87 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -46,29 +46,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this,
+ * we maintain a list of xids (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -94,11 +122,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -114,15 +148,24 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names)
+						List **publication_names, bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
+	bool		streaming_given = false;
 
 	foreach(lc, options)
 	{
@@ -168,6 +211,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 						(errcode(ERRCODE_INVALID_NAME),
 						 errmsg("invalid publication_names syntax")));
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -180,6 +240,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -202,7 +263,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Parse the params and ERROR if we see any we don't recognize */
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
-								&data->publication_names);
+								&data->publication_names,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -232,6 +315,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -290,9 +378,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because they may be aborted (and thus never applied) or
+	 * committed in an order we don't know at this point, and regular
+	 * transactions won't see their effects until then.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -308,19 +428,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -344,17 +470,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called in both streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -363,6 +491,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -391,7 +523,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -411,7 +543,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple);
+				logicalrep_write_insert(ctx->out, xid, relation, tuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -435,7 +567,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple, newtuple);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -455,7 +587,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple);
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple);
 				OutputPluginWrite(ctx, true);
 			}
 			else
@@ -480,6 +612,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -508,13 +644,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -588,6 +725,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+/*
+ * Notify downstream that a block of changes for an in-progress (streamed)
+ * transaction follows.
+ */
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * Notify downstream that the current block of streamed changes is complete.
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -624,6 +868,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema of the relation was already sent in the given
+ * streamed transaction.  We expect a relatively small number of streamed
+ * transactions, so a simple linear search of the list is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Remember (in the rel sync entry) that we have already sent the schema of
+ * the relation in the given streamed transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -753,12 +1029,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -793,7 +1102,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0a756d4..617b909 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -48,6 +48,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -73,6 +75,7 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 4860561..89158ed 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -86,25 +90,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c75dceb..56517a9 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -177,6 +177,7 @@ typedef struct
 		{
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a transaction exceeding logical_decoding_work_mem, with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check data consistency after rollback of subtransactions with DDL');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v35-0007-Enable-streaming-for-all-subscription-TAP-tests.patch

From 766eff8e9337cbde25f22df77fb9cc5b749a1a49 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v35 7/9] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f8..4ba8086 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v35-0008-Add-TAP-test-for-streaming-vs.-DDL.patch

From 707e18f5c058fa62f0edef3d79e616ede1d2ef90 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v35 8/9] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v35-0009-Add-streaming-option-in-pg_dump.patch

From 8862d13a923526f5d115b7052131f270a42b2fbc Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 27 Apr 2020 15:36:39 +0530
Subject: [PATCH v35 9/9] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 9 +++++++--
 src/bin/pg_dump/pg_dump.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 857c7c2..154721b 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4235,8 +4236,8 @@ getSubscriptions(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT s.tableoid, s.oid, s.subname,"
 					  "(%s s.subowner) AS rolname, "
-					  " s.subconninfo, s.subslotname, s.subsynccommit, "
-					  " s.subpublications "
+					  " s.substream, s.subconninfo, s.subslotname, "
+					  " s.subsynccommit, s.subpublications "
 					  "FROM pg_subscription s "
 					  "WHERE s.subdbid = (SELECT oid FROM pg_database"
 					  "                   WHERE datname = current_database())",
@@ -4249,6 +4250,7 @@ getSubscriptions(Archive *fout)
 	i_oid = PQfnumber(res, "oid");
 	i_subname = PQfnumber(res, "subname");
 	i_rolname = PQfnumber(res, "rolname");
+	i_substream = PQfnumber(res, "substream");
 	i_subconninfo = PQfnumber(res, "subconninfo");
 	i_subslotname = PQfnumber(res, "subslotname");
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
@@ -4265,6 +4267,7 @@ getSubscriptions(Archive *fout)
 		AssignDumpId(&subinfo[i].dobj);
 		subinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_subname));
 		subinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
+		subinfo[i].substream = pg_strdup(PQgetvalue(res, i, i_substream));
 		subinfo[i].subconninfo = pg_strdup(PQgetvalue(res, i, i_subconninfo));
 		if (PQgetisnull(res, i, i_subslotname))
 			subinfo[i].subslotname = NULL;
@@ -4342,6 +4345,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	else
 		appendPQExpBufferStr(query, "NONE");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBufferStr(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0c2fcfb..af64270 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -623,6 +623,7 @@ typedef struct _SubscriptionInfo
 {
 	DumpableObject dobj;
 	char	   *rolname;
+	char	   *substream;
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subsynccommit;
-- 
1.8.3.1

#442Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#441)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me know what you think of the changes?

I have reviewed the changes and looks fine to me.

Thanks, I am planning to start committing a few of the infrastructure
patches (especially the first two) by early next week, as we have
resolved all the open issues and done an extensive review of the
entire patch set. In the attached version, there is a slight change in
one of the commit messages as compared to the previous version. I
would like to briefly describe the first two patches for convenience.
Let me know if you or anyone else sees any problems with these.

The first patch in the series allows us to WAL-log the subtransaction
and top-level XID association. The logical decoding infrastructure
needs to know which top-level transaction a subxact belongs to in
order to decode all the changes. Until now that knowledge might be
delayed until commit, due to the caching (PGPROC_MAX_CACHED_SUBXIDS),
preventing features that require incremental decoding. So we now also
write the assignment info into WAL immediately, as part of the next
WAL record (to minimize overhead), but only when *wal_level=logical*.
We cannot remove the existing XLOG_XACT_ASSIGNMENT record, as it is
required to avoid snapshot overflow on a hot standby.
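
As a standalone sketch of the idea (a toy model, not the patch's
actual code; record_assignment and the fixed-size map are invented for
illustration), the subxact-to-toplevel map is now built incrementally
from the assignment carried by each subxact's first WAL record,
instead of becoming known only at commit:

#include <stdio.h>

typedef unsigned int TransactionId;

#define MAX_XID 16

static TransactionId toplevel_of[MAX_XID];	/* 0 = not yet assigned */

/* called while decoding the first record of a subtransaction */
static void
record_assignment(TransactionId subxid, TransactionId topxid)
{
	if (toplevel_of[subxid] == 0)
	{
		toplevel_of[subxid] = topxid;
		printf("subxact %u assigned to toplevel %u\n", subxid, topxid);
	}
}

int
main(void)
{
	record_assignment(5, 3);	/* first record of subxact 5 */
	record_assignment(5, 3);	/* repeated assignment is a no-op */
	record_assignment(7, 3);	/* first record of subxact 7 */
	return 0;
}

With the map available while the transaction is still in progress, the
reorder buffer can route each subxact's changes to the right top-level
transaction immediately, which is what incremental decoding needs.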

The second patch writes WAL for invalidations at command end when
wal_level=logical, so that decoding can use this information. This
patch is required to allow the streaming of in-progress transactions
in logical decoding. We still add the invalidations to the cache and
write them to WAL at commit time in RecordTransactionCommit(), using
the existing XLOG_INVALIDATIONS xlog record type from the
RM_STANDBY_ID resource manager (see LogStandbyInvalidations for
details), so existing code relying on those invalidations (e.g. redo)
does not need to be changed. The invalidations written at command end
use a new xlog record type, XLOG_XACT_INVALIDATIONS, from the
RM_XACT_ID resource manager (see LogLogicalInvalidations for details).
These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records. The
invalidations are decoded and accumulated in the top-level
transaction, and then executed during replay. This obviates the need
to decode the invalidations as part of a commit record.
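
As a rough standalone sketch (again a toy model, not the PostgreSQL
code; InvalMsg, Txn and add_invalidations are invented names), the
accumulation under the top-level transaction is just an append to a
growing array, mirroring the repalloc-based logic in
ReorderBufferAddInvalidations:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct InvalMsg { int cache_id; } InvalMsg;	/* stand-in message */

typedef struct Txn
{
	InvalMsg   *invalidations;
	int			ninvalidations;
} Txn;

/* append one command's invalidation messages to the transaction */
static void
add_invalidations(Txn *txn, int nmsgs, const InvalMsg *msgs)
{
	/* realloc(NULL, ...) behaves as malloc for the first command */
	txn->invalidations = realloc(txn->invalidations,
								 sizeof(InvalMsg) * (txn->ninvalidations + nmsgs));
	memcpy(txn->invalidations + txn->ninvalidations, msgs,
		   sizeof(InvalMsg) * nmsgs);
	txn->ninvalidations += nmsgs;
}

int
main(void)
{
	Txn			txn = {NULL, 0};
	InvalMsg	cmd1[] = {{1}, {2}};	/* messages from the first DDL */
	InvalMsg	cmd2[] = {{3}};	/* messages from a later command */

	add_invalidations(&txn, 2, cmd1);
	add_invalidations(&txn, 1, cmd2);
	printf("accumulated %d invalidation messages\n", txn.ninvalidations);
	free(txn.invalidations);
	return 0;
}

Executing the accumulated array once during replay (or when the
transaction is skipped) is why decoding no longer needs to extract the
invalidations from the commit record.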

Performance testing has shown no performance penalty with either
patch. The second patch does generate some additional WAL (in most
cases 2-5%, and up to 15% in the worst cases for some specific DDLs),
but that happens at wal_level=logical only. We considered an
alternative of blowing away all caches on any DDL in the WALSenders,
but that would incur both CPU and network overhead. For detailed
results and analysis see [1][2].

[1] - /messages/by-id/CAKYtNAqWkPpPFrdEbpPrCan3G_QAcankZarRKKd7cj6vQigM7w@mail.gmail.com
[2] - /messages/by-id/CAA4eK1L3PoiBw6uogB7jD5rmdT-GmEF4kOEccS1AWKuBhSkQkQ@mail.gmail.com

The patch set needed a rebase after the binary format option support
in CREATE SUBSCRIPTION was committed. I have rebased the patch set on
the latest HEAD and also added a test case for streaming in binary
format.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v36.tar (application/x-tar)
v36/v36-0001-WAL-Log-invalidations-at-command-end-with-wal_le.patch

From 2df4ba84a9bb92c4fa0cbd92593cab9d48a52762 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v36 1/8] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end uses a new xlog record type
XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in the top-level
transaction, and then executed during replay.  This obviates the need
to decode the
invalidations as part of a commit record.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 10 +++++
 src/backend/access/transam/xact.c               | 17 ++++++++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 52 ++++++++++++++++++----
 src/backend/utils/cache/inval.c                 | 55 +++++++++++++++++++++++
 src/include/access/xact.h                       | 13 +++++-
 src/include/replication/reorderbuffer.h         |  3 ++
 src/include/utils/inval.h                       |  2 +
 8 files changed, 177 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..68aa994 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd4c3cf..d4f7c29 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1224,6 +1224,16 @@ RecordTransactionCommit(void)
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
 
+	/*
+	 * Log pending invalidations for logical decoding of in-progress
+	 * transactions.  Normally for DDLs, we log this at each command end,
+	 * however, for certain cases where we directly update the system table
+	 * without a transaction block, the invalidations are not logged till this
+	 * time.
+	 */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
 	nchildren = xactGetCommittedChildren(&children);
@@ -6022,6 +6032,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..7153eba 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions,
+				 * otherwise, accumulate them so that they can be processed at
+				 * the commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 449327a..ce6e621 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -856,6 +856,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2201,7 +2204,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases where we skip processing the
+ * transaction (see ReorderBufferForget), we need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2212,17 +2219,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2250,6 +2275,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark top-level transaction as having catalog changes too if one of its
+	 * children has so that the ReorderBufferBuildTupleCidHash can
+	 * conveniently check just top-level transaction and decide whether to
+	 * build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..edd9077 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *	CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -1094,6 +1098,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1510,49 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end or at commit time if any invalidations
+ * are pending.
+ */
+void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..ac3f5e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 019bd38..1055e99 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index bc5081c..463888c 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -61,4 +61,6 @@ extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
+
+extern void LogLogicalInvalidations(void);
 #endif							/* INVAL_H */
-- 
1.8.3.1

v36/v36-0002-Extend-the-logical-decoding-output-plugin-API-wi.patch

From ef81a97a30e9285d204d2779995b3c51b2ff0c41 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v36 2/8] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, adding support for
streaming changes for large transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 123 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 +++++++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  59 +++++
 6 files changed, 825 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..4a71826 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,97 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93c..18116c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+    
+
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to the spill-to-disk behavior, streaming is triggered when the
+    total amount of changes decoded from the WAL (for all in-progress
+    transactions) exceeds the limit defined by the
+    <varname>logical_decoding_work_mem</varname> setting.  At that point the
+    largest toplevel transaction (measured by the amount of memory currently
+    used for decoded changes) is selected and streamed.  However, in some
+    cases we still have to spill to disk even if streaming is enabled, because
+    we may cross the memory limit before a complete tuple has been decoded
+    (e.g. only the toast table insert has been decoded, but not the main table insert).
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be..6ee59bd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. We however enable streaming when at least one
+	 * of the methods is enabled, so that we can easily identify missing
+	 * methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475..deef318 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..2d9aa11 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if the transaction is streamed in
+ * multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when done streaming a block of changes from an in-progress
+ * transaction to a remote node (may be called repeatedly, if the transaction
+ * is streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to the remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to the remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1055e99..42bc817 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -387,6 +435,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

v36/v36-0003-Implement-streaming-mode-in-ReorderBuffer.patch

From 6ae620db5ba3c5f42bff414c40d1749c6f5483e2 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v36 3/8] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast or speculative insert, we spill to disk, because we
cannot generate the complete tuple to stream.  As soon as we get the
complete tuple, we stream the transaction, including the serialized
changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and thanks
to logging the invalidation messages.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. The decoding logic, on
receipt of such an sqlerrcode, aborts the ongoing decoding and returns
gracefully.

Each ReorderBufferChange carries a pointer to its ReorderBufferTXN, so we
know which xact a change belongs to.  The output plugin can use this to
decide which changes to discard in case of stream_abort_cb (e.g. when a
subxact gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/stream.out       |  40 +
 contrib/test_decoding/expected/truncate.out     |   6 +
 contrib/test_decoding/sql/stream.sql            |  21 +
 contrib/test_decoding/sql/truncate.sql          |   1 +
 contrib/test_decoding/test_decoding.c           |  13 +
 doc/src/sgml/logicaldecoding.sgml               |   9 +-
 doc/src/sgml/test-decoding.sgml                 |  22 +
 src/backend/access/heap/heapam.c                |  13 +
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/access/index/genam.c                |  53 ++
 src/backend/access/table/tableam.c              |   8 +
 src/backend/access/transam/xact.c               |  19 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/logical.c       |  10 +
 src/backend/replication/logical/reorderbuffer.c | 969 +++++++++++++++++++++---
 src/include/access/heapam_xlog.h                |   1 +
 src/include/access/tableam.h                    |  55 ++
 src/include/access/xact.h                       |   4 +
 src/include/replication/logical.h               |   1 +
 src/include/replication/reorderbuffer.h         |  56 +-
 21 files changed, 1256 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..bdf7002
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,40 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+ count 
+-------
+   157
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..24d41b1
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 4a71826..bb3d9f3 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
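+	/*
+	 * Keep streaming enabled only if the stream-changes option was set;
+	 * ctx->streaming already reflects whether streaming is supported.
+	 */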
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 18116c8..98b47b0 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4c..33a4580 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at the
+	 * tableam level API, but this function is called from many places, so we
+	 * need to ensure it here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1945,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
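+			/*
+			 * Mark inserts into toast relations, so that logical decoding
+			 * can recognize them as incomplete (partial) changes when
+			 * streaming is used.
+			 */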
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted. We can't directly use
+ * TransactionIdDidAbort, because after a crash such a transaction might not
+ * have been marked as aborted.  See detailed comments in xact.c where the
+ * variable is declared.
+ */
+static inline void
+HandleConcurrentAbort()
+{
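+	/* An xid that is neither in progress nor committed must have aborted. */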
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29..a61e279 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -235,6 +235,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..99722ee 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such a transaction gets aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching a tuple from the
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for concurrent
+ * aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7153eba..2010d5a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
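+	/*
+	 * Tell the reorder buffer whether this insert is into a toast relation,
+	 * so that it can track partial changes for streaming.
+	 */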
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6ee59bd..8deff89 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ce6e621..c469536 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
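+/*
+ * Helper macros to classify a change's action, used when tracking partial
+ * (incomplete) changes.
+ */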
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
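+	/*
+	 * Callers pass upd_mem = false for changes that were never added to the
+	 * memory accounting, e.g. a change discarded before being queued.
+	 */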
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes, so if we have a partial change, like a
+ * toast table insert or a speculative insert, we mark such a 'txn' so that it
+ * can't be streamed.  We also ensure that if the changes in such a 'txn' exceed
+ * the logical_decoding_work_mem threshold, we stream them as soon as we have a
+ * complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get the insert or update on the
+	 * main table (both update and insert do the toast table insert first).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * Speculative confirm change must be preceded by speculative insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it was serialized before and the changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for streaming such a transaction as soon as we get the
+	 * complete change is that it has already crossed the memory threshold but
+	 * could not be streamed earlier because of the incomplete changes.
+	 * Delaying it further would only increase the apply lag.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed once we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes, we detected that the transaction
+	 * was aborted, so there is no point in collecting further changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -764,6 +884,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1310,6 +1468,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1502,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they originally happened inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is,
+	 * when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1737,178 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
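+	/* Stream whatever remains of the transaction before committing it. */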
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has made catalog updates, we might decode a tuple using the
+ * wrong catalog version.  So, to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction the current change belongs
+ * to.  During a catalog scan we check the status of that xid, and if it has
+ * aborted we report a specific error, so that we can stop streaming the
+ * current transaction and discard the already streamed changes.  We might
+ * have already streamed some of the changes for the aborted (sub)transaction,
+ * but that is fine, because when we decode the abort we will stream an abort
+ * message to truncate the changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the transaction is not committed yet.  We don't
+	 * check whether the xid has aborted; that will happen during catalog
+	 * access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream, so that we can reuse them while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN such that it
+ * can be used to stream the remaining data of transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true, the data will be sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1931,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1947,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
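+					/* Remember the transaction's origin from its first change. */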
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2053,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2090,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2119,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2152,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2164,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2195,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2241,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2249,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, send the last message for this set of
+		 * changes depending upon streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2295,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
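+		/*
+		 * Switch back to the memory context we entered the function with and
+		 * copy the error data, so that we can inspect its sqlerrcode below.
+		 */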
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2334,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then send the stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1931,6 +2460,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could load
+		 * the cache as per the current transaction's view (consider DDLs that
+		 * happened in this transaction). We don't want the decoding of future
+		 * transactions to use those cache entries, so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2545,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2631,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2680,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we additionally update the total size counter
+ * of the toplevel transaction - we can't stream subtransactions on their
+ * own anyway, and we only pick toplevel transactions for eviction, so the
+ * toplevel total is the number that matters.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2702,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2715,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming is supported, also update the toplevel's total size. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2388,6 +2971,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update memory account
+ * for subtransaction with streaming, so it's always 0). But we can simply
+ * iterate over the limited number of toplevel transactions.
+ *
+ * Note that we skip transactions that contain incomplete changes.  There is
+ * scope for optimization here: we could select the largest transaction that
+ * has only complete changes.  But that would make the code and design quite
+ * complex, and might not be worth the benefit.  If we ever plan to stream
+ * transactions containing incomplete changes, we need a way to partially
+ * stream/truncate the transaction changes in-memory, and a mechanism to
+ * partially truncate the spilled files.  We would also need to remember the
+ * last streamed LSN so that the next run can restore from that WAL segment
+ * and offset, and even the subxact from which we streamed the last change,
+ * since we stream changes from the top transaction but restore them
+ * subtransaction by subtransaction.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +3047,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3151,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3363,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * Even if streaming is enabled, we can't start streaming immediately
+	 * if we have previously decoded this transaction and are now merely
+	 * restarting (see SnapBuildXactNeedsSkip).
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit which adds xids of all the subtransactions in
+	 * snapshot's xip array via SnapBuildCommittedTxn, we can't do that here
+	 * but we do add them to subxip array instead via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in subtransactions decoded till
+	 * now to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3593,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4302,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4592,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0d28f01..b188427 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index ac3f5e3..5f767eb 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert, without main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

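A side note for readers tracing the reorderbuffer changes above: from the
output plugin's perspective, a large transaction that hits the memory limit
twice and then commits produces a callback sequence along these lines (an
illustration, not part of the patch):

	/*
	 * stream_start(txn)    <- first streaming run, memory limit reached
	 *   change callbacks ...
	 * stream_stop(txn)     <- snapshot_now/command_id saved for the next run
	 *
	 * stream_start(txn)    <- second run, saved snapshot reused and extended
	 *   change callbacks ...
	 * stream_stop(txn)
	 *
	 * stream_commit(txn)   <- ReorderBufferStreamCommit, on the final commit
	 * stream_abort(txn)    <- or, via ReorderBufferAbort, on rollback
	 */

The ERRCODE_TRANSACTION_ROLLBACK handled in the PG_CATCH block is raised
when the transaction being streamed aborts concurrently while we read its
catalog data. As a minimal sketch (the real check belongs to the
SetupCheckXidLive machinery referenced above; its exact placement is an
assumption here), the test looks like:

	/* Sketch: bail out of catalog access if the streamed xact has aborted. */
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
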
v36/v36-0004-Extend-the-BufFile-interface-for-the-streaming-o.patch

From 6a92ffeaca3bed9c137327970f3f8faed8c3a2b9 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v36 4/8] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up to
a particular offset.  Extend the BufFileSeek API to support the SEEK_END
case.  Add an option to provide a mode while opening the shared BufFiles
instead of always opening in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

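To make the pieces concrete before the diff: a hedged sketch of how a
single-backend user of this interface (such as the apply worker in patch
0005) might combine the new mode argument, SEEK_END support and truncation.
The file name is made up, error handling is omitted, and the fileset is
assumed to have been set up with SharedFileSetInit(fileset, NULL) so that
it survives the transaction:

	#include <fcntl.h>
	#include "storage/buffile.h"

	static void
	append_batch_then_discard(SharedFileSet *fileset)
	{
		BufFile    *file;
		int			fileno;
		off_t		offset;

		/* Reopen an existing file; read-write mode is now permitted. */
		file = BufFileOpenShared(fileset, "xid-513-changes", O_RDWR);

		/* Position at the end of the file, using the new SEEK_END support. */
		BufFileSeek(file, 0, 0, SEEK_END);

		/* Remember where this batch starts, in case it must be undone. */
		BufFileTell(file, &fileno, &offset);
		BufFileWrite(file, (void *) "some change data", 16);

		/* On e.g. a subtransaction abort, drop everything past the mark. */
		BufFileTruncateShared(file, fileno, offset);
		BufFileClose(file);
	}

The remember-offset/append/truncate-back pattern is what lets an aborted
subtransaction's streamed changes be discarded without rewriting the whole
file.
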
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2..6c97f68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349..c08ff4f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as
+ * a member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY);
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,21 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the file size of the last file to get the last offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and the offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files from the last one down to the given fileno. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Except the fileno, we can directly delete other files.  If the
+		 * offset is 0 then we can delete the fileno file as well unless it is
+		 * the first file.
+		 */
+		if ((i != fileno || offset == 0) && fileno != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * This interface can also be used when the temporary files are used by only
+ * a single backend but need to be opened and closed multiple times, and the
+ * underlying files need to survive across transactions.  For such cases, the
+ * dsm segment 'seg' should be passed as NULL.  We remove such files on proc
+ * exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering
+			 * the cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on the process exit.  This will
+ * process the list of all the registered sharedfilesets and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell   *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach(l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool		found = false;
+	ListCell   *l;
+
+	/*
+	 * If the caller is following the dsm-based cleanup then we don't
+	 * maintain the filesetlist, so there is nothing to unregister.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach(l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

v36/v36-0005-Add-support-for-streaming-to-built-in-replicatio.patch

From fa6cd7fb20ed0af4f52b608efe6d53c37ce493e0 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v36 5/8] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We don't
need to replicate the changes accumulated during this phase, and
moreover we don't have a replication connection open, so there is
nowhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   3 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 946 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 19 files changed, 2059 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

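Before diving into the diff, a sketch of how the new protocol fits together
may help. The publisher brackets each batch of streamed changes with stream
start/stop messages and finishes with a stream commit or abort. On the
apply side, a dispatcher can be imagined roughly as below; the message bytes
and read functions are the ones added to proto.c in this patch, while the
apply_* helpers are hypothetical stand-ins for the worker.c handlers:

	/* Hedged sketch: apply-side dispatch of the new stream messages. */
	static void
	handle_stream_message(StringInfo s, char action)
	{
		switch (action)
		{
			case 'S':			/* stream start */
				{
					bool		first_segment;
					TransactionId xid;

					xid = logicalrep_read_stream_start(s, &first_segment);
					apply_open_stream_file(xid, first_segment); /* hypothetical */
					break;
				}
			case 'E':			/* stream stop, no payload */
				apply_close_stream_file();	/* hypothetical */
				break;
			case 'c':			/* stream commit */
				{
					LogicalRepCommitData commit_data;
					TransactionId xid;

					xid = logicalrep_read_stream_commit(s, &commit_data);
					apply_replay_spooled_changes(xid, &commit_data); /* hypothetical */
					break;
				}
			case 'A':			/* stream abort */
				{
					TransactionId xid;
					TransactionId subxid;

					logicalrep_read_stream_abort(s, &xid, &subxid);
					apply_discard_spooled_changes(xid, subxid); /* hypothetical */
					break;
				}
		}
	}

A subscription opts into this behavior with CREATE SUBSCRIPTION ... WITH
(streaming = on), per the documentation changes below.
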
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal> and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index e6afb32..153c562 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68..479e3ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e905723..00b665d 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming)
+			appendStringInfoString(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 2b1356e..c205950 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -732,3 +759,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
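
To illustrate the wire format, here is a minimal standalone sketch (not part
of the patch) of decoding the fixed-size part of a STREAM COMMIT message from
a raw buffer; the integers are big-endian, as written by pq_sendint32 and
pq_sendint64 above:

    #include <stdint.h>

    /* read big-endian integers from a raw buffer */
    static uint32_t get_u32(const unsigned char *p)
    {
        return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
               ((uint32_t) p[2] << 8) | (uint32_t) p[3];
    }

    static uint64_t get_u64(const unsigned char *p)
    {
        return ((uint64_t) get_u32(p) << 32) | get_u32(p + 4);
    }

    /* buf points just past the 'c' message-type byte */
    static void decode_stream_commit(const unsigned char *buf,
                                     uint32_t *xid, uint64_t *commit_lsn,
                                     uint64_t *end_lsn, uint64_t *commit_time)
    {
        *xid = get_u32(buf);            /* toplevel transaction ID */
        /* buf[4] is the flags byte, currently always zero */
        *commit_lsn = get_u64(buf + 5);
        *end_lsn = get_u64(buf + 13);
        *commit_time = get_u64(buf + 21);
    }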
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 407eee3..7071309 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires handling aborts of both the toplevel transaction and of
+ * subtransactions. This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides automatic cleanup on error and (c) it allows the
+ * files to survive across local transactions, so they can be opened and
+ * closed at stream start and stop.  We use the SharedFileSet infrastructure
+ * because without it the files would be deleted as soon as they are closed,
+ * and keeping the stream files open across start/stop would consume a lot of
+ * memory (more than 8K per file).  Moreover, without SharedFileSet we would
+ * also need to invent a new way to pass filenames to the BufFile APIs, so
+ * that we could reopen the desired file across multiple stream-open calls
+ * for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, create the streaming file and store the fileset handle.  The
+ * subxact file is created iff there is any subxact info under this xid.  On
+ * subsequent streams for the same xid, this entry is used to look up the
+ * corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per-stream context for streaming transactions */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table storing the streaming xid information along with the shared
+ * filesets for the changes and subxact files.  On every stream start we need
+ * to open the xid's files, and for that we need the shared fileset handles,
+ * so storing them in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the currently open streamed-changes file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the change to the file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -608,17 +748,323 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * An ORIGIN message can only come inside a remote transaction or a
+	 * streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; it will be committed at stream
+	 * stop.  We need the transaction for handling the BufFile, used for
+	 * serializing the streamed data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
+	 * just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty subtransaction, we will not find the subxid here,
+		 * so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
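
As a concrete (hypothetical) example: suppose the changes file contains the
chunks of subxacts 734 and 735, with subxacts[] recording {734, fileno 0,
offset 4096} and {735, fileno 0, offset 8192}. An abort of subxid 735
truncates the changes file back to offset 8192 and sets nsubxacts = 1; an
abort of 734 truncates to offset 4096 and also discards the entry for 735,
which started later.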
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -631,6 +1077,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -646,6 +1095,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -682,6 +1134,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -797,6 +1252,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -938,6 +1396,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1308,6 +1769,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1446,6 +1910,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1558,6 +2038,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  It is reset at each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1662,7 +2150,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1935,6 +2423,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option was changed.  The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1967,6 +2469,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have created the entry for the top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions, there is nothing to do, but if a
+	 * subxact file already exists, delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it is not created yet, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info.  There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
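
The resulting subxact file layout, matching the writes above, is simply:

    uint32      nsubxacts;      /* number of entries that follow */
    SubXactInfo subxacts[];     /* {xid, fileno, offset}, one per subxact */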
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context.  We need
+	 * this information for the duration of the stream so that we can add new
+	 * subtransaction info to it.  At stream stop we flush the information to
+	 * the subxact file and reset the logical streaming context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
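
So, taking a hypothetical subscription OID 16394 streaming toplevel xid 733,
the worker would spool into "16394-733.changes" and "16394-733.subxacts".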
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER | HASH_FIND,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context so that
+	 * they stay open until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: length (not including the
+ * length field itself), action code (identifying the message type) and the
+ * message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so would not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
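
The resulting on-disk record layout, matching the reader loop in
apply_handle_stream_commit(), is:

    int     len;        /* sizeof(action) + size of data[] below */
    char    action;     /* message type, e.g. 'I', 'U' or 'D' */
    char    data[];     /* message body as received, minus the subxact XID */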
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2133,6 +3068,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this, we
+ * maintain a list of xids (streamed_txns) for which we have already sent the
+ * schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may not be applied at all (e.g. on abort),
+	 * may be applied in an order we don't know at this point, and the
+	 * regular transactions won't see their effects until then.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify the downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify the downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
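
Putting the callbacks together, a large transaction streamed in two chunks
would produce a wire sequence along these lines (xids hypothetical):

    S(xid=733, first_segment=1)  R ... I ... U ...    E    <- first chunk
    S(xid=733, first_segment=0)  I(subxid=734) ...    E    <- second chunk
    c(xid=733, flags=0, commit_lsn, end_lsn, commit_time)

or, on rollback, a final A(xid=733, subxid=733) instead of the stream commit.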
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions, so a simple
+ * linear search of the list is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record the xid in the rel sync entry, marking that we have already sent
+ * the schema of the relation for this streamed toplevel transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;                 /* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 287288a..3cc4878 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -93,25 +97,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check replicated data with columns added by in-transaction DDL');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back DDL and subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v36/v36-0006-Enable-streaming-for-all-subscription-TAP-tests.patch

From 48b45a1f31bd58ae0b5f9cf24fe19d903a577ea4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v36 6/8] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f8..4ba8086 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v36/v36-0007-Add-TAP-test-for-streaming-vs.-DDL.patch

From 79d053d94ca951a99314897da5e7b9d35f32a35b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v36 7/8] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check replicated data after streaming through multiple DDL and rollbacks');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v36/v36-0008-Add-streaming-option-in-pg_dump.patch

From bf598b493f34e9f9acf4ebf226bb711e005742fd Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v36 8/8] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 94459b3..f69d64c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char	   *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1

#443Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#442)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 20, 2020 at 12:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me know what you think of the changes?

I have reviewed the changes and looks fine to me.

Thanks, I am planning to start committing a few of the infrastructure
patches (especially the first two) by early next week, as we have
resolved all the open issues and done an extensive review of the entire
patch set. In the attached version, there is a slight change in one
of the commit messages as compared to the previous version. I would
like to briefly describe the first two patches for the sake of
convenience. Let me know if you or anyone else sees any problems with
these.

The first patch in the series allows us to WAL-log the subtransaction
and top-level XID association. The logical decoding infrastructure
needs to know which top-level transaction a subxact belongs to, in
order to decode all the changes. Until now that might be delayed until
commit, due to the caching (PGPROC_MAX_CACHED_SUBXIDS), preventing
features that require incremental decoding. So we now also write the
assignment info into WAL immediately, as part of the next WAL record
(to minimize overhead), but only when *wal_level=logical*. We cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record, as that is still
required to avoid overflow of the hot standby snapshot.
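
For a concrete picture of what this buys the decoder, here is a sketch
(illustrative, not a verbatim excerpt) of the decode-side handling;
ReorderBufferAssignChild() is the existing reorderbuffer entry point,
and the top-XID accessor is assumed to be provided by the patch:

	/*
	 * Sketch: if the WAL record carries the top-level XID of the
	 * subtransaction that wrote it, record the subxact -> top-level
	 * association immediately, instead of waiting for an
	 * XLOG_XACT_ASSIGNMENT record that may only appear at commit.
	 */
	TransactionId xid = XLogRecGetXid(record);	/* xact that wrote the record */
	TransactionId txid = XLogRecGetTopXid(record);	/* top-level XID, if any */

	if (TransactionIdIsValid(txid))
		ReorderBufferAssignChild(ctx->reorder, txid, xid, buf.origptr);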

Pushed this patch.

The patch set required a rebase after the binary format option support
was committed for the CREATE SUBSCRIPTION command. I have rebased the
patch set on the latest head and also added a test case to test
streaming in binary format.

While going through commit 9de77b5453, I noticed the change below:

@@ -424,6 +424,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
PQfreemem(pubnames_literal);
pfree(pubnames_str);

+       if (options->proto.logical.binary &&
+           PQserverVersion(conn->streamConn) >= 140000)
+           appendStringInfoString(&cmd, ", binary 'true'");
+

Now, the corresponding change in this patch series is as follows:

@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
appendStringInfo(&cmd, "proto_version '%u'",
options->proto.logical.proto_version);

+ if (options->proto.logical.streaming)
+ appendStringInfo(&cmd, ", streaming 'on'");
+

I think we also need a version check similar to commit 9de77b5453 to
ensure that we send the new option only when connected to a newer
version (>=14) primary server.
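
For reference, a sketch of the combined check being suggested, mirroring
the binary-option precedent from commit 9de77b5453 quoted above
(illustrative only):

	if (options->proto.logical.streaming &&
		PQserverVersion(conn->streamConn) >= 140000)
		appendStringInfoString(&cmd, ", streaming 'on'");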

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#444Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#443)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 20, 2020 at 2:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we also need a version check similar to commit 9de77b5453 to
ensure that we send the new option only when connected to a newer
version (>=14) primary server.

I have changed that in the attached patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v37.tar (application/x-tar)
v37/v37-0001-WAL-Log-invalidations-at-command-end-with-wal_le.patch

From 2df4ba84a9bb92c4fa0cbd92593cab9d48a52762 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v37 1/8] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end use a new xlog record type
XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in the top-level transaction,
and then executed during replay.  This obviates the need to decode the
invalidations as part of a commit record.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 10 +++++
 src/backend/access/transam/xact.c               | 17 ++++++++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 52 ++++++++++++++++++----
 src/backend/utils/cache/inval.c                 | 55 +++++++++++++++++++++++
 src/include/access/xact.h                       | 13 +++++-
 src/include/replication/reorderbuffer.h         |  3 ++
 src/include/utils/inval.h                       |  2 +
 8 files changed, 177 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..68aa994 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd4c3cf..d4f7c29 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1224,6 +1224,16 @@ RecordTransactionCommit(void)
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
 
+	/*
+	 * Log pending invalidations for logical decoding of in-progress
+	 * transactions.  Normally for DDLs, we log this at each command end;
+	 * however, for certain cases where we directly update the system table
+	 * without a transaction block, the invalidations are not logged until
+	 * this point.
+	 */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
 	nchildren = xactGetCommittedChildren(&children);
@@ -6022,6 +6032,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we ignore this for now; what matters are the invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..7153eba 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions;
+				 * otherwise, accumulate them so that they can be processed at
+				 * commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 449327a..ce6e621 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -856,6 +856,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2201,7 +2204,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases we skip processing the
+ * transaction (see ReorderBufferForget), but we still need to execute all
+ * the invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2212,17 +2219,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2250,6 +2275,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark the top-level transaction as having catalog changes too if one
+	 * of its children has, so that ReorderBufferBuildTupleCidHash can
+	 * conveniently check just the top-level transaction and decide whether
+	 * to build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..edd9077 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *	CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -1094,6 +1098,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1510,49 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end or at commit time if any invalidations
+ * are pending.
+ */
+void
+LogLogicalInvalidations()
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..ac3f5e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 019bd38..1055e99 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index bc5081c..463888c 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -61,4 +61,6 @@ extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
+
+extern void LogLogicalInvalidations(void);
 #endif							/* INVAL_H */
-- 
1.8.3.1

v37/v37-0002-Extend-the-logical-decoding-output-plugin-API-wi.patch

From ef81a97a30e9285d204d2779995b3c51b2ff0c41 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v37 2/8] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, to support streaming
of changes for large transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk of
changes streamed for a particular toplevel transaction.
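
As a hypothetical illustration of what a downstream consumer of these
callbacks might do (none of these names come from the patch), changes
arriving between stream_start and stream_stop can be buffered per
(sub)transaction xid, so that a later stream_abort simply discards the
buffered changes of the aborted (sub)transaction; the commit path, which
would apply the buffer, is omitted for brevity:

#include <stdlib.h>
#include <string.h>

typedef struct BufferedChange
{
	unsigned int xid;			/* (sub)transaction that produced it */
	char	   *payload;
	struct BufferedChange *next;
} BufferedChange;

static BufferedChange *pending = NULL;

/* buffer a change received between stream_start and stream_stop */
static void
on_stream_change(unsigned int xid, const char *payload)
{
	BufferedChange *c = malloc(sizeof(BufferedChange));

	c->xid = xid;
	c->payload = strdup(payload);
	c->next = pending;
	pending = c;
}

/* discard everything buffered for an aborted (sub)transaction */
static void
on_stream_abort(unsigned int xid)
{
	BufferedChange **p = &pending;

	while (*p)
	{
		if ((*p)->xid == xid)
		{
			BufferedChange *dead = *p;

			*p = dead->next;
			free(dead->payload);
			free(dead);
		}
		else
			p = &(*p)->next;
	}
}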

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 123 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 +++++++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  59 +++++
 6 files changed, 825 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..4a71826 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,97 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93c..18116c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting.  At that point, the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may cross the memory limit before having
+    decoded a complete tuple, e.g. having decoded a toast table insert but not
+    the corresponding main table insert.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be..6ee59bd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. We however enable streaming when at least one
+	 * of the methods is enabled, so that we can easily identify missing
+	 * methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475..deef318 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..2d9aa11 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1055e99..42bc817 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -387,6 +435,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

v37/v37-0003-Implement-streaming-mode-in-ReorderBuffer.patch

From 6ae620db5ba3c5f42bff414c40d1749c6f5483e2 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v37 3/8] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast chunk or a speculative insert, we spill to disk because
we cannot generate the complete tuple to stream.  As soon as we get the
complete tuple, we stream the transaction including the serialized
changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

Now that we can stream in-progress transactions, concurrent aborts may
cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. The decoding logic on the
receipt of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.
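
In outline, that handling has roughly the following shape; this is only a
sketch of the mechanism described above, not the patch's actual code
(stream_in_progress_transaction is a hypothetical stand-in):

	PG_TRY();
	{
		stream_in_progress_transaction(txn);	/* hypothetical helper */
	}
	PG_CATCH();
	{
		ErrorData  *errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/* concurrently aborted; stop decoding this xact gracefully */
			FlushErrorState();
			FreeErrorData(errdata);
		}
		else
		{
			FreeErrorData(errdata);
			PG_RE_THROW();
		}
	}
	PG_END_TRY();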

We have a ReorderBufferTXN pointer in each ReorderBufferChange, by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.
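
With test_decoding, that option is exercised by passing the 'stream-changes'
option to pg_logical_slot_get_changes(), as in the regression test and the
test-decoding documentation added below.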

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/stream.out       |  40 +
 contrib/test_decoding/expected/truncate.out     |   6 +
 contrib/test_decoding/sql/stream.sql            |  21 +
 contrib/test_decoding/sql/truncate.sql          |   1 +
 contrib/test_decoding/test_decoding.c           |  13 +
 doc/src/sgml/logicaldecoding.sgml               |   9 +-
 doc/src/sgml/test-decoding.sgml                 |  22 +
 src/backend/access/heap/heapam.c                |  13 +
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/access/index/genam.c                |  53 ++
 src/backend/access/table/tableam.c              |   8 +
 src/backend/access/transam/xact.c               |  19 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/logical.c       |  10 +
 src/backend/replication/logical/reorderbuffer.c | 969 +++++++++++++++++++++---
 src/include/access/heapam_xlog.h                |   1 +
 src/include/access/tableam.h                    |  55 ++
 src/include/access/xact.h                       |   4 +
 src/include/replication/logical.h               |   1 +
 src/include/replication/reorderbuffer.h         |  56 +-
 21 files changed, 1256 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..bdf7002
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,40 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+ count 
+-------
+   157
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..24d41b1
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 4a71826..bb3d9f3 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool		enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
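
To make the negotiation explicit, here is a minimal sketch of a hypothetical
output plugin startup callback (plugin and option names are made up; only the
final line mirrors the test_decoding change above). The core enables
ctx->streaming only when the plugin provides all the stream_* callbacks, so a
plugin can further restrict it but never force it on:

    static void
    my_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
                      bool is_init)
    {
        /* hypothetical: value parsed from a plugin option, as above */
        bool        enable_streaming = true;

        /* the plugin may only AND in its own choice, never set it */
        ctx->streaming &= enable_streaming;
    }
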
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 18116c8..98b47b0 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
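
To make the documented rule concrete, a short sketch of the only supported
access pattern from an output plugin (the relation OID and lock level here
are illustrative placeholders):

    /* scan a (user) catalog table via the systable_* APIs only */
    Relation    rel = table_open(MyCatalogRelationId, AccessShareLock);
    SysScanDesc scan = systable_beginscan(rel, InvalidOid, false,
                                          NULL, 0, NULL);
    HeapTuple   tup;

    while ((tup = systable_getnext(scan)) != NULL)
    {
        /* ... inspect the tuple ... */
    }

    systable_endscan(scan);
    table_close(rel, AccessShareLock);
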
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of the in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4c..33a4580 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at tableam
+	 * level API but this is called from many places so we need to ensure it
+	 * here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1945,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
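
Only insert records need this flag, because new toast data always arrives as
plain inserts into the toast relation (updates of the main table rewrite the
toast chunks as fresh inserts too). On the decoding side the flag is simply
read back from the record, as DecodeInsert does below:

    /* sketch of the consumer side, see DecodeInsert in decode.c */
    ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
                             change,
                             xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
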
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
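
For readers wondering when the unresolved case can legitimately happen, a
hypothetical scenario (mirroring the comment added to
ResolveCminCmaxDuringDecoding later in this patch):

    /*
     * Streaming an in-progress transaction:
     *
     *   BEGIN;
     *   INSERT INTO some_user_catalog ...;  -- streamed before next command
     *   TRUNCATE some_user_catalog;         -- writes the combocid mapping
     *
     * While applying the INSERT we may probe a (relfilenode, ctid) whose
     * cmin/cmax is recorded only by the not-yet-decoded TRUNCATE; "from the
     * future" is then the correct answer, not corruption.
     */
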
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted.  We can't directly use
+ * TransactionIdDidAbort, because after a crash such a transaction might not
+ * have been marked as aborted.  See detailed comments in xact.c where the
+ * variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
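
A caller above the systable layer can then distinguish a concurrent abort
from any other error by its SQLSTATE; a minimal sketch of that pattern
(essentially what ReorderBufferProcessTXN does later in this series):

    MemoryContext cxt = CurrentMemoryContext;

    PG_TRY();
    {
        tup = systable_getnext(scan);   /* may ERROR on concurrent abort */
    }
    PG_CATCH();
    {
        ErrorData  *edata;

        MemoryContextSwitchTo(cxt);
        edata = CopyErrorData();

        if (edata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
            FlushErrorState();          /* stop streaming gracefully */
        else
            PG_RE_THROW();
    }
    PG_END_TRY();
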
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29..a61e279 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -235,6 +235,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..99722ee 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure this,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
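
Putting the pieces together, the protocol around these two variables is
(function names as introduced elsewhere in this series):

    SetupCheckXidLive(xid);         /* before decoding changes of xid */
    scan = systable_beginscan(...); /* sets bsysscan = true */
    tup = systable_getnext(scan);   /* HandleConcurrentAbort() may ERROR */
    systable_endscan(scan);         /* clears bsysscan */
    ResetLogicalStreamingState();   /* on (sub)transaction abort */
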
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7153eba..2010d5a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6ee59bd..8deff89 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ce6e621..c469536 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes so if we have a partial change like toast
+ * table insert or speculative then we mark such a 'txn' so that it can't be
+ * streamed.  We also ensure that if the changes in such a 'txn' are above
+ * logical_decoding_work_mem threshold then we stream them as soon as we have a
+ * complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get the insert or update on the
+	 * main table (both updates and inserts perform the toast table inserts
+	 * first).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * Speculative confirm change must be preceded by speculative insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it was serialized before and the changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for streaming such a transaction as soon as we get the
+	 * complete change is that it has already reached the memory threshold
+	 * but could not be streamed earlier because of its incomplete changes.
+	 * Delaying it further would only increase its apply lag.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed when we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes we detected that this transaction
+	 * has been aborted, so there is no point in collecting further changes
+	 * for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
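
As an illustration, a row with out-of-line toast data arrives as several
changes, and the transaction becomes streamable again only once the
main-table change completes the set (hypothetical sequence):

    /*
     *   toast INSERT      => RBTXN_HAS_TOAST_INSERT set (not streamable)
     *   toast INSERT      => flag remains set
     *   main-table INSERT => flag cleared, the change is complete; if the
     *                        txn was serialized meanwhile, stream it now
     */
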
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -764,6 +884,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1310,6 +1468,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1502,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they originally happened inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is,
+	 * when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1737,178 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for the concurrent abort check.
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has catalog updates, we might decode a tuple using the
+ * wrong catalog version.  So, to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction to which the current
+ * change belongs.  Then, during catalog scans, we check the status of that
+ * xid, and if it has aborted we report a specific error so that we can stop
+ * streaming the current transaction and discard the already streamed changes
+ * on such an error.  We might have already streamed some changes for the
+ * aborted (sub)transaction, but that is fine because when we decode the
+ * abort we will stream an abort message to truncate the changes on the
+ * subscriber.
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive then
+	 * there is nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid has aborted; that happens during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse them when sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.  This resets the TXN such that it
+ * can be used to stream the remaining data of the transaction being
+ * processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1931,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1947,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't invoke the stream_start callback before processing
+			 * the first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2053,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2090,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2119,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2152,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2164,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2195,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2241,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2249,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, send the last message for this set of
+		 * changes depending upon streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2295,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2334,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
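
Put together, the callback sequence an output plugin observes for a large
transaction is therefore (sketch; one start/stop round per in-memory batch):

    stream_start(txn);
        stream_change(txn, rel, change);    /* repeated */
        stream_message(txn, ...);           /* transactional messages */
        stream_truncate(txn, ...);
    stream_stop(txn);
    /* ... more start/stop rounds as the memory limit is reached ... */
    stream_commit(txn, commit_lsn);         /* or stream_abort(txn, lsn) */
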
 
 /*
@@ -1931,6 +2460,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could load
+		 * the cache as per the current transaction's view (consider DDL that
+		 * happened in this transaction).  We don't want the decoding of
+		 * future transactions to use those cache entries, so execute the
+		 * invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2545,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2631,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2680,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2702,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2715,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming supported, update the total size in top level as well. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2388,6 +2971,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting of subtransactions when streaming, so it's always 0).  But we
+ * can simply iterate over the limited number of toplevel transactions.
+ *
+ * Note that we skip transactions that contain incomplete changes.  There is
+ * scope for optimization here, in that we could select the largest
+ * transaction that has complete changes.  But that would make the code and
+ * design quite complex, which might not be worth the benefit.  If we ever
+ * plan to stream transactions that contain incomplete changes, we need to
+ * find a way to partially stream/truncate the transaction changes in memory
+ * and build a mechanism to partially truncate the spilled files.
+ * Additionally, whenever we partially stream a transaction we need to
+ * remember the last streamed LSN, so that next time we can restore from that
+ * segment and offset in the WAL.  And as we stream the changes from the top
+ * transaction but restore them per subtransaction, we even need to remember
+ * the subxact from which we streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
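
The caller below uses this as the preferred eviction strategy; condensed,
the decision in ReorderBufferCheckMemoryLimit (next hunk) is:

    while (rb->size >= logical_decoding_work_mem * 1024L)
    {
        if (ReorderBufferCanStartStreaming(rb) &&
            (txn = ReorderBufferLargestTopTXN(rb)) != NULL)
            ReorderBufferStreamTXN(rb, txn);        /* evict by streaming */
        else
            ReorderBufferSerializeTXN(rb,
                                      ReorderBufferLargestTXN(rb));
    }
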
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +3047,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3151,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3363,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * We can't start streaming immediately, even if streaming is enabled,
+	 * while we are re-decoding an already-processed portion of the WAL after
+	 * a restart.
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all subtransactions to the
+	 * snapshot's xip array via SnapBuildCommittedTxn, we can't do that here;
+	 * instead we add them to the subxip array via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in the subtransactions decoded so
+	 * far to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run.  We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through the subxacts again).  In fact, we must not do that, as
+		 * we may be using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because after the last
+		 * streaming run we might have acquired some new subtransactions, so
+		 * we need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to the output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
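
For readers skimming the diff, here is a condensed, hand-written sketch
(not the patch's exact code) of how the two functions above are driven
from the memory-limit check added in 0001; ReorderBufferLargestTopTXN,
ReorderBufferLargestTXN and ReorderBufferSerializeTXN are helpers from
the earlier parts of the series, and logical_work_mem is the GUC:

static void
ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
	ReorderBufferTXN *txn;

	/* bail out if we are still below the memory limit */
	if (rb->size < logical_work_mem * 1024L)
		return;

	if (ReorderBufferCanStartStreaming(rb))
	{
		/* stream the toplevel transaction with the largest total_size */
		txn = ReorderBufferLargestTopTXN(rb);
		ReorderBufferStreamTXN(rb, txn);
	}
	else
	{
		/* otherwise spill the largest transaction to disk */
		txn = ReorderBufferLargestTXN(rb);
		ReorderBufferSerializeTXN(rb, txn);
	}
}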
@@ -2813,7 +3593,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4302,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4592,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0d28f01..b188427 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
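
To see the invariant these checks enforce, here is a hand-written
illustration (not part of the patch) of the decoding-time protocol: the
decoder sets CheckXidAlive to the XID of the in-progress transaction
being applied, and the systable_* wrappers set bsysscan around catalog
scans, so only catalog access passes the checks:

/* while applying changes of an in-progress transaction: */
CheckXidAlive = txn->xid;

bsysscan = true;			/* systable_beginscan() marks a catalog scan */
/* ... tableam calls for catalog relations are allowed here ... */
bsysscan = false;			/* systable_endscan() clears the flag */

/*
 * A direct tableam call made at this point, outside a systable scan,
 * fails with "unexpected ... call during logical decoding".
 */

CheckXidAlive = InvalidTransactionId;	/* reset once decoding finishes */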
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index ac3f5e3..5f767eb 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected a concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1
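
As a quick illustration of how the new txn_flags above are meant to be
used (a hand-written sketch, not the patch's exact code): toast inserts
and speculative inserts mark the toplevel transaction as containing an
incomplete change, which must block streaming until the matching
main-table insert or spec-confirm arrives:

/* roughly, in ReorderBufferQueueChange(): */
if (toast_insert)
	toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
	toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
else
	/* a complete change clears the partial-change markers */
	toptxn->txn_flags &= ~(RBTXN_HAS_TOAST_INSERT | RBTXN_HAS_SPEC_INSERT);

/* and when picking a transaction to stream: */
if (rbtxn_has_incomplete_tuple(txn))
	return;				/* can't stream yet; keep accumulating */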

v37/v37-0004-Extend-the-BufFile-interface-for-the-streaming-o.patch

From 6a92ffeaca3bed9c137327970f3f8faed8c3a2b9 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v37 4/8] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up to
a particular offset.  Extend the BufFileSeek API to support the SEEK_END
case.  Add an option to provide a mode while opening the shared BufFiles
instead of always opening in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2..6c97f68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349..c08ff4f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single
+ * backend when the corresponding files need to survive across transactions
+ * and need to be opened and closed multiple times.  Such files need to be
+ * created as a member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the size of the last file to determine the end offset of
+			 * that file, i.e. the end of the BufFile.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files up to the fileno which we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files other than the one at fileno can be deleted directly.  The
+		 * file at fileno can also be deleted if the offset is 0, unless it
+		 * is the first file, which we always retain.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
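
A short usage sketch (not from the patch) showing how the new SEEK_END
support and truncation fit together on the apply side: given an open
shared BufFile *fd, remember the end-of-file position before writing a
subtransaction's changes, then truncate back to it if that subtransaction
aborts; BufFileTell() is the existing position-query API:

int		fileno;
off_t	offset;

/* position at the current end of the shared BufFile */
if (BufFileSeek(fd, 0, 0, SEEK_END) != 0)
	elog(ERROR, "could not seek in temporary file");
BufFileTell(fd, &fileno, &offset);	/* where the subxact's changes begin */

/* ... append the subtransaction's changes ... */

/* on subxact abort, discard everything written since the marker */
BufFileTruncateShared(fd, fileno, offset);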
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * This interface can also be used if the temporary files are used only by
+ * a single backend but need to be opened and closed multiple times, and
+ * the underlying files need to survive across transactions.  For such
+ * cases, the dsm segment 'seg' should be passed as NULL.  We remove such
+ * files on proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * No fileset can have been registered before we register the
+			 * fileset cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function invoked on process exit.  It walks the list of all
+ * registered sharedfilesets and deletes the underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is following the dsm-based cleanup, we don't maintain
+	 * the filesetlist, so just return.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
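
A minimal sketch (not part of the patch) of the single-backend usage
pattern enabled here; passing a NULL segment switches cleanup from DSM
detach to proc exit, and the filename "16384-1000-changes" is just an
example (assuming the usual fd.h/sharedfileset.h includes):

SharedFileSet fileset;
File		file;

SharedFileSetInit(&fileset, NULL);	/* NULL seg: proc-exit cleanup */

file = SharedFileSetCreate(&fileset, "16384-1000-changes");
FileClose(file);

/* ... possibly in a later transaction ... */
file = SharedFileSetOpen(&fileset, "16384-1000-changes", O_RDWR);
FileClose(file);

SharedFileSetDeleteAll(&fileset);	/* or rely on the proc-exit callback */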
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

v37/v37-0005-Add-support-for-streaming-to-built-in-replicatio.patch

From ae0e9c21312176197a0bde750e69020cb9118c00 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v37 5/8] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions, by spilling the data to disk and then
replaying it on commit.

However, we must explicitly disable streaming during replication slot
creation, even if the plugin supports it. We don't need to replicate
the changes accumulated during this phase, and moreover we don't have
a replication connection open, so we have nowhere to send the data
anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 946 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 19 files changed, 2060 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal> and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index e6afb32..153c562 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68..479e3ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e905723..a6101ac 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 2b1356e..c205950 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -732,3 +759,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
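
Putting the new messages together, a streamed transaction (toplevel XID
1234 with a subxact 1235 that later aborts) flows roughly like this on
the wire; the trace is illustrative, not from the patch:

S 1234 1      -- STREAM START, first segment for xid 1234
I 1235 ...    -- INSERT, prefixed by the XID of the subxact it belongs to
E             -- STREAM STOP, end of this segment
A 1234 1235   -- STREAM ABORT of subxact 1235 (sent between segments)
S 1234 0      -- STREAM START, continuation segment
U 1234 ...    -- further changes, here of the toplevel xact itself
E             -- STREAM STOP
c 1234 ...    -- STREAM COMMIT of the toplevel transaction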
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 407eee3..7071309 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, we have to handle aborts of both
+ * the toplevel transaction and of subtransactions. This is achieved by
+ * tracking offsets for subtransactions, which are then used to truncate the
+ * file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides automatic cleanup on error, and (c) it allows the
+ * files to survive across local transactions so that they can be opened and
+ * closed at stream start and stop.  We use the SharedFileSet infrastructure
+ * because without it the files would be deleted when closed, and keeping the
+ * stream files open across start/stop stream would consume a lot of memory
+ * (more than 8K per file).  Moreover, without SharedFileSet we would need to
+ * invent a new way to pass filenames to the BufFile APIs so that the desired
+ * file could be reopened across multiple stream open calls for the same
+ * transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid, we create this entry
+ * in the xidhash, create the streaming file, and store the fileset handle.
+ * The subxact file is created iff there is any subxact info under this xid.
+ * On subsequent streams for the xid, this entry is used to look up the
+ * corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table storing the streaming xid information along with the shared
+ * filesets for the stream and subxact files.  On every stream start we need
+ * to open the xid's files, and for that we need the shared fileset handle;
+ * storing it in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the change to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -608,17 +748,323 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * An ORIGIN message can only come inside a remote transaction or a
+	 * streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; it will be committed on stream
+	 * stop.  We need the transaction for handling the BufFile, used for
+	 * serializing the streamed data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
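
To make this concrete, for a transaction exceeding the memory limit the
apply worker sees a sequence of messages roughly like the following (a
sketch only; the letters are the message codes dispatched by apply_dispatch):

    S  stream start (first segment)   -- create spool file and xidhash entry
    I/U/D ...                         -- changes appended to the spool file
    E  stream stop                    -- flush subxact info, close the file
    S  stream start                   -- reopen the file, re-read subxact info
    ...
    E  stream stop
    c  stream commit                  -- replay the spooled changes locally

A rolled-back subtransaction instead produces an 'A' (stream abort) message,
which truncates the spool file to the offset recorded for that subxact.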
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -631,6 +1077,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -646,6 +1095,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -682,6 +1134,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -797,6 +1252,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -938,6 +1396,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1308,6 +1769,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1446,6 +1910,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1558,6 +2038,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  It is reset at each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1662,7 +2150,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1935,6 +2423,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option has changed. The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because subscription's streaming option were changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1967,6 +2469,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for the top-level transaction by now */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
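
For reference, the resulting layout of the <subid>-<xid>.subxacts file is
just the counter followed by the array (a descriptive sketch; SubXactInfo
is declared in an earlier part of this patch, so the field list below is
inferred from how it is used here):

    nsubxacts                          -- number of entries
    SubXactInfo subxacts[nsubxacts]    -- one entry per subxact, roughly:
        TransactionId xid;             --   subtransaction XID
        int           fileno;          --   BufFile segment of first change
        off_t         offset;          --   offset of first change in segment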
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need it for the whole duration of the stream, so that we can keep
+	 * adding subtransaction info to it.  On stream stop we flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
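
If the XXX above turns out to hold (subxact XIDs always appended in
increasing order), the linear scan could become a binary search. A
self-contained sketch only, not part of this patch, assuming plain integer
comparison is safe for the subxacts of a single toplevel transaction:

    #include <stdlib.h>

    typedef unsigned int TransactionId;

    /* mirrors the patch's SubXactInfo, declared elsewhere in the series */
    typedef struct SubXactInfo
    {
        TransactionId xid;
        int           fileno;
        long          offset;       /* off_t in the patch */
    } SubXactInfo;

    static int
    subxact_info_cmp(const void *key, const void *elem)
    {
        TransactionId kxid = *(const TransactionId *) key;
        TransactionId exid = ((const SubXactInfo *) elem)->xid;

        return (kxid < exid) ? -1 : (kxid > exid) ? 1 : 0;
    }

    /* would replace the backward loop in subxact_info_add */
    static int
    subxact_known(const SubXactInfo *subxacts, size_t nsubxacts,
                  TransactionId xid)
    {
        return bsearch(&xid, subxacts, nsubxacts,
                       sizeof(SubXactInfo), subxact_info_cmp) != NULL;
    }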
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so treat ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER | HASH_FIND,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the buffiles under the logical streaming context so that
+	 * those files remain available until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the end of the file because we always
+		 * append the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: the length (not counting the
+ * length field itself), the action code (identifying the message type) and
+ * the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
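
The resulting spool file record framing, which apply_handle_stream_commit
reads back above, is therefore (descriptive sketch):

    int  len;               -- sizeof(action) + remaining message size;
                            -- len itself is not counted
    char action;            -- message type byte ('I', 'U', 'D', ...)
    char payload[len - 1];  -- message contents, starting after the XID
                            -- already consumed from the buffer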
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2133,6 +3068,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit time,
+ * and with streamed transactions the commit order may be different from
+ * the order the transactions are sent in.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this,
+ * we maintain a list of XIDs (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
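
With this option in place, a client requests streaming through the regular
pgoutput option list, for example (illustrative slot and publication names;
proto_version 2 is defined later in this patch):

    START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
        (proto_version '2', publication_names '"tap_pub"', streaming 'on')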
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient protocol version, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's the top-level transaction or not (we have already sent
+	 * that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may never be applied (on abort), or may be
+	 * applied in a commit order that we don't know at this point (and the
+	 * regular transactions won't see their effects until then).
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
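+/*
+ * Send the start of a streamed chunk of an in-progress transaction.
+ */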
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
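+/*
+ * Send the end of the current streamed chunk.
+ */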
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions, so the
+ * linear search of the list is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid in the rel sync entry for which we have already sent the schema
+ * of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;                 /* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
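
On the subscriber side this column maps to a subscription option (the DDL
support lives in another part of this series), so usage would look roughly
like:

    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION tap_pub
        WITH (streaming = on);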
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 287288a..3cc4878 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -93,25 +97,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
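
To summarize the protocol additions declared above, with payloads sketched
from the read/write function signatures (exact field widths are whatever
those functions implement):

    'S'  stream start   : xid, first_segment flag
    'E'  stream stop    : (no payload)
    'c'  stream commit  : xid plus commit metadata (commit_lsn, end_lsn,
                          committime)
    'A'  stream abort   : xid, subxid (subxid == xid for a toplevel abort)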
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes were not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL changes and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check changes from rolled-back subtransactions were not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v37/v37-0006-Enable-streaming-for-all-subscription-TAP-tests.patch

From 30e94a7bffb4101d713fcd9f7339cc14b3837a00 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v37 6/8] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f8..4ba8086 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v37/v37-0007-Add-TAP-test-for-streaming-vs.-DDL.patch

From 0cab6e06e921acccb955bc00d2a5da5ce74f8ed8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v37 7/8] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v37/v37-0008-Add-streaming-option-in-pg_dump.patch

From d214d26de925b5f2f0c007df38f66b2ba59ae18b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v37 8/8] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 94459b3..f69d64c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char	   *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1

#445Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#444)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 20, 2020 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jul 20, 2020 at 2:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 20, 2020 at 12:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me know what you think of the changes.

I have reviewed the changes and looks fine to me.

Thanks, I am planning to start committing a few of the infrastructure
patches (especially first two) by early next week as we have resolved
all the open issues and done an extensive review of the entire
patch-set. In the attached version, there is a slight change in one
of the commit messages as compared to the previous version. I would
like to describe in brief the first two patches for the sake of
convenience. Let me know if you or anyone else sees any problems with
these.

The first patch in the series allows us to WAL-log subtransaction and
top-level XID association. The logical decoding infrastructure needs
to know which top-level transaction the subxact belongs to, in order
to decode all the changes. Until now that might be delayed until
commit, due to the caching (PGPROC_MAX_CACHED_SUBXIDS), preventing
features requiring incremental decoding. So we also write the
assignment info into WAL immediately, as part of the next WAL record
(to minimize overhead), but only when *wal_level=logical*. We cannot
remove the existing XLOG_XACT_ASSIGNMENT WAL record as that is
required for avoiding overflow in the hot standby snapshot.
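
In code terms, the decoding side of this is roughly the following
(just a sketch, not a hunk from the patch; XLogRecGetTopXid() stands
in for however the record actually exposes the top-level xid):

	TransactionId xid = XLogRecGetXid(r);
	TransactionId top_xid = XLogRecGetTopXid(r);	/* assumed accessor */

	/* associate the subxact with its top-level xact as soon as we see it */
	if (TransactionIdIsValid(top_xid))
		ReorderBufferAssignChild(ctx->reorder, top_xid, xid, buf->origptr);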

Pushed this patch.

The patch set required a rebase after the binary format option support
was committed for the CREATE SUBSCRIPTION command. I have rebased the
patch set on the latest head and also added a test case to exercise
streaming in binary format.

While going through commit 9de77b5453, I noticed the below change:

@@ -424,6 +424,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
PQfreemem(pubnames_literal);
pfree(pubnames_str);

+       if (options->proto.logical.binary &&
+           PQserverVersion(conn->streamConn) >= 140000)
+           appendStringInfoString(&cmd, ", binary 'true'");
+

The corresponding change in this patch series is as below:

@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
appendStringInfo(&cmd, "proto_version '%u'",
options->proto.logical.proto_version);

+ if (options->proto.logical.streaming)
+ appendStringInfo(&cmd, ", streaming 'on'");
+

I think we also need a version check similar to commit 9de77b5453 to
ensure that we send the new option only when connected to a newer
version (>=14) primary server.

I have changed that in the attached patch.
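
Roughly, the check now looks like this (a sketch of the idea, modeled
on the two quoted hunks; see the attached patch for the actual code in
libpqrcv_startstreaming):

+       if (options->proto.logical.streaming &&
+           PQserverVersion(conn->streamConn) >= 140000)
+           appendStringInfoString(&cmd, ", streaming 'on'");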

There was one compiler warning in release mode in the last version of
0004, so I am attaching a new version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v38.tar (application/x-tar)

v38/v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le.patch

From 2df4ba84a9bb92c4fa0cbd92593cab9d48a52762 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v38 1/8] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end use a new xlog record type
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in the top-level
transaction, and then executed during replay.  This obviates the need
to decode the invalidations as part of a commit record.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 10 +++++
 src/backend/access/transam/xact.c               | 17 ++++++++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 52 ++++++++++++++++++----
 src/backend/utils/cache/inval.c                 | 55 +++++++++++++++++++++++
 src/include/access/xact.h                       | 13 +++++-
 src/include/replication/reorderbuffer.h         |  3 ++
 src/include/utils/inval.h                       |  2 +
 8 files changed, 177 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..68aa994 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd4c3cf..d4f7c29 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1224,6 +1224,16 @@ RecordTransactionCommit(void)
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
 
+	/*
+	 * Log pending invalidations for logical decoding of in-progress
+	 * transactions.  Normally for DDLs, we log this at each command end,
+	 * however, for certain cases where we directly update the system table
+	 * without a transaction block, the invalidations are not logged till this
+	 * time.
+	 */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
 	nchildren = xactGetCommittedChildren(&children);
@@ -6022,6 +6032,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..7153eba 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions,
+				 * otherwise, accumulate them so that they can be processed at
+				 * the commit time.
+				 */
+				if (!ctx->fast_forward)
+				{
+					if (TransactionIdIsValid(xid))
+					{
+						ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+													  invals->nmsgs, invals->msgs);
+						ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+														  buf->origptr);
+					}
+					else
+						ReorderBufferImmediateInvalidation(ctx->reorder,
+														   invals->nmsgs,
+														   invals->msgs);
+				}
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 449327a..ce6e621 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -856,6 +856,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2201,7 +2204,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases where we skip processing the
+ * transaction (see ReorderBufferForget), we need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2212,17 +2219,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2250,6 +2275,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark top-level transaction as having catalog changes too if one of its
+	 * children has so that the ReorderBufferBuildTupleCidHash can
+	 * conveniently check just top-level transaction and decide whether to
+	 * build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..edd9077 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *	CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -1094,6 +1098,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1510,49 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end or at commit time if any invalidations
+ * are pending.
+ */
+void
+LogLogicalInvalidations(void)
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..ac3f5e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 019bd38..1055e99 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index bc5081c..463888c 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -61,4 +61,6 @@ extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
+
+extern void LogLogicalInvalidations(void);
 #endif							/* INVAL_H */
-- 
1.8.3.1

v38/v38-0002-Extend-the-logical-decoding-output-plugin-API-wi.patch

From ef81a97a30e9285d204d2779995b3c51b2ff0c41 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v38 2/8] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, adding support for
streaming changes for large transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 123 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 +++++++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  59 +++++
 6 files changed, 825 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..4a71826 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,97 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93c..18116c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting.  At that point, the largest top-level transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may cross the memory limit before having
+    decoded a complete tuple, e.g. having decoded only the toast-table insert
+    but not the corresponding insert on the main table.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
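
To make the documented callback set more concrete, here is a condensed,
purely hypothetical sketch of the plugin side, modeled on the test_decoding
changes earlier in this patch. The plugin name and callback bodies are
illustrative (not part of the patch), the remaining stream callbacks follow
the same pattern, and the regular non-streaming callbacks a real plugin must
also register are omitted for brevity:

static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "stream start: xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* my_stream_stop, my_stream_abort, my_stream_commit and my_stream_change
 * are defined analogously to my_stream_start above. */

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* the five required streaming callbacks */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;

	/* stream_message_cb and stream_truncate_cb are optional; leaving them
	 * NULL means streamed messages and truncates are simply skipped */
}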
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be..6ee59bd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. We however enable streaming when at least one
+	 * of the methods is defined, so that we can easily identify missing
+	 * methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475..deef318 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..2d9aa11 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1055e99..42bc817 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -387,6 +435,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
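
To tie the new ReorderBuffer fields together, here is a purely illustrative
sketch (not code from the patch) of how the reorder buffer could drive the
callbacks for one streamed block; the function name and the single-change
simplification are hypothetical, and a real implementation iterates over the
queued changes:

static void
stream_block_sketch(ReorderBuffer *rb, ReorderBufferTXN *txn,
					Relation rel, ReorderBufferChange *change,
					XLogRecPtr first_lsn, XLogRecPtr last_lsn)
{
	/* demarcate the block, per the stream_start/stream_stop contract */
	rb->stream_start(rb, txn, first_lsn);

	/* one call per change inside the block */
	rb->stream_change(rb, txn, rel, change);

	rb->stream_stop(rb, txn, last_lsn);
}

Once the commit (or abort) record is eventually decoded, rb->stream_commit
(or rb->stream_abort) finishes the transaction, matching the callback
sequence shown in the logicaldecoding.sgml example above.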

v38/v38-0003-Implement-streaming-mode-in-ReorderBuffer.patch

From 6ae620db5ba3c5f42bff414c40d1749c6f5483e2 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v38 3/8] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast or speculative insert, we spill to disk because we cannot
generate and stream the complete tuple.  As soon as we get the complete
tuple, we stream the transaction including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. On receipt of such a
sqlerrcode, the decoding logic aborts the ongoing decoding and returns
gracefully.

We have a ReorderBufferTXN pointer in each ReorderBufferChange, by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/stream.out       |  40 +
 contrib/test_decoding/expected/truncate.out     |   6 +
 contrib/test_decoding/sql/stream.sql            |  21 +
 contrib/test_decoding/sql/truncate.sql          |   1 +
 contrib/test_decoding/test_decoding.c           |  13 +
 doc/src/sgml/logicaldecoding.sgml               |   9 +-
 doc/src/sgml/test-decoding.sgml                 |  22 +
 src/backend/access/heap/heapam.c                |  13 +
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/access/index/genam.c                |  53 ++
 src/backend/access/table/tableam.c              |   8 +
 src/backend/access/transam/xact.c               |  19 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/logical.c       |  10 +
 src/backend/replication/logical/reorderbuffer.c | 969 +++++++++++++++++++++---
 src/include/access/heapam_xlog.h                |   1 +
 src/include/access/tableam.h                    |  55 ++
 src/include/access/xact.h                       |   4 +
 src/include/replication/logical.h               |   1 +
 src/include/replication/reorderbuffer.h         |  56 +-
 21 files changed, 1256 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql
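
The concurrent-abort handling described above amounts to catching that
specific sqlerrcode around the change-replay loop. What follows is a
simplified, hypothetical sketch of the pattern only (the code in
reorderbuffer.c is more involved; 'oldcontext' stands for the caller's
memory context and is an assumption of this fragment):

	PG_TRY();
	{
		/* replay changes of the in-progress transaction ... */
	}
	PG_CATCH();
	{
		MemoryContext ecxt = MemoryContextSwitchTo(oldcontext);
		ErrorData  *errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/* expected: the transaction aborted concurrently, stop quietly */
			FlushErrorState();
			FreeErrorData(errdata);
		}
		else
		{
			/* unexpected error: restore the error context and re-throw */
			MemoryContextSwitchTo(ecxt);
			PG_RE_THROW();
		}
	}
	PG_END_TRY();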

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..bdf7002
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,40 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+ count 
+-------
+   157
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..24d41b1
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 4a71826..bb3d9f3 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 18116c8..98b47b0 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
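
As a hypothetical illustration of the systable_* requirement documented
above (the function name, relation OID and lock level are illustrative;
during decoding the scan runs under the historic snapshot installed by the
decoding machinery, and a NULL snapshot here falls back to genam's catalog
snapshot handling):

/* requires access/genam.h, access/table.h */
static void
scan_user_catalog(Oid reloid)
{
	Relation	rel;
	SysScanDesc scan;
	HeapTuple	tup;

	rel = table_open(reloid, AccessShareLock);

	/* sequential systable scan; the concurrent-abort check happens inside */
	scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);

	while (HeapTupleIsValid(tup = systable_getnext(scan)))
	{
		/* inspect the tuple ... */
	}

	systable_endscan(scan);
	table_close(rel, AccessShareLock);
}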
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+ <para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4c..33a4580 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at the
+	 * tableam API level, but this function is called from many places, so
+	 * we need to ensure it here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1945,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive
+ * transaction.
+ *
+ * Error out, if CheckXidAlive is aborted. We can't directly use
+ * TransactionIdDidAbort as after a crash such a transaction might not have
+ * been marked as aborted.  See detailed comments in xact.c where the
+ * variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29..a61e279 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -235,6 +235,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..99722ee 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching a tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for concurrent
+ * aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
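
As a hypothetical sketch of the intended lifecycle of these variables on the
decoding side (the patch places the actual logic in reorderbuffer.c;
PG_FINALLY is used here for brevity):

	/* guard catalog access while replaying a possibly-aborting xact */
	CheckXidAlive = txn->xid;

	PG_TRY();
	{
		/* replay changes; systable_* scans check CheckXidAlive ... */
	}
	PG_FINALLY();
	{
		/* always reset; on ERROR paths AbortTransaction() likewise calls
		 * ResetLogicalStreamingState() to clear it */
		CheckXidAlive = InvalidTransactionId;
	}
	PG_END_TRY();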
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7153eba..2010d5a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6ee59bd..8deff89 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ce6e621..c469536 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record a partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes, so if we have a partial change like a
+ * toast table insert or a speculative insert, we mark such a 'txn' so that it
+ * can't be streamed.  We also ensure that if the changes in such a 'txn'
+ * exceed the logical_decoding_work_mem threshold, we stream them as soon as
+ * we have a complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get the insert or update on the
+	 * main table (both update and insert write their toast chunks first).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * Speculative confirm change must be preceded by speculative insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it was serialized before and its changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for streaming such a transaction as soon as we get the
+	 * complete change is that it has already crossed the memory threshold
+	 * but could not be streamed earlier because of its incomplete changes.
+	 * Delaying it further would only increase the apply lag.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
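
To make the flag protocol concrete, here is a standalone sketch (hypothetical toy types, not part of the patch) of how RBTXN_HAS_TOAST_INSERT tracks completeness for a toasted row: the toast-chunk inserts set the bit, the main-table insert clears it, and only then does the transaction become eligible for streaming again.

#include <assert.h>
#include <stdbool.h>

#define RBTXN_HAS_TOAST_INSERT	0x0010
#define RBTXN_HAS_SPEC_INSERT	0x0020

typedef struct ToyTxn { unsigned int txn_flags; } ToyTxn;

/* Mirrors rbtxn_has_incomplete_tuple(): no partial change pending. */
static bool
streamable(const ToyTxn *txn)
{
	return (txn->txn_flags &
			(RBTXN_HAS_TOAST_INSERT | RBTXN_HAS_SPEC_INSERT)) == 0;
}

int
main(void)
{
	ToyTxn		top = {0};

	top.txn_flags |= RBTXN_HAS_TOAST_INSERT;	/* toast chunk #1 */
	top.txn_flags |= RBTXN_HAS_TOAST_INSERT;	/* toast chunk #2 */
	assert(!streamable(&top));	/* mid-tuple: must not stream now */

	top.txn_flags &= ~RBTXN_HAS_TOAST_INSERT;	/* main-table INSERT */
	assert(streamable(&top));	/* tuple complete: streaming allowed */
	return 0;
}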
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed once we cross the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes, we detected that this transaction
+	 * was aborted, so there is no point in collecting further changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -764,6 +884,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1310,6 +1468,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1502,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they were originally nested inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is,
+	 * when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1737,178 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set the xid for the concurrent-abort check.
+ *
+ * While streaming an in-progress transaction, the (sub)transaction might get
+ * aborted concurrently.  If the (sub)transaction has made catalog updates, we
+ * might then decode tuples using the wrong catalog version.  To detect such a
+ * concurrent abort, we set CheckXidAlive to the xid of the (sub)transaction
+ * this change belongs to.  During catalog scans we check the status of that
+ * xid, and if it has aborted we report a specific error so that we can stop
+ * streaming the current transaction and discard the changes streamed so far.
+ * Some changes of the aborted (sub)transaction may already have been sent
+ * downstream, but that is fine: when we decode the abort, we stream an abort
+ * message that truncates those changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the transaction is not committed yet.  We don't
+	 * check whether the xid has aborted; that will be detected during catalog
+	 * access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
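
The other half of this mechanism lives in the catalog-access paths (not visible in this hunk): while CheckXidAlive is set, scans verify that the xid is still in progress or has committed, and otherwise raise ERRCODE_TRANSACTION_ROLLBACK, which the PG_CATCH block in ReorderBufferProcessTXN treats as a concurrent abort. A sketch of the shape of that check, using standard transam helpers (illustrative; the exact call sites and wording in the patch may differ):

static inline void
HandleConcurrentAbortSketch(void)
{
	/*
	 * CheckXidAlive is valid only while streaming; if that xid is neither
	 * running nor committed, it must have aborted under us.
	 */
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}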
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Store the command id and snapshot at the end of the current stream so that
+ * we can reuse them while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN such that it
+ * can be used to stream the remaining data of the transaction being
+ * processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true, the data will be sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1931,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1947,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream_start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2053,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2090,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2119,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2152,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2164,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2195,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2241,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2249,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes; send the last message for this set
+		 * of changes, depending on the streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2295,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2334,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream the remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read, for both streamed
+ * and non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
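
From the output plugin's point of view, the streamed and non-streamed paths above translate into different callback sequences; the following summary is illustrative only, not code from the patch:

/*
 * Non-streamed transaction (memory limit never reached):
 *
 *     begin -> apply_change ... apply_change -> commit
 *
 * Large transaction streamed in two runs and then committed:
 *
 *     stream_start -> stream_change ... -> stream_stop   (eviction run)
 *     stream_start -> stream_change ... -> stream_stop   (final run, from
 *                                          ReorderBufferStreamCommit)
 *     stream_commit
 *
 * A streamed transaction that turns out to be aborted ends with
 * stream_abort instead, issued from ReorderBufferAbort() or
 * ReorderBufferForget().
 */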
 
 /*
@@ -1931,6 +2460,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could have
+		 * loaded the caches as per this transaction's view (consider DDLs
+		 * that happened in it).  We don't want the decoding of future
+		 * transactions to use those cache entries, so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2545,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2631,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2680,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit; the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we additionally track the total size at the
+ * toplevel transaction - subtransactions can't be streamed individually,
+ * so only toplevel transactions are candidates for streaming eviction,
+ * and for them we need the size including all subtransactions.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2702,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2715,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming is supported, update the top-level total size as well. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
 }
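
As a standalone illustration of this two-level accounting (hypothetical toy types, not the patch's structs): a change queued in a subtransaction is counted against the subtransaction's size, the toplevel total_size, and the buffer-wide size, and eviction drains all three.

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyTxn
{
	size_t		size;			/* changes in this (sub)transaction */
	size_t		total_size;		/* top-level only: includes subxacts */
	struct ToyTxn *toptxn;		/* NULL for a top-level transaction */
} ToyTxn;

typedef struct ToyBuffer { size_t size; } ToyBuffer;

static void
mem_update(ToyBuffer *rb, ToyTxn *txn, size_t sz, bool addition)
{
	ToyTxn	   *top = txn->toptxn ? txn->toptxn : txn;

	if (addition)
	{
		txn->size += sz;
		top->total_size += sz;
		rb->size += sz;
	}
	else
	{
		txn->size -= sz;
		top->total_size -= sz;
		rb->size -= sz;
	}
}

int
main(void)
{
	ToyBuffer	rb = {0};
	ToyTxn		top = {0, 0, NULL};
	ToyTxn		sub = {0, 0, &top};

	mem_update(&rb, &sub, 100, true);	/* change queued in the subxact */
	assert(sub.size == 100 && top.total_size == 100 && rb.size == 100);

	mem_update(&rb, &sub, 100, false);	/* streamed or spilled away */
	assert(sub.size == 0 && top.total_size == 0 && rb.size == 0);
	return 0;
}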
 
 /*
@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2388,6 +2971,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update memory account
+ * for subtransaction with streaming, so it's always 0). But we can simply
+ * iterate over the limited number of toplevel transactions.
+ *
+ * Note that, we skip transactions that contains incomplete changes. There
+ * is a scope of optimization here such that we can select the largest transaction
+ * which has complete changes.  But that will make the code and design quite complex
+ * and that might not be worth the benefit.  If we plan to stream the transactions
+ * that contains incomplete changes then we need to find a way to partially
+ * stream/truncate the transaction changes in-memory and build a mechanism to
+ * partially truncate the spilled files.  Additionally, whenever we partially
+ * stream the transaction we need to maintain the last streamed lsn and next time
+ * we need to restore from that segment and the offset in WAL.  As we stream the
+ * changes from the top transaction and restore them subtransaction wise, we need
+ * to even remember the subxact from where we streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +3047,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3151,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3363,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * Even if streaming is enabled, we can't start streaming while this
+	 * record still needs to be skipped, e.g. because we previously decoded
+	 * this transaction and are now merely restarting.
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * Unlike ReorderBufferCommit(), we can't rely on the base snapshot having
+	 * been transferred from the subxacts by ReorderBufferCommitChild(),
+	 * because that has not been called yet - the transaction is still
+	 * in progress.
+	 *
+	 * So just walk the subxacts and apply the same logic here.  We only need
+	 * to do that once, when the transaction is streamed for the first time;
+	 * after that we reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit which adds xids of all the subtransactions in
+	 * snapshot's xip array via SnapBuildCommittedTxn, we can't do that here
+	 * but we do add them to subxip array instead via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in subtransactions decoded till
+	 * now to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run.  We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through the subxacts again).  In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because we might have
+		 * gained new subtransactions since the last streaming run, and they
+		 * need to be added to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3593,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4302,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4592,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0d28f01..b188427 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
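
The guard above (repeated for each table AM entry point below) distinguishes catalog scans made through the systable_* routines, which are expected to set bsysscan, from direct table scans that must not happen while decoding. A standalone sketch of the gating logic (hypothetical mock, not the patch's code):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

static TransactionId CheckXidAlive = InvalidTransactionId;
static bool bsysscan = false;

/* Mock of the guard each tableam wrapper performs in this patch. */
static void
scan_guard(const char *where)
{
	if (CheckXidAlive != InvalidTransactionId && !bsysscan)
		printf("ERROR: unexpected %s call during logical decoding\n", where);
	else
		printf("%s: ok\n", where);
}

int
main(void)
{
	CheckXidAlive = 4711;		/* decoding a streamed transaction */

	bsysscan = true;			/* a systable scan would set this */
	scan_guard("table_scan_getnextslot");	/* ok: catalog scan */

	bsysscan = false;			/* direct scan, outside systable_* */
	scan_guard("table_scan_getnextslot");	/* rejected */
	return 0;
}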
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index ac3f5e3..5f767eb 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes has toast insert, without main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes has speculative insert, without speculative
+ * confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

v38/v38-0004-Extend-the-BufFile-interface-for-the-streaming-o.patch

From 08d3eb4fcc57562dadd5a4c6bf9e23c87f75d0f0 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v38 4/8] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Add a BufFileTruncate interface to allow files to be truncated up to a
particular offset.  Extend the BufFileSeek API to support the SEEK_END
case.  Add an option to provide a mode while opening shared BufFiles,
instead of always opening them in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

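Before the diff itself, a short sketch of the usage pattern the commit message describes (illustrative fragment; the fileset, data buffer, and remembered truncate point are hypothetical, and error handling is omitted): a backend writes a named file, closes it, and in a later transaction reopens it read-write, finds the end with the new SEEK_END support, and truncates it.

	BufFile    *file;
	int			fileno;
	off_t		offset;

	/* First transaction: create and fill the file. */
	file = BufFileCreateShared(fileset, "xid-1000-subxid-2");
	BufFileWrite(file, buf, len);
	BufFileClose(file);

	/* A later transaction: reopen read-write instead of read-only. */
	file = BufFileOpenShared(fileset, "xid-1000-subxid-2", O_RDWR);

	/* Position at the end of the file, using the new SEEK_END support. */
	BufFileSeek(file, 0, 0, SEEK_END);
	BufFileTell(file, &fileno, &offset);

	/* Discard everything after some previously remembered point. */
	BufFileTruncateShared(file, saved_fileno, saved_offset);
	BufFileClose(file);
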
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2..6c97f68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349..1140cf8 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as a
+ * member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The file size of the last file gives us the end offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the given BufFile up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over the files from the last one down to the given fileno. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files after the given fileno can be deleted outright.  The file at
+		 * fileno itself can be deleted too when the offset is 0, unless it
+		 * is the very first file, which we always keep.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
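
To illustrate how the new truncate API composes with the existing
BufFileTell(): a caller records the (fileno, offset) position before
writing a block of changes, and truncates back to that position if the
block later has to be discarded. A minimal sketch, assuming a BufFile
backed by a SharedFileSet; record_pos/rollback_to are illustrative
helpers, not part of the patch:

#include "postgres.h"

#include "storage/buffile.h"

/* Saved write position within a shared BufFile. */
typedef struct SavedPos
{
	int			fileno;			/* segment file number within the BufFile */
	off_t		offset;			/* offset within that segment */
} SavedPos;

/* Remember where the next write will land. */
static void
record_pos(BufFile *file, SavedPos *pos)
{
	BufFileTell(file, &pos->fileno, &pos->offset);
}

/* Discard everything written at or after the saved position. */
static void
rollback_to(BufFile *file, const SavedPos *pos)
{
	BufFileTruncateShared(file, pos->fileno, pos->offset);
}

This is exactly the pattern the apply worker uses later in this series:
subxact_info_add() records the position of a subxact's first change, and
apply_handle_stream_abort() truncates the changes file back to it.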
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * This interface can also be used if the temporary files are used only by a
+ * single backend but need to be opened and closed multiple times, and the
+ * underlying files need to survive across transactions.  For such cases, the
+ * dsm segment 'seg' should be passed as NULL.  Such files are removed on
+ * proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * The cleanup callback must be registered before any fileset is
+			 * added to the list.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function invoked on process exit.  It walks the list of
+ * registered shared filesets and deletes the underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell   *l;
+
+	/* Loop over all the registered shared fileset entries */
+	foreach(l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool		found PG_USED_FOR_ASSERTS_ONLY = false;
+	ListCell   *l;
+
+	/*
+	 * If the caller is using DSM-based cleanup, we don't maintain the
+	 * filesetlist, so there is nothing to unregister.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach(l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
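
For review purposes, here is a minimal sketch of the backend-local
lifecycle this enables; the memory context and the "spool" file name are
illustrative choices, not part of the patch:

#include "postgres.h"

#include <fcntl.h>

#include "storage/buffile.h"
#include "storage/sharedfileset.h"
#include "utils/memutils.h"

/* Create a spool file that survives transaction boundaries. */
static BufFile *
spool_create(SharedFileSet **filesetp)
{
	SharedFileSet *fileset;

	/*
	 * The fileset is reused across transactions, so it must live in a
	 * long-lived context; TopMemoryContext is used here for brevity.
	 */
	fileset = MemoryContextAlloc(TopMemoryContext, sizeof(SharedFileSet));

	/* Passing NULL for 'seg' selects the new proc-exit cleanup path. */
	SharedFileSetInit(fileset, NULL);
	*filesetp = fileset;

	return BufFileCreateShared(fileset, "spool");
}

/* Reopen the file later; it survived the earlier BufFileClose(). */
static BufFile *
spool_reopen(SharedFileSet *fileset)
{
	return BufFileOpenShared(fileset, "spool", O_RDWR);
}

/*
 * Delete the file; per the hunk above, BufFileDeleteShared() also
 * unregisters the fileset from the proc-exit cleanup list.
 */
static void
spool_destroy(SharedFileSet *fileset)
{
	BufFileDeleteShared(fileset, "spool");
	pfree(fileset);
}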
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

v38/v38-0005-Add-support-for-streaming-to-built-in-replicatio.patch

From adc5b368937f13c25f1d5aa3e1ba4f63bbe177b0 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v38 5/8] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions and to allow adding additional bits of information (e.g.
the XID of subtransactions); the new messages are summarized after the
diffstat below.

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions, by spilling the data to disk and then
replaying it on commit.

However, we must explicitly disable streaming during replication slot
creation, even if the plugin supports it.  We don't need to replicate
the changes accumulated during this phase, and moreover we don't have a
replication connection open, so there is nowhere to send the data
anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 946 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 19 files changed, 2060 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

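For reviewers, the protocol additions implemented in proto.c below boil
down to four new message kinds, plus an optional leading XID on the
existing data messages ('R', 'Y', 'I', 'U', 'D', 'T') when they belong
to a streamed transaction:

* 'S' (stream start): toplevel XID, plus a flag byte indicating whether
this is the first streamed segment for that XID.

* 'E' (stream stop): no payload.

* 'c' (stream commit): toplevel XID, a flags byte (currently always
zero), commit LSN, end LSN, and commit timestamp.

* 'A' (stream abort): toplevel XID and subtransaction XID (the two are
equal when the toplevel transaction itself aborts).
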
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
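
As a usage note (illustrative, not part of the patch text): with this
option a subscription would be created as, e.g., CREATE SUBSCRIPTION
mysub CONNECTION '...' PUBLICATION mypub WITH (streaming = on), and per
the alter_subscription.sgml hunk above it can later be toggled with
ALTER SUBSCRIPTION mysub SET (streaming = off).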
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index e6afb32..153c562 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68..479e3ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e905723..a6101ac 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
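
With this change, the walsender on a version 14 or newer publisher
receives a command along the lines of

START_REPLICATION SLOT "mysub" LOGICAL 0/0 (proto_version '2', streaming 'on', publication_names '"mypub"')

where the slot name, start LSN, and proto_version value are illustrative.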
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 2b1356e..c205950 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -732,3 +759,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
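
As a quick sanity sketch of the new framing: the following illustrative
function (not part of the patch) writes a stream-start message and reads
it back, mimicking apply_dispatch by consuming the action byte before
calling the read function:

#include "postgres.h"

#include "lib/stringinfo.h"
#include "replication/logicalproto.h"

static void
stream_start_roundtrip(TransactionId xid)
{
	StringInfoData buf;
	bool		first_segment;
	TransactionId got;

	initStringInfo(&buf);
	logicalrep_write_stream_start(&buf, xid, true);

	/* apply_dispatch normally consumes the action byte ('S'); skip it */
	buf.cursor = 1;

	got = logicalrep_read_stream_start(&buf, &first_segment);
	Assert(got == xid);
	Assert(first_segment);

	pfree(buf.data);
}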
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 407eee3..7071309 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * also requires dealing with aborts of both the toplevel transaction and
+ * of subtransactions.  This is achieved by tracking the offset of each
+ * subtransaction's first change, which is then used to truncate the file
+ * with the serialized changes when the subtransaction aborts.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides automatic cleanup on error, and (c) it allows the
+ * files to survive across local transactions, so they can be opened and
+ * closed at each stream start/stop.  We use the SharedFileSet infrastructure
+ * because without it the files would be deleted as soon as they are closed,
+ * and keeping the stream files open across stream start/stop would consume
+ * a lot of memory (more than 8K per file).  Moreover, without SharedFileSet
+ * we would also need to invent a new way to pass filenames to the BufFile
+ * APIs, so that the desired file can be reopened across multiple stream-open
+ * calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create an entry in
+ * the xidhash, create the streaming file, and store the fileset handle in
+ * the entry.  The subxact file is created only if there is any subxact info
+ * for this xid.  On subsequent streams for the same xid, the entry is looked
+ * up to get the corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with the shared
+ * filesets for the stream and subxact files.  On every stream start we need
+ * to open the xid's files, and for that we need the shared fileset handles,
+ * so storing them in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the currently open streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -608,17 +748,323 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside a remote transaction or a streamed
+	 * transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the BufFile,
+	 * which is used to serialize the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, read the existing subxact info */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -631,6 +1077,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -646,6 +1095,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -682,6 +1134,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -797,6 +1252,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -938,6 +1396,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1308,6 +1769,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1446,6 +1910,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1558,6 +2038,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode
+	 * is enabled.  It is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1662,7 +2150,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1935,6 +2423,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option was changed.  The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1967,6 +2469,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* the entry for this xid's toplevel transaction must exist by now */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * The shared fileset must be maintained across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info.  There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * that memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need it for the whole stream so that we can add new subtransaction
+	 * info to it.  On stream stop we flush the information to the subxact
+	 * file and reset the logical streaming context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're adding a change for the same subxact as in the
+	 * previous call, so we can cheaply skip the array scan in that case.
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable).  If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create
+ * the BufFile, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the buffiles under the logical streaming context so that we
+	 * have those files until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with the length (not
+ * including the length field itself), an action code (identifying the
+ * message type) and the message contents (without the subxact
+ * TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
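+
+/*
+ * For illustration only (a sketch, not used anywhere in the patch): a
+ * reader of this file would mirror the layout written above, e.g.
+ *
+ *	BufFileRead(fd, &len, sizeof(len));			-- total size
+ *	BufFileRead(fd, &action, sizeof(action));		-- message type
+ *	BufFileRead(fd, buf, len - sizeof(char));		-- message contents
+ */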
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2133,6 +3068,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream side is, however, updated only at
+ * commit time, and with streamed transactions the commit order may differ
+ * from the order in which the transactions are sent.  Also, the (sub)
+ * transactions might get aborted, so we need to send the schema for each
+ * (sub) transaction so that we don't lose the schema information on abort.
+ * To handle this, we maintain a list of xids (streamed_txns) for which we
+ * have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") != 0 &&
+				strcmp(strVal(defel->arg), "off") != 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
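+
+	/*
+	 * Illustrative only: with the new option, a subscriber's
+	 * START_REPLICATION command may look like
+	 *
+	 *	START_REPLICATION SLOT "sub" LOGICAL 0/0
+	 *		(proto_version '2', publication_names '"pub"', streaming 'on')
+	 */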
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficiently recent protocol
+		 * version, and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable the streaming during the slot initialization mode. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * if it's a top-level transaction or not (we have already sent that XID
+	 * at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied only later (or aborted), the
+	 * regular transactions won't see their effects until then, and they may
+	 * be applied in an order that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify the downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify the downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
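+/*
+ * Send the start of a block of streamed changes, including the replication
+ * origin info, if any (the latter only for the transaction's first chunk).
+ */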
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
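+/*
+ * Send the end of the current block of streamed changes.
+ */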
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions, so a simple
+ * linear search of the per-entry list is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid in the rel sync entry for which we have already sent the schema
+ * of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 287288a..3cc4878 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /* Tuple coming via logical replication. */
 typedef struct LogicalRepTupleData
@@ -93,25 +97,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
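+/*
+ * Illustrative message flow for a streamed transaction (a sketch; each
+ * chunk of changes is framed by start/stop messages, and the sequence
+ * ends with a commit or an abort):
+ *
+ *	stream_start ... changes ... stream_stop	(repeated, one pair per chunk)
+ *	stream_commit or stream_abort			(stream_abort may also be
+ *							 sent for a subtransaction)
+ */
+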
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
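+# Use a small logical_decoding_work_mem so the large transaction below
+# exceeds the limit.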
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rollbacks to savepoints are reflected on the subscriber');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check streamed transaction with DDL and subxact rollbacks was applied');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction with the binary option enabled
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v38/v38-0006-Enable-streaming-for-all-subscription-TAP-tests.patch

From b2e2be0fc37b3f5b3e6a4fc4b14c9bfd99116d8c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v38 6/8] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 3a590f8..4ba8086 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v38/v38-0007-Add-TAP-test-for-streaming-vs.-DDL.patch

From f45ee70069e060aef93e7484aa8f79e01fd0b4fc Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v38 7/8] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v38/v38-0008-Add-streaming-option-in-pg_dump.patch

From d695170959108706f6ea5558185ee5239566a126 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v38 8/8] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 94459b3..f69d64c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char       *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1

#446Ajin Cherian
itsajin@gmail.com
In reply to: Dilip Kumar (#445)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 20, 2020 at 11:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

There was one warning in release mode in the last version in 0004 so
attaching a new version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Hello,

I have reworked the patch that adds stats for the streaming of logical
replication, basing it on the new logical replication stats framework
developed by Masahiko-san and rebased by Amit in [1]. This uses v38 of
the streaming logical replication patch set as well as v1 of the stats
framework patch as its base. I will rebase this as the stats framework
is updated. Let me know if you have any comments.
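
To illustrate how the three new counters relate, here is a minimal
standalone C sketch mirroring the accounting the attached patch adds to
ReorderBufferStreamTXN (the StatsDemo/TxnDemo types are hypothetical
stand-ins for illustration only, not PostgreSQL code):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* stand-ins for the reorderbuffer fields touched by the patch */
    typedef struct { int64_t streamTxns, streamCount, streamBytes; } StatsDemo;
    typedef struct { bool already_streamed; int64_t total_size; } TxnDemo;

    /*
     * Every streaming invocation bumps stream_count and stream_bytes,
     * but a transaction counts toward stream_txns only the first time
     * it is streamed.
     */
    static void
    stream_txn_demo(StatsDemo *rb, TxnDemo *txn)
    {
        rb->streamCount += 1;
        rb->streamBytes += txn->total_size;
        if (!txn->already_streamed)
        {
            rb->streamTxns += 1;
            txn->already_streamed = true;  /* rbtxn_is_streamed in the patch */
        }
    }

    int
    main(void)
    {
        StatsDemo rb = {0, 0, 0};
        TxnDemo txn = {false, 1024};

        stream_txn_demo(&rb, &txn);   /* first chunk of a transaction */
        stream_txn_demo(&rb, &txn);   /* second chunk, same transaction */

        /* prints stream_txns=1 stream_count=2 stream_bytes=2048 */
        printf("stream_txns=%lld stream_count=%lld stream_bytes=%lld\n",
               (long long) rb.streamTxns,
               (long long) rb.streamCount,
               (long long) rb.streamBytes);
        return 0;
    }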

regards,
Ajin Cherian
Fujitsu Australia

[1]: /messages/by-id/CA+fd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg@mail.gmail.com

Attachments:

v1_streaming_stats_update.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 44a5985..8d3bc7d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2588,6 +2588,39 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
          Amount of decoded transaction data spilled to disk.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="catalog_table_entry"><para role="column_definition">
+         <structfield>stream_txns</structfield> <type>bigint</type>
+        </para>
+        <para>
+         Number of in-progress transactions streamed to subscriber after
+         memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+         Streaming only works with toplevel transactions (subtransactions can't
+         be streamed independently), so the counter does not get incremented for
+         subtransactions.
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="catalog_table_entry"><para role="column_definition">
+         <structfield>stream_count</structfield> <type>bigint</type>
+        </para>
+        <para>
+         Number of times in-progress transactions were streamed to subscriber.
+         Transactions may get streamed repeatedly, and this counter gets incremented
+         on every such invocation.
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="catalog_table_entry"><para role="column_definition">
+         <structfield>stream_bytes</structfield> <type>bigint</type>
+        </para>
+        <para>
+         Amount of decoded in-progress transaction data streamed to subscriber.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 4ab14eb..042278e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -795,7 +795,10 @@ CREATE VIEW pg_stat_replication_slots AS
             s.name,
             s.spill_txns,
             s.spill_count,
-            s.spill_bytes
+            s.spill_bytes,
+            s.stream_txns,
+            s.stream_count,
+            s.stream_bytes
     FROM pg_stat_get_replication_slots() AS s;
 
 CREATE VIEW pg_stat_slru AS
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 12d6c59..0a6c452 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1643,7 +1643,7 @@ pgstat_report_tempfile(size_t filesize)
  */
 void
 pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
-					   int spillbytes)
+					   int spillbytes, int streamtxns, int streamcount, int  streambytes)
 {
 	PgStat_MsgReplSlot msg;
 
@@ -1656,6 +1656,9 @@ pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
 	msg.m_spill_txns = spilltxns;
 	msg.m_spill_count = spillcount;
 	msg.m_spill_bytes = spillbytes;
+	msg.m_stream_txns = streamtxns;
+	msg.m_stream_count = streamcount;
+	msg.m_stream_bytes = streambytes;
 	pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
 }
 
@@ -6674,6 +6677,9 @@ pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
 		replSlotStats[idx].spill_txns += msg->m_spill_txns;
 		replSlotStats[idx].spill_count += msg->m_spill_count;
 		replSlotStats[idx].spill_bytes += msg->m_spill_bytes;
+		replSlotStats[idx].stream_txns += msg->m_stream_txns;
+		replSlotStats[idx].stream_count += msg->m_stream_count;
+		replSlotStats[idx].stream_bytes += msg->m_stream_bytes;
 	}
 }
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2b216a3..d510068 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1469,12 +1469,19 @@ UpdateSpillStats(LogicalDecodingContext *ctx)
         rb,
         (long long) rb->spillTxns,
         (long long) rb->spillCount,
-        (long long) rb->spillBytes);
+        (long long) rb->spillBytes,
+        (long long) rb->streamTxns,
+        (long long) rb->streamCount,
+        (long long) rb->streamBytes);
 
    pgstat_report_replslot(NameStr(ctx->slot->data.name),
-                          rb->spillTxns, rb->spillCount, rb->spillBytes);
+                          rb->spillTxns, rb->spillCount, rb->spillBytes,
+						   rb->streamTxns, rb->streamCount, rb->streamBytes);
    rb->spillTxns = 0;
    rb->spillCount = 0;
    rb->spillBytes = 0;
+   rb->streamTxns = 0;
+   rb->streamCount = 0;
+   rb->streamBytes = 0;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4ea0356..ac4422b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -347,6 +347,9 @@ ReorderBufferAllocate(void)
 	buffer->spillCount = 0;
 	buffer->spillTxns = 0;
 	buffer->spillBytes = 0;
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
 
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
@@ -3496,10 +3499,18 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->snapshot_now = NULL;
 	}
 
+
+	rb->streamCount += 1;
+	rb->streamBytes += txn->total_size;
+
+	/* Don't consider already streamed transaction. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	/* Process and send the changes to output plugin. */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index a677365..0d50e35 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -324,7 +324,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	ConditionVariableBroadcast(&slot->active_cv);
 
 	/* Create statistics entry for the new slot */
-	pgstat_report_replslot(NameStr(slot->data.name), 0, 0, 0);
+	pgstat_report_replslot(NameStr(slot->data.name), 0, 0, 0, 0, 0, 0);
 }
 
 /*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 7cb186e..a26a503 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2096,7 +2096,7 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_REPLICATION_SLOT_CLOS 4
+#define PG_STAT_GET_REPLICATION_SLOT_CLOS 7
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2144,6 +2144,9 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
 		values[1] = Int64GetDatum(stat.spill_txns);
 		values[2] = Int64GetDatum(stat.spill_count);
 		values[3] = Int64GetDatum(stat.spill_bytes);
+		values[4] = Int64GetDatum(stat.stream_txns);
+		values[5] = Int64GetDatum(stat.stream_count);
+		values[6] = Int64GetDatum(stat.stream_bytes);
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
 	}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a3d94af..b538962 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5255,9 +5255,9 @@
   proname => 'pg_stat_get_replication_slots', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,int8,int8,int8}',
-  proargmodes => '{o,o,o,o}',
-  proargnames => '{name,spill_txns,spill_count,spill_bytes}',
+  proallargtypes => '{text,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{name,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
   prosrc => 'pg_stat_get_replication_slots' },
 { oid => '6118', descr => 'statistics: information about subscription',
   proname => 'pg_stat_get_subscription', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index cdb9a65..c07cdd6 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -467,6 +467,9 @@ typedef struct PgStat_MsgReplSlot
 	PgStat_Counter	m_spill_txns;
 	PgStat_Counter	m_spill_count;
 	PgStat_Counter	m_spill_bytes;
+	PgStat_Counter	m_stream_txns;
+	PgStat_Counter	m_stream_count;
+	PgStat_Counter	m_stream_bytes;
 } PgStat_MsgReplSlot;
 
 
@@ -787,6 +790,9 @@ typedef struct PgStat_ReplSlotStats
 	PgStat_Counter	spill_txns;
 	PgStat_Counter	spill_count;
 	PgStat_Counter	spill_bytes;
+	PgStat_Counter	stream_txns;
+	PgStat_Counter	stream_count;
+	PgStat_Counter	stream_bytes;
 } PgStat_ReplSlotStats;
 
 /* ----------
@@ -1344,7 +1350,7 @@ extern void pgstat_report_deadlock(void);
 extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
 extern void pgstat_report_checksum_failure(void);
 extern void pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
-								   int spillbytes);
+								   int spillbytes, int streamtxns, int streamcount, int streambytes);
 extern void pgstat_report_replslot_drop(const char *slotname);
 
 extern void pgstat_initialize(void);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index fba950c..edc51b1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -536,6 +536,9 @@ struct ReorderBuffer
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillTxns;		/* number of transactions spilled to disk  */
 	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5353f24..197a86c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2011,8 +2011,11 @@ pg_stat_replication| SELECT s.pid,
 pg_stat_replication_slots| SELECT s.name,
     s.spill_txns,
     s.spill_count,
-    s.spill_bytes
-   FROM pg_stat_get_replication_slots() s(name, spill_txns, spill_count, spill_bytes);
+    s.spill_bytes,
+    s.stream_txns,
+    s.stream_count,
+    s.stream_bytes
+   FROM pg_stat_get_replication_slots() s(name, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes);
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
     s.blks_hit,
#447Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#445)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

There was one warning in release mode in the last version in 0004 so
attaching a new version.

Today, I was reviewing patch
v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
small problem with it.

+ /*
+ * Execute the invalidations for xid-less transactions,
+ * otherwise, accumulate them so that they can be processed at
+ * the commit time.
+ */
+ if (!ctx->fast_forward)
+ {
+ if (TransactionIdIsValid(xid))
+ {
+ ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+   invals->nmsgs, invals->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+   buf->origptr);
+ }

I think we need to call ReorderBufferXidSetCatalogChanges even when
ctx->fast_forward is true, because the snapshot build depends on that
flag (see SnapBuildCommitTxn). We already do it that way in
DecodeCommit: even though we skip adding invalidations in the
fast-forward case, we still set the flag to indicate that this txn has
catalog changes. Is there any reason to do things differently here?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#448Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#447)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

There was one warning in release mode in the last version in 0004 so
attaching a new version.

Today, I was reviewing patch
v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
small problem with it.

+ /*
+ * Execute the invalidations for xid-less transactions,
+ * otherwise, accumulate them so that they can be processed at
+ * the commit time.
+ */
+ if (!ctx->fast_forward)
+ {
+ if (TransactionIdIsValid(xid))
+ {
+ ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+   invals->nmsgs, invals->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+   buf->origptr);
+ }

I think we need to call ReorderBufferXidSetCatalogChanges even when
ctx->fast_forward is true, because the snapshot build depends on that
flag (see SnapBuildCommitTxn). We already do it that way in
DecodeCommit: even though we skip adding invalidations in the
fast-forward case, we still set the flag to indicate that this txn has
catalog changes. Is there any reason to do things differently here?

I think it is wrong; we should call ReorderBufferXidSetCatalogChanges
even in fast-forward mode.
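
Roughly, the corrected flow would look like the sketch below (this is
what the attached v39-0001 adopts; shown here only to illustrate the
point, not as the patch itself):

    if (TransactionIdIsValid(xid))
    {
        if (!ctx->fast_forward)
            ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
                                          invals->nmsgs, invals->msgs);

        /*
         * Set the catalog-change flag even in fast-forward mode; the
         * snapshot builder depends on it (see SnapBuildCommitTxn).
         */
        ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
    }
    else if (!ctx->fast_forward)
        ReorderBufferImmediateInvalidation(ctx->reorder,
                                           invals->nmsgs, invals->msgs);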

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v39.tar (application/x-tar)

v39/v39-0001-WAL-Log-invalidations-at-command-end-with-wal_le.patch

From c92c3c20bd0a8a770c79312963914d37aaf50de8 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v39 1/8] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end use a new xlog record type
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in the top-level
transaction, and then executed during replay.  This obviates the need to
decode the invalidations as part of a commit record.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 10 +++++
 src/backend/access/transam/xact.c               | 17 ++++++++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 52 ++++++++++++++++++----
 src/backend/utils/cache/inval.c                 | 55 +++++++++++++++++++++++
 src/include/access/xact.h                       | 13 +++++-
 src/include/replication/reorderbuffer.h         |  3 ++
 src/include/utils/inval.h                       |  2 +
 8 files changed, 177 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..68aa994 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invalidations *xlrec = (xl_xact_invalidations *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd4c3cf..d4f7c29 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1224,6 +1224,16 @@ RecordTransactionCommit(void)
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
 
+	/*
+	 * Log pending invalidations for logical decoding of in-progress
+	 * transactions.  Normally for DDLs, we log this at each command end,
+	 * however, for certain cases where we directly update the system table
+	 * without a transaction block, the invalidations are not logged till this
+	 * time.
+	 */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
 	nchildren = xactGetCommittedChildren(&children);
@@ -6022,6 +6032,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..f3ea15c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invalidations *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions,
+				 * otherwise, accumulate them so that they can be processed at
+				 * the commit time.
+				 */
+				if (TransactionIdIsValid(xid))
+				{
+					if (!ctx->fast_forward)
+						ReorderBufferAddInvalidations(reorder, xid,
+													  buf->origptr,
+													  invals->nmsgs,
+													  invals->msgs);
+					ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+													  buf->origptr);
+				}
+				else if ((!ctx->fast_forward))
+					ReorderBufferImmediateInvalidation(ctx->reorder,
+													   invals->nmsgs,
+													   invals->msgs);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 449327a..ce6e621 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -856,6 +856,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2201,7 +2204,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases where we skip processing the
+ * transaction (see ReorderBufferForget), we need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2212,17 +2219,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2250,6 +2275,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark top-level transaction as having catalog changes too if one of its
+	 * children has so that the ReorderBufferBuildTupleCidHash can
+	 * conveniently check just top-level transaction and decide whether to
+	 * build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..edd9077 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *	CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -1094,6 +1098,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1510,49 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end or at commit time if any invalidations
+ * are pending.
+ */
+void
+LogLogicalInvalidations()
+{
+	xl_xact_invalidations xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvalidations);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvalidations);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..ac3f5e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
@@ -198,6 +198,17 @@ typedef struct xl_xact_assignment
 #define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
 
 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+	int			nmsgs;			/* number of shared inval msgs */
+	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+}			xl_xact_invalidations;
+
+#define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)
+
+/*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
  * only include what's needed.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 019bd38..1055e99 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index bc5081c..463888c 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -61,4 +61,6 @@ extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
+
+extern void LogLogicalInvalidations(void);
 #endif							/* INVAL_H */
-- 
1.8.3.1

v39/v39-0002-Extend-the-logical-decoding-output-plugin-API-wi.patch

From 325a0c89ede0125240cde58ceaa08933162ce156 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v39 2/8] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, adding support for
streaming changes for large transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk of
changes streamed for a particular toplevel transaction.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 123 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 +++++++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  59 +++++
 6 files changed, 825 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..4a71826 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,97 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93c..18116c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+    
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point, the largest top-level transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may exceed the memory limit before having
+    decoded a complete tuple, e.g. having decoded only the toast-table insert
+    but not yet the main-table insert.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
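
To make the documented callback set concrete, a streaming-capable plugin
would wire the callbacks up in its init function roughly as follows (a
minimal sketch; the my_* handler names are hypothetical placeholders, not
part of the patch):

/* Sketch only: register the five required streaming callbacks; the
 * optional stream_message_cb and stream_truncate_cb may stay NULL. */
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	cb->startup_cb = my_startup;
	cb->begin_cb = my_begin;
	cb->change_cb = my_change;
	cb->commit_cb = my_commit;
	cb->shutdown_cb = my_shutdown;

	/* streaming of large in-progress transactions */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;
}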
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be..6ee59bd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. We however enable streaming when at least one
+	 * of the methods is enabled, so that we can easily identify missing
+	 * methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475..deef318 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..2d9aa11 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1055e99..42bc817 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -387,6 +435,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

v39/v39-0003-Implement-streaming-mode-in-ReorderBuffer.patch

From 2955a9652d59bc65120bd563902d47beee382af7 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v39 3/8] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast or speculative insert we spill to disk, because we cannot
generate the complete tuple and stream it.  And, as soon as we get the
complete tuple we stream the transaction, including the serialized
changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. The decoding logic on the
receipt of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.

We have ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/stream.out       |  40 +
 contrib/test_decoding/expected/truncate.out     |   6 +
 contrib/test_decoding/sql/stream.sql            |  21 +
 contrib/test_decoding/sql/truncate.sql          |   1 +
 contrib/test_decoding/test_decoding.c           |  13 +
 doc/src/sgml/logicaldecoding.sgml               |   9 +-
 doc/src/sgml/test-decoding.sgml                 |  22 +
 src/backend/access/heap/heapam.c                |  13 +
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/access/index/genam.c                |  53 ++
 src/backend/access/table/tableam.c              |   8 +
 src/backend/access/transam/xact.c               |  19 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/logical.c       |  10 +
 src/backend/replication/logical/reorderbuffer.c | 969 +++++++++++++++++++++---
 src/include/access/heapam_xlog.h                |   1 +
 src/include/access/tableam.h                    |  55 ++
 src/include/access/xact.h                       |   4 +
 src/include/replication/logical.h               |   1 +
 src/include/replication/reorderbuffer.h         |  56 +-
 21 files changed, 1256 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..bdf7002
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,40 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+ count 
+-------
+   157
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..24d41b1
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 4a71826..bb3d9f3 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 18116c8..98b47b0 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
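
The plugin side producing this output is small; a sketch of the
stream-change handler (simplified from what test_decoding would need, with
options handling omitted) could look like:

/* Simplified sketch of a handler emitting the "streaming change" lines
 * shown above; test_decoding's real function also honors its options. */
static void
pg_decode_stream_change(LogicalDecodingContext *ctx,
						ReorderBufferTXN *txn,
						Relation relation,
						ReorderBufferChange *change)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
	OutputPluginWrite(ctx, true);
}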
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4c..33a4580 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at tableam
+	 * level API but this is called from many places so we need to ensure it
+	 * here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1945,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out, if CheckXidAlive is aborted. We can't directly use
+ * TransactionIdDidAbort as after crash such transaction might not have been
+ * marked as aborted.  See detailed comments in xact.c where the variable
+ * is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29..a61e279 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -235,6 +235,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..99722ee 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs because we are checking for the
+ * concurrent aborts only in systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f3ea15c..866b56d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6ee59bd..8deff89 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
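
For context, the concurrent-abort machinery added above and in genam.c fits
together roughly like this on the caller side in reorderbuffer.c (a
simplified sketch under the patch's design, not the exact patch code): the
streaming code advertises the xid being decoded via CheckXidAlive, the
systable_* APIs raise ERRCODE_TRANSACTION_ROLLBACK if that xid aborts
mid-scan, and the streaming loop catches exactly that error and stops
decoding the transaction gracefully.

	MemoryContext oldcontext = CurrentMemoryContext;

	/* make systable_* scans verify the decoded xact is still alive */
	CheckXidAlive = txn->xid;

	PG_TRY();
	{
		/* ... stream the changes, consulting (user) catalogs ... */
	}
	PG_CATCH();
	{
		ErrorData  *errdata;

		MemoryContextSwitchTo(oldcontext);	/* get out of ErrorContext */
		errdata = CopyErrorData();

		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
		{
			/* concurrent abort detected; stop streaming this xact */
			FlushErrorState();
			FreeErrorData(errdata);
			txn->concurrent_abort = true;
		}
		else
			PG_RE_THROW();
	}
	PG_END_TRY();

	CheckXidAlive = InvalidTransactionId;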
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ce6e621..c469536 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes, so if we have a partial change like a
+ * toast table insert or a speculative insert then we mark such a 'txn' so
+ * that it can't be streamed.  We also ensure that if the changes in such a
+ * 'txn' exceed the logical_decoding_work_mem threshold then we stream them
+ * as soon as we have a complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get the insert or update on the
+	 * main table (both update and insert will first insert into the toast
+	 * table).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * Speculative confirm change must be preceded by speculative insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it is serialized before and the changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for doing the streaming of such a transaction as soon as
+	 * we get the complete change for it is that previously it would have
+	 * reached the memory threshold and wouldn't get streamed because of
+	 * incomplete changes.  Delaying such transactions would increase apply
+	 * lag for them.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed once we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes we detected that the transaction
+	 * was aborted, so there is no point in collecting further changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
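
To illustrate the partial-change tracking in
ReorderBufferProcessPartialChange, consider a hypothetical INSERT whose row
carries toasted columns (an example, not patch code):

/*
 * Hypothetical flag timeline for one INSERT with TOASTed columns:
 *
 *   change arriving                 RBTXN_HAS_TOAST_INSERT   streamable?
 *   ------------------------------  ----------------------   -----------
 *   INSERT into the toast table     set                      no
 *   INSERT into the toast table     still set                no
 *   INSERT into the main table      cleared                  yes
 *
 * Only when the main-table insert arrives is the tuple complete, so only
 * then may the transaction be picked for streaming (and, if it was
 * serialized meanwhile, it is streamed immediately at that point).
 */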
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -764,6 +884,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1310,6 +1468,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Clean up the snapshot from the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1502,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard the changes of a transaction (and its subtransactions) after
+ * streaming them.  Keep the remaining info - transactions, tuplecids,
+ * invalidations and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they originally happened inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is,
+	 * when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1737,178 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
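+	/* Send the remaining (not yet streamed) changes of the transaction. */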
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set the xid for the concurrent-abort check.
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has made catalog updates, we might decode tuples using the
+ * wrong catalog version.  So, to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction to which this change
+ * belongs.  During a catalog scan we check the status of that xid, and if it
+ * has aborted we report a specific error so that we can stop streaming the
+ * current transaction and discard the changes streamed so far.  We might have
+ * already streamed some of the changes for the aborted (sub)transaction, but
+ * that is fine because when we decode the abort we will stream an abort
+ * message to truncate the changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the transaction is not committed yet.  We
+	 * don't check whether the xid aborted; that happens during catalog
+	 * access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse the same while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN such that it
+ * can be used to stream the remaining data of the transaction being
+ * processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true, the data is sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1931,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1947,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't invoke the stream_start callback before processing
+			 * the first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
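+					/*
+					 * Remember the origin of the first change of this run,
+					 * so the callbacks can see it on the txn.
+					 */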
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2053,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2090,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2119,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2152,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2164,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2195,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2241,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2249,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes; send the final message for this set
+		 * of changes depending on the streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2295,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming an in-progress transaction, discard the
+		 * changes we just streamed and mark the transactions as streamed
+		 * (if they contained changes).  Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2334,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read, for both streamed
+ * and non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1931,6 +2460,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could have
+		 * loaded the caches as per this transaction's view (consider DDLs
+		 * that happened in this transaction).  We don't want the decoding of
+		 * future transactions to use those cache entries, so execute the
+		 * invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2545,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2631,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2680,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - one in the reorder buffer, and one in the
+ * transaction containing the change.  The reorder buffer counter allows us
+ * to quickly decide whether we have reached the memory limit; the
+ * transaction counter allows us to quickly pick the largest transaction for
+ * eviction.
+ *
+ * When streaming is enabled, we additionally track the total size in the
+ * toplevel transaction - we can't stream subtransactions individually
+ * anyway, and we only pick toplevel transactions for eviction by streaming,
+ * so only the toplevel total matters there.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2702,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2715,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming supported, update the total size in top level as well. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2388,6 +2971,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because with streaming we track the
+ * total size only at the toplevel, so for subtransactions it's always 0),
+ * but it simply iterates over the limited number of toplevel transactions.
+ *
+ * Note that we skip transactions that contain incomplete changes.  There is
+ * scope for optimization here: we could select the largest transaction that
+ * has only complete changes, but that would make the code and design quite
+ * complex and might not be worth the benefit.  If we wanted to stream
+ * transactions that contain incomplete changes, we would need a way to
+ * partially stream/truncate the transaction changes in memory and a
+ * mechanism to partially truncate the spilled files.  Additionally, whenever
+ * we partially stream a transaction, we would need to remember the last
+ * streamed LSN and, next time, restore from that WAL segment and offset.  As
+ * we stream the changes from the top transaction and restore them
+ * subtransaction-wise, we would even need to remember the subxact from which
+ * we streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
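+		/*
+		 * Consider only transactions that have decoded changes and contain
+		 * no incomplete tuples (see ReorderBufferProcessPartialChange).
+		 */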
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +3047,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3151,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3363,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * We can't start streaming immediately, even if streaming is enabled,
+	 * when we have previously decoded this transaction and are now just
+	 * restarting.
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all the subtransactions to
+	 * the snapshot's xip array via SnapBuildCommittedTxn, we can't do that
+	 * here; instead, we add them to the subxip array via
+	 * ReorderBufferCopySnap.  This allows the catalog changes made in
+	 * subtransactions decoded till now to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run.  We
+		 * assume new subxacts can't move the LSN backwards, and so can't
+		 * beat the LSN condition in the previous branch (so there is no need
+		 * to walk through the subxacts again).  In fact, we must not do
+		 * that, as we may be using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because we might have
+		 * gotten new sub-transactions after the last streaming run, and we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3593,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4302,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4592,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions, we may run into tuples whose CID
+	 * we have not decoded yet.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0d28f01..b188427 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index ac3f5e3..5f767eb 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes contain a toast insert without a main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes contain a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Total size of the top transaction, including its sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected a concurrent abort, ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1
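
To see how the reorderbuffer changes above are consumed, here is a minimal
sketch of the streaming half of an output plugin.  This is an illustration
only, not part of the patch set: the my_* names are hypothetical, and the
callback signatures follow the stream API introduced earlier in this series
(test_decoding and pgoutput carry the real implementations).

#include "postgres.h"

#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"
#include "replication/reorderbuffer.h"
#include "utils/rel.h"

/* Open a block of streamed changes; the receiver buffers them per xid. */
static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "stream start: xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* Close the current block; more blocks may follow for the same xid. */
static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfoString(ctx->out, "stream stop");
	OutputPluginWrite(ctx, true);
}

/* Called instead of the regular change callback while streaming. */
static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "streamed change for xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* The receiver discards everything streamed for this (sub)transaction. */
static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "stream abort: xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* The receiver can now apply all buffered blocks of this transaction. */
static void
my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 XLogRecPtr commit_lsn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "stream commit: xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

A plugin would register these in _PG_output_plugin_init() next to the regular
begin/change/commit callbacks (e.g. cb->stream_start_cb = my_stream_start,
and so on); roughly speaking, ctx->streaming is enabled when the plugin
provides the full stream callback set, which is what ReorderBufferCanStream()
checks above.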

v39/v39-0004-Extend-the-BufFile-interface-for-the-streaming-o.patch

From 61e07cc91b9a0604d549fab394aa66e84d493af1 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v39 4/8] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up to
a particular offset.  Extend the BufFileSeek API to support the SEEK_END
case.  Add an option to provide a mode while opening shared BufFiles
instead of always opening them in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2..6c97f68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349..1140cf8 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as a
+ * member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the size of the last file to determine the end offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop backwards over the files, down to the fileno we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files other than the fileno file can be deleted directly.  If the
+		 * offset is 0 then the fileno file can be deleted as well, unless it
+		 * is the first file.
+		 */
+		if ((i != fileno || offset == 0) && fileno != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
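+			/* The new end position is the end of the previous segment file. */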
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but the files need to be opened and closed multiple times
+ * and the underlying files need to survive across transactions.  For such
+ * cases, the dsm segment 'seg' should be passed as NULL.  We remove such
+ * files on proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering
+			 * the cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on process exit.  It walks the
+ * list of all the registered SharedFileSets and deletes the underlying
+ * files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is using the dsm-based cleanup then we don't maintain
+	 * the filesetlist, so there is nothing to do.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

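To make the fileset/BufFile API changes above concrete, here is a minimal
sketch of the backend-local (seg == NULL) lifecycle the patch enables.  The
wrapper function and file name are hypothetical, and a real caller (such as
the apply worker) would allocate the SharedFileSet in a long-lived memory
context; the calls and signatures are the ones added by the patch:

#include "postgres.h"

#include <fcntl.h>

#include "storage/buffile.h"
#include "storage/sharedfileset.h"

/* hypothetical illustration only, not part of the patch */
static void
spool_demo(SharedFileSet *fileset)
{
	BufFile    *fd;

	/* seg == NULL: backend-local set, registered for proc-exit cleanup */
	SharedFileSetInit(fileset, NULL);

	/* create the file and close it; it now survives the transaction */
	fd = BufFileCreateShared(fileset, "demo.changes");
	BufFileClose(fd);

	/* reopen later (possibly in another transaction), now read-write */
	fd = BufFileOpenShared(fileset, "demo.changes", O_RDWR);
	BufFileTruncateShared(fd, 0, 0);	/* discard all contents again */
	BufFileClose(fd);

	/*
	 * If the files are not deleted earlier (SharedFileSetDeleteAll), the
	 * on_proc_exit callback registered above removes them at backend exit;
	 * SharedFileSetUnregister() drops a set from that cleanup list.
	 */
}
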
v39/v39-0005-Add-support-for-streaming-to-built-in-replicatio.patch

From ffa694ce2bafd1a7d990b98bcfc7690847743b2f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v39 5/8] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, so it can identify in-progress
transactions, and allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so there is nowhere
to send the data anyway.
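
In outline, the apply-side flow implemented by the worker changes below
looks like this (a condensed sketch, not literal code):

  stream_start(xid)    -> ensure_transaction(); create/open the spool file
  streamed change      -> append (len, action, payload) to the spool file
  stream_stop          -> write subxact offsets; close file; commit local xact
  stream_abort(x, sx)  -> x == sx ? delete the spooled files
                                  : truncate the spool file at sx's offset
  stream_commit(xid)   -> replay the spool file via apply_dispatch();
                          commit; delete the spooled files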
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 946 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 19 files changed, 2060 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68..479e3ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e905723..a6101ac 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..ff25924 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
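
For quick reference, the wire framing implemented by the read/write
functions above works out to the following layout (derived from the
pq_send*/pq_getmsg* calls; shown here only as a summary):

  STREAM START   'S' | int32 xid | int8 first_segment
  STREAM STOP    'E'
  STREAM COMMIT  'c' | int32 xid | int8 flags (0) | int64 commit_lsn
                     | int64 end_lsn | int64 commit_time
  STREAM ABORT   'A' | int32 xid | int32 subxid

For the existing messages ('I', 'U', 'D', 'T', 'R', 'Y'), the int32 xid is
inserted right after the action byte, and only when streaming (i.e. when
the xid passed to the write function is valid).
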
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 2fcf2e6..98e7fd0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * also has to deal with aborts of both the toplevel transaction and of
+ * subtransactions.  This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with the serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * remote transactions with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides automatic cleanup on error, and (c) it allows the
+ * files to survive across local transactions, so they can be opened and
+ * closed at each stream start and stop.  We build on the SharedFileSet
+ * infrastructure because without it the underlying files are deleted as
+ * soon as they are closed; keeping the stream files open across start/stop
+ * would consume a lot of memory (more than 8K per file).  Moreover, without
+ * SharedFileSet we would need a new way to pass filenames to the BufFile
+ * APIs so that the same file can be reopened across multiple stream-open
+ * calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid, we create this entry in
+ * the xidhash, create the streaming file, and store the fileset handle.
+ * The subxact file is created iff there is any subxact info under this xid.
+ * The entry is used on subsequent streams for the same xid to look up the
+ * corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table storing the streaming xid information along with the shared
+ * filesets for the stream and subxact files.  On every stream start we need
+ * to open the xid's files, which requires the shared fileset handles, so
+ * storing them in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +752,323 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside a remote transaction or a
+	 * streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; it will be committed at stream
+	 * stop.  We need the transaction for handling the BufFiles, which are
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1081,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1099,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1138,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1256,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1408,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1781,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1922,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2050,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode
+	 * is enabled.  It is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2162,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2435,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option was changed.  The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2481,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for the top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it is not already created; otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info.  There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * that memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context.  We
+	 * need this information for the whole stream so that we can keep adding
+	 * new subtransaction info to it.  On stream stop we will flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the subxact array.  We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context so that
+	 * they stay open until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with a length (not including
+ * the length field itself), an action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2145,6 +3080,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
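
To summarize stream_write_change() and its reader in
apply_handle_stream_commit(), each record in the per-transaction changes
file is laid out as:

  int32  len         /* action byte + payload, excluding this length field */
  char   action      /* 'I', 'U', 'D', 'R', 'Y', 'T', ... */
  char   payload[]   /* original message contents, minus the subxact XID */

and the subxact file written by subxact_info_write() is simply the count
(nsubxacts) followed by that many SubXactInfo structs.
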
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record for the
+ * relation was already sent to the subscriber (in which case we don't need
+ * to send it again).
+ *
+ * The schema cache on the downstream side is, however, updated only at
+ * commit time, and with streamed transactions the commit order may differ
+ * from the order in which the transactions are sent.  Also, the (sub)
+ * transactions might get aborted, so we need to send the schema for each
+ * (sub) transaction so that we don't lose the schema information on abort.
+ * To handle this, we maintain a list of xids (streamed_txns) for which we
+ * have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") != 0 &&
+				strcmp(strVal(defel->arg), "off") != 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficiently recent protocol
+		 * version, and only when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or a subtransaction (we have
+	 * already sent that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied only much later (or not at
+	 * all, on abort), in an order that we don't know at this point, and the
+	 * regular transactions won't see their effects until then.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify the downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify the downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
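+/*
+ * Send the start of a streamed block of changes for the given toplevel
+ * transaction; on the first block this also sends the replication origin,
+ * if any.
+ */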
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
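+/*
+ * Send the end of the current streamed block of changes, closing the
+ * streaming block opened by pgoutput_stream_start().
+ */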
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema of the relation was already sent within the
+ * given streamed (toplevel) transaction.  We expect a relatively small
+ * number of streamed transactions, so a simple list search is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record in the rel sync entry that we have already sent the schema of the
+ * relation within the given streamed (toplevel) transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		/*
+		 * Only entries for which we streamed the schema in this transaction
+		 * are affected; for those the subscriber will have updated its
+		 * relation cache at commit, so mark the schema as sent.
+		 */
+		if (!list_member_int(entry->streamed_txns, xid))
+			continue;
+
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema-sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming of in-progress transactions */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
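
As an aside, these new I/O wait events become user-visible only once they are
mapped to display names in pgstat_get_wait_io() (that hunk is not shown here).
Assuming the display names begin with "Logical", a monitoring query along
these lines could confirm the streamed change/subxact files are being read
and written:

SELECT pid, wait_event_type, wait_event
  FROM pg_stat_activity
 WHERE wait_event_type = 'IO'
   AND wait_event LIKE 'Logical%';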
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..655144d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
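
To illustrate how the new stream entry points fit together, here is a sketch
(not part of the patch) of an output-plugin callback emitting one streamed
block, mirroring the pgoutput_stream_start()/pgoutput_stream_stop() callbacks
earlier in this patch:

#include "postgres.h"

#include "replication/logical.h"
#include "replication/logicalproto.h"
#include "replication/reorderbuffer.h"

/*
 * Emit one streamed block of changes for a toplevel transaction.
 */
static void
my_stream_block(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* open the block; first_segment is true for the transaction's first block */
	OutputPluginPrepareWrite(ctx, true);
	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
	OutputPluginWrite(ctx, true);

	/*
	 * ... the decoded changes are written here via logicalrep_write_insert()
	 * and friends, each carrying the (sub)transaction's xid ...
	 */

	/* close the block */
	OutputPluginPrepareWrite(ctx, true);
	logicalrep_write_stream_stop(ctx->out);
	OutputPluginWrite(ctx, true);
}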
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
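
With the streaming flag plumbed through WalRcvStreamOptions, the apply
worker's START_REPLICATION command ends up looking roughly like this (a
sketch; the option quoting follows the existing proto_version and
publication_names conventions):

START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
    (proto_version '2', publication_names '"tap_pub"', streaming 'on')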
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of a large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of a large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v39/v39-0006-Enable-streaming-for-all-subscription-TAP-tests.patch

From e059dd140d37737c1e7dec457a74ca913ae720de Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v39 6/8] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b5..21410fa 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v39/v39-0007-Add-TAP-test-for-streaming-vs.-DDL.patch:

From 164a07fae821805bc355d579abdf2bf8e7930094 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v39 7/8] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v39/v39-0008-Add-streaming-option-in-pg_dump.patch:

From cb212a66dc56df231ba35960e5230c4f3f41db42 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v39 8/8] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 94459b3..f69d64c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char       *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1

#449Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#448)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 22, 2020 at 10:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > > There was one warning in release mode in the last version of 0004, so
> > > I am attaching a new version.
> >
> > Today, I was reviewing patch
> > v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
> > small problem with it.
> >
> > + /*
> > + * Execute the invalidations for xid-less transactions,
> > + * otherwise, accumulate them so that they can be processed at
> > + * the commit time.
> > + */
> > + if (!ctx->fast_forward)
> > + {
> > + if (TransactionIdIsValid(xid))
> > + {
> > + ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
> > +   invals->nmsgs, invals->msgs);
> > + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
> > +   buf->origptr);
> > + }
> >
> > I think we need to call ReorderBufferXidSetCatalogChanges even when
> > ctx->fast_forward is true, because the snapshot build depends on that
> > flag (see SnapBuildCommitTxn). We already handle it that way in
> > DecodeCommit: even though we skip adding invalidations in the
> > fast-forward case, we still set the flag to indicate that this txn
> > has catalog changes. Is there any reason to do things differently
> > here?
>
> I think it is wrong; we should call ReorderBufferXidSetCatalogChanges
> even in fast-forward mode.
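In outline, the fix both of them agree on hoists the
ReorderBufferXidSetCatalogChanges() call out of the fast-forward check, so
the snapshot builder still learns about catalog changes even when no
invalidations are accumulated. A sketch of the resulting shape (it matches
the hunk as it lands in v40-0001 below):

case XLOG_XACT_INVALIDATIONS:
	{
		TransactionId	xid = XLogRecGetXid(r);
		xl_xact_invals *invals = (xl_xact_invals *) XLogRecGetData(r);

		if (TransactionIdIsValid(xid))
		{
			/* accumulate the messages only when actually decoding */
			if (!ctx->fast_forward)
				ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
											  invals->nmsgs, invals->msgs);

			/*
			 * Set the catalog-changes flag unconditionally; the snapshot
			 * build (SnapBuildCommitTxn) relies on it even in
			 * fast-forward mode.
			 */
			ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
											  buf->origptr);
		}
		else if (!ctx->fast_forward)
			ReorderBufferImmediateInvalidation(ctx->reorder,
											   invals->nmsgs,
											   invals->msgs);
	}
	break;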

Thanks for the change. I have one more minor comment on the patch
0001-WAL-Log-invalidations-at-command-end-with-wal_le:

 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+ int nmsgs; /* number of shared inval msgs */
+ SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+} xl_xact_invalidations;

I see that we already have a structure, xl_xact_invals, in the code
with the same members, so I think it is better to use that instead of
defining a new one.
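For reference, the existing structure has the same members; its
declaration in src/include/access/xact.h looks like this (quoted from
memory of the tree at the time; xact.h is authoritative):

typedef struct xl_xact_invals
{
	int			nmsgs;			/* number of shared inval msgs */
	SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
} xl_xact_invals;

#define MinSizeOfXactInvals	offsetof(xl_xact_invals, msgs)

Reusing it also lets the patch keep using the existing MinSizeOfXactInvals
macro when assembling the record, which is what LogLogicalInvalidations in
v40-0001 below does.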

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#450Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#449)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 22, 2020 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> On Wed, Jul 22, 2020 at 10:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > > There was one warning in release mode in the last version of 0004, so
> > > > I am attaching a new version.
> > >
> > > Today, I was reviewing patch
> > > v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
> > > small problem with it.
> > >
> > > + /*
> > > + * Execute the invalidations for xid-less transactions,
> > > + * otherwise, accumulate them so that they can be processed at
> > > + * the commit time.
> > > + */
> > > + if (!ctx->fast_forward)
> > > + {
> > > + if (TransactionIdIsValid(xid))
> > > + {
> > > + ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
> > > +   invals->nmsgs, invals->msgs);
> > > + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
> > > +   buf->origptr);
> > > + }
> > >
> > > I think we need to call ReorderBufferXidSetCatalogChanges even when
> > > ctx->fast_forward is true, because the snapshot build depends on that
> > > flag (see SnapBuildCommitTxn). We already handle it that way in
> > > DecodeCommit: even though we skip adding invalidations in the
> > > fast-forward case, we still set the flag to indicate that this txn
> > > has catalog changes. Is there any reason to do things differently
> > > here?
> >
> > I think it is wrong; we should call ReorderBufferXidSetCatalogChanges
> > even in fast-forward mode.
>
> Thanks for the change. I have one more minor comment on the patch
> 0001-WAL-Log-invalidations-at-command-end-with-wal_le:
>
>  /*
> + * Invalidations logged with wal_level=logical.
> + */
> +typedef struct xl_xact_invalidations
> +{
> + int nmsgs; /* number of shared inval msgs */
> + SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
> +} xl_xact_invalidations;
>
> I see that we already have a structure, xl_xact_invals, in the code
> with the same members, so I think it is better to use that instead of
> defining a new one.

You are right. I have changed it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v40.tar (application/x-tar)
v40/v40-0001-WAL-Log-invalidations-at-command-end-with-wal_le.patch:

From e27023079ffc9e2bcce2c2316ab9e6b5f20c5192 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 6 Jun 2020 09:54:21 +0530
Subject: [PATCH v40 1/8] WAL Log invalidations at command end with
 wal_level=logical.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end use a new xlog record type,
XLOG_XACT_INVALIDATIONS, from the RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in the top-level
transaction, and then executed during replay.  This obviates the need to
decode the invalidations as part of a commit record.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/access/rmgrdesc/xactdesc.c          | 10 +++++
 src/backend/access/transam/xact.c               | 17 ++++++++
 src/backend/replication/logical/decode.c        | 58 +++++++++++++++----------
 src/backend/replication/logical/reorderbuffer.c | 52 ++++++++++++++++++----
 src/backend/utils/cache/inval.c                 | 55 +++++++++++++++++++++++
 src/include/access/xact.h                       |  2 +-
 src/include/replication/reorderbuffer.h         |  3 ++
 src/include/utils/inval.h                       |  2 +
 8 files changed, 166 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 9fce755..addd95f 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -396,6 +396,13 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
 		xact_desc_assignment(buf, xlrec);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		xl_xact_invals *xlrec = (xl_xact_invals *) rec;
+
+		standby_desc_invalidations(buf, xlrec->nmsgs, xlrec->msgs, InvalidOid,
+								   InvalidOid, false);
+	}
 }
 
 const char *
@@ -423,6 +430,9 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ASSIGNMENT:
 			id = "ASSIGNMENT";
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			id = "INVALIDATION";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd4c3cf..d4f7c29 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1224,6 +1224,16 @@ RecordTransactionCommit(void)
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
 
+	/*
+	 * Log pending invalidations for logical decoding of in-progress
+	 * transactions.  Normally for DDLs, we log this at each command end,
+	 * however, for certain cases where we directly update the system table
+	 * without a transaction block, the invalidations are not logged till this
+	 * time.
+	 */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
 	nchildren = xactGetCommittedChildren(&children);
@@ -6022,6 +6032,13 @@ xact_redo(XLogReaderState *record)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
+	else if (info == XLOG_XACT_INVALIDATIONS)
+	{
+		/*
+		 * XXX we do ignore this for now, what matters are invalidations
+		 * written into the commit record.
+		 */
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0c0c371..f3a1c31 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -278,10 +278,39 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here.
-			 * See LogicalDecodingProcessRecord.
+			 * record if required.  So, we don't need to do anything here. See
+			 * LogicalDecodingProcessRecord.
 			 */
 			break;
+		case XLOG_XACT_INVALIDATIONS:
+			{
+				TransactionId xid;
+				xl_xact_invals *invals;
+
+				xid = XLogRecGetXid(r);
+				invals = (xl_xact_invals *) XLogRecGetData(r);
+
+				/*
+				 * Execute the invalidations for xid-less transactions,
+				 * otherwise, accumulate them so that they can be processed at
+				 * the commit time.
+				 */
+				if (TransactionIdIsValid(xid))
+				{
+					if (!ctx->fast_forward)
+						ReorderBufferAddInvalidations(reorder, xid,
+													  buf->origptr,
+													  invals->nmsgs,
+													  invals->msgs);
+					ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+													  buf->origptr);
+				}
+				else if ((!ctx->fast_forward))
+					ReorderBufferImmediateInvalidation(ctx->reorder,
+													   invals->nmsgs,
+													   invals->msgs);
+			}
+			break;
 		case XLOG_XACT_PREPARE:
 
 			/*
@@ -334,15 +363,11 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_STANDBY_LOCK:
 			break;
 		case XLOG_INVALIDATIONS:
-			{
-				xl_invalidations *invalidations =
-				(xl_invalidations *) XLogRecGetData(r);
 
-				if (!ctx->fast_forward)
-					ReorderBufferImmediateInvalidation(ctx->reorder,
-													   invalidations->nmsgs,
-													   invalidations->msgs);
-			}
+			/*
+			 * We are processing the invalidations at the command level via
+			 * XLOG_XACT_INVALIDATIONS.  So we don't need to do anything here.
+			 */
 			break;
 		default:
 			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
@@ -573,19 +598,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		commit_time = parsed->origin_timestamp;
 	}
 
-	/*
-	 * Process invalidation messages, even if we're not interested in the
-	 * transaction's contents, since the various caches need to always be
-	 * consistent.
-	 */
-	if (parsed->nmsgs > 0)
-	{
-		if (!ctx->fast_forward)
-			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
-										  parsed->nmsgs, parsed->msgs);
-		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
-	}
-
 	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
 					   parsed->nsubxacts, parsed->subxacts);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 449327a..ce6e621 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -856,6 +856,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	/* set the reference to top-level transaction */
+	subtxn->toptxn = txn;
+
 	/* add to subtransaction list */
 	dlist_push_tail(&txn->subtxns, &subtxn->node);
 	txn->nsubtxns++;
@@ -2201,7 +2204,11 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 /*
  * Setup the invalidation of the toplevel transaction.
  *
- * This needs to be done before ReorderBufferCommit is called!
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
+ * accumulates all the invalidation messages in the toplevel transaction.
+ * This is required because in some cases where we skip processing the
+ * transaction (see ReorderBufferForget), we need to execute all the
+ * invalidations together.
  */
 void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2212,17 +2219,35 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	if (txn->ninvalidations != 0)
-		elog(ERROR, "only ever add one set of invalidations");
+	/*
+	 * We collect all the invalidations under the top transaction so that we
+	 * can execute them all together.
+	 */
+	if (txn->toptxn)
+		txn = txn->toptxn;
 
 	Assert(nmsgs > 0);
 
-	txn->ninvalidations = nmsgs;
-	txn->invalidations = (SharedInvalidationMessage *)
-		MemoryContextAlloc(rb->context,
-						   sizeof(SharedInvalidationMessage) * nmsgs);
-	memcpy(txn->invalidations, msgs,
-		   sizeof(SharedInvalidationMessage) * nmsgs);
+	/* Accumulate invalidations. */
+	if (txn->ninvalidations == 0)
+	{
+		txn->ninvalidations = nmsgs;
+		txn->invalidations = (SharedInvalidationMessage *)
+			MemoryContextAlloc(rb->context,
+							   sizeof(SharedInvalidationMessage) * nmsgs);
+		memcpy(txn->invalidations, msgs,
+			   sizeof(SharedInvalidationMessage) * nmsgs);
+	}
+	else
+	{
+		txn->invalidations = (SharedInvalidationMessage *)
+			repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
+					 (txn->ninvalidations + nmsgs));
+
+		memcpy(txn->invalidations + txn->ninvalidations, msgs,
+			   nmsgs * sizeof(SharedInvalidationMessage));
+		txn->ninvalidations += nmsgs;
+	}
 }
 
 /*
@@ -2250,6 +2275,15 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+	/*
+	 * Mark top-level transaction as having catalog changes too if one of its
+	 * children has so that the ReorderBufferBuildTupleCidHash can
+	 * conveniently check just top-level transaction and decide whether to
+	 * build the hash table or not.
+	 */
+	if (txn->toptxn != NULL)
+		txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 591dd33..eee100d 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -85,6 +85,9 @@
  *	worth trying to avoid sending such inval traffic in the future, if those
  *	problems can be overcome cheaply.
  *
+ *	When wal_level=logical, write invalidations into WAL at each command end to
+ *	support the decoding of the in-progress transactions.  See
+ *      CommandEndInvalidationMessages.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -104,6 +107,7 @@
 #include "catalog/pg_constraint.h"
 #include "miscadmin.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
 #include "utils/inval.h"
@@ -1094,6 +1098,11 @@ CommandEndInvalidationMessages(void)
 
 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
 								LocalExecuteInvalidationMessage);
+
+	/* WAL Log per-command invalidation messages for wal_level=logical */
+	if (XLogLogicalInfoActive())
+		LogLogicalInvalidations();
+
 	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
 							   &transInvalInfo->CurrentCmdInvalidMsgs);
 }
@@ -1501,3 +1510,49 @@ CallSyscacheCallbacks(int cacheid, uint32 hashvalue)
 		i = ccitem->link - 1;
 	}
 }
+
+/*
+ * LogLogicalInvalidations
+ *
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end or at commit time if any invalidations
+ * are pending.
+ */
+void
+LogLogicalInvalidations()
+{
+	xl_xact_invals xlrec;
+	SharedInvalidationMessage *invalMessages;
+	int			nmsgs = 0;
+
+	/* Quick exit if we haven't done anything with invalidation messages. */
+	if (transInvalInfo == NULL)
+		return;
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	invalMessages = SharedInvalidMessagesArray;
+	nmsgs = numSharedInvalidMessagesArray;
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
+
+	if (nmsgs > 0)
+	{
+		/* prepare record */
+		memset(&xlrec, 0, MinSizeOfXactInvals);
+		xlrec.nmsgs = nmsgs;
+
+		/* perform insertion */
+		XLogBeginInsert();
+		XLogRegisterData((char *) (&xlrec), MinSizeOfXactInvals);
+		XLogRegisterData((char *) invalMessages,
+						 nmsgs * sizeof(SharedInvalidationMessage));
+		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+
+		pfree(invalMessages);
+	}
+}
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aef8555..5348011 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -146,7 +146,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
 #define XLOG_XACT_ASSIGNMENT		0x50
-/* free opcode 0x60 */
+#define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
 /* mask for filtering opcodes out of xl_info */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 019bd38..1055e99 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -220,6 +220,9 @@ typedef struct ReorderBufferTXN
 	 */
 	XLogRecPtr	end_lsn;
 
+	/* Toplevel transaction for this subxact (NULL for top-level). */
+	struct ReorderBufferTXN *toptxn;
+
 	/*
 	 * LSN of the last lsn at which snapshot information reside, so we can
 	 * restart decoding from there and fully recover this transaction from
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index bc5081c..463888c 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -61,4 +61,6 @@ extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
 
 extern void InvalidateSystemCaches(void);
+
+extern void LogLogicalInvalidations(void);
 #endif							/* INVAL_H */
-- 
1.8.3.1
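
The v40-0002 patch below extends the output plugin API with the stream
callbacks. In outline, a plugin opts into streaming by filling the new
callback slots at initialization time; this is a minimal sketch with
placeholder my_stream_* names (test_decoding's real implementations are in
the diff):

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* ... existing non-streaming callbacks elided ... */

	/* required for streaming of in-progress transactions */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;

	/* optional, like their non-streaming counterparts */
	cb->stream_message_cb = my_stream_message;
	cb->stream_truncate_cb = my_stream_truncate;
}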

v40/v40-0002-Extend-the-logical-decoding-output-plugin-API-wi.patch:

From 23a739d2cd1b6a2293e2f363c5fee491042c7d24 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v40 2/8] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, adding support for
streaming changes for large transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk of
changes streamed for a particular toplevel transaction.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 123 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 +++++++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  59 +++++
 6 files changed, 825 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..4a71826 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,97 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	OutputPluginPrepareWrite(ctx, true);
+	appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93c..18116c8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_change_cb</function>,
+     <function>stream_commit_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_start_cb</function> and <function>stream_stop_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+    
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        Relation relation,
+                                        ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr message_lsn,
+                                        bool transactional,
+                                        const char *prefix,
+                                        Size message_size,
+                                        const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                         ReorderBufferTXN *txn,
+                                         int nrelations,
+                                         Relation relations[],
+                                         ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callback to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_change_cb</function>, <function>stream_commit_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function>) and (<function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds limit defined by <varname>logical_decoding_work_mem</varname> setting.
+    At that point the largest toplevel transaction (measured by amount of memory
+    currently used for decoded changes) is selected and streamed.  However, in
+    some cases we still have to spill to the disk even if streaming is enabled
+    because if we cross the memory limit but we still have not decoded the
+    complete tuple e.g. only decoded toast table insert but not the main table
+    insert.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be..6ee59bd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callback are optional, similar to
+	 * regular output plugins. We however enable streaming when at least one
+	 * of the methods is enabled, so that we can easily identify missing
+	 * methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up-to-date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up-to-date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
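To illustrate the plugin side of these wrappers, a minimal stream_change
callback might look like the following sketch (the my_stream_change name is
made up; OutputPluginPrepareWrite and OutputPluginWrite are the existing
output-plugin write helpers):

static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)
{
	/* emit one output line per streamed change, tagged with the xid */
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
	OutputPluginWrite(ctx, true);
}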
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475..deef318 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
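The intent is that the wrappers above only run when this flag ends up true:
the core code enables it when the plugin provides the stream callbacks, and
the plugin may further restrict it at startup. A sketch of the opt-in
pattern, mirroring the test_decoding change later in this series (my_startup
and enable_streaming are placeholder names):

static void
my_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
		   bool is_init)
{
	bool		enable_streaming = false;

	/* ... parse plugin options, possibly setting enable_streaming ... */

	/* keep streaming enabled only if the plugin option asks for it too */
	ctx->streaming &= enable_streaming;
}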
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..2d9aa11 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if the transaction is streamed in
+ * multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping the streaming of a block of changes from an
+ * in-progress transaction to a remote node (may be called repeatedly, if the
+ * transaction is streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
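Wiring the new members up in an output plugin might look like the following
sketch (the callback names are hypothetical; per the wrappers in logical.c,
the start/stop/abort/commit/change callbacks are required once streaming is
enabled, while the message and truncate ones are optional):

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* existing non-streaming callbacks elided */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;
	cb->stream_message_cb = my_stream_message;		/* optional */
	cb->stream_truncate_cb = my_stream_truncate;	/* optional */
}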
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1055e99..42bc817 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -387,6 +435,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1

v40/v40-0003-Implement-streaming-mode-in-ReorderBuffer.patch

From a1f9860907d17a88060d891740161d80864488d2 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v40 3/8] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast or speculative insert, we spill to disk, because we
cannot generate the complete tuple and stream it.  As soon as we get the
complete tuple, we stream the transaction including the serialized
changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

Now that we can stream in-progress transactions, concurrent aborts may
cause failures when the output plugin consults catalogs (both system and
user-defined).

We handle such failures by returning the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from the system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. On receipt of such a
sqlerrcode, the decoding logic aborts the ongoing decoding and returns
gracefully.

Each ReorderBufferChange carries a ReorderBufferTXN pointer, by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard on stream_abort_cb (e.g. when a subxact gets
discarded; see the sketch below).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
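
To make the ReorderBufferTXN point above concrete, an abort callback on the
downstream side could be sketched as follows (discard_changes_for_xid is a
hypothetical helper, not part of this patch):

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	/*
	 * txn may be a subtransaction; dropping only the state keyed by its xid
	 * discards that subxact's changes while the toplevel stream continues.
	 */
	discard_changes_for_xid(txn->xid);
}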
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/stream.out       |  40 +
 contrib/test_decoding/expected/truncate.out     |   6 +
 contrib/test_decoding/sql/stream.sql            |  21 +
 contrib/test_decoding/sql/truncate.sql          |   1 +
 contrib/test_decoding/test_decoding.c           |  13 +
 doc/src/sgml/logicaldecoding.sgml               |   9 +-
 doc/src/sgml/test-decoding.sgml                 |  22 +
 src/backend/access/heap/heapam.c                |  13 +
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/access/index/genam.c                |  53 ++
 src/backend/access/table/tableam.c              |   8 +
 src/backend/access/transam/xact.c               |  19 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/logical.c       |  10 +
 src/backend/replication/logical/reorderbuffer.c | 969 +++++++++++++++++++++---
 src/include/access/heapam_xlog.h                |   1 +
 src/include/access/tableam.h                    |  55 ++
 src/include/access/xact.h                       |   4 +
 src/include/replication/logical.h               |   1 +
 src/include/replication/reorderbuffer.h         |  56 +-
 21 files changed, 1256 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..bdf7002
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,40 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+ count 
+-------
+   157
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..24d41b1
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 4a71826..bb3d9f3 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 18116c8..98b47b0 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4c..33a4580 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at the
+	 * tableam level API, but heap_getnext is called from many places, so we
+	 * need to check it here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1945,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set, then set a flag to indicate that a system
+	 * table scan is in progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle a concurrent abort of CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted. We can't directly use
+ * TransactionIdDidAbort, as after a crash such a transaction might not have
+ * been marked as aborted.  See detailed comments in xact.c where the
+ * variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
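
The point of raising ERRCODE_TRANSACTION_ROLLBACK here is that the decoding
side can recognize it and bail out cleanly. A condensed sketch of the
catch-side handling, simplified from the reorderbuffer.c changes further
down in this patch (cleanup of iterator and subtransaction state elided):

PG_CATCH();
{
	MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
	ErrorData  *errdata = CopyErrorData();

	if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
	{
		/* concurrent abort detected: stop decoding this xact gracefully */
		curtxn->concurrent_abort = true;
		MemoryContextSwitchTo(ecxt);
		FlushErrorState();
	}
	else
	{
		MemoryContextSwitchTo(ecxt);
		PG_RE_THROW();
	}
}
PG_END_TRY();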
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29..a61e279 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -235,6 +235,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..99722ee 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for concurrent
+ * aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
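
Taken together, the lifecycle of these two variables during streaming is
roughly the following (a condensed sketch of code appearing elsewhere in
this patch, not a verbatim excerpt):

	/* decoder, before processing a change of an in-progress transaction */
	SetupCheckXidLive(change->txn->xid);	/* sets CheckXidAlive */

	/* output plugin: catalog access must go through the systable_* APIs */
	scan = systable_beginscan(rel, indexId, true, NULL, nkeys, keys);
	tuple = systable_getnext(scan);	/* errors out with
									 * ERRCODE_TRANSACTION_ROLLBACK if
									 * CheckXidAlive aborted concurrently */
	systable_endscan(scan);			/* clears bsysscan */

	/* on (sub)transaction abort in the decoding backend */
	ResetLogicalStreamingState();	/* clears CheckXidAlive and bsysscan */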
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f3a1c31..f21f61d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6ee59bd..8deff89 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ce6e621..c469536 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes, so if we have a partial change like a
+ * toast table insert or a speculative insert, we mark such a 'txn' so that it
+ * can't be streamed.  We also ensure that if the changes in such a 'txn'
+ * exceed the logical_decoding_work_mem threshold, we stream them as soon as
+ * we have a complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get the insert or update on the
+	 * main table (both update and insert will do the insert in the toast
+	 * table).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * A speculative confirm change must be preceded by a speculative
+		 * insert.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it was serialized before and its changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for streaming such a transaction as soon as we get the
+	 * complete change is that it has already exceeded the memory threshold
+	 * but could not be streamed earlier because of the incomplete changes.
+	 * Delaying it further would only increase the apply lag.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed once we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes, we may have detected that the
+	 * transaction was aborted, in which case there is no point in collecting
+	 * further changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -764,6 +884,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1310,6 +1468,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1502,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and its subtransactions) after
+ * streaming them.  Keep the remaining info - transactions, tuplecids,
+ * invalidations and snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they originally happened inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked as
+	 * streamed always, even if it does not contain any changes (that is, when
+	 * all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1737,178 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that
+ * the (sub)transaction might get aborted concurrently.  In such a case, if
+ * the (sub)transaction has a catalog update, we might decode the tuple using
+ * the wrong catalog version.  So, to detect the concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction to which the current
+ * change belongs.  During a catalog scan we can then check the status of
+ * that xid, and if it has aborted we report a specific error so that we can
+ * stop streaming the current transaction and discard the already streamed
+ * changes.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine, because when we decode the
+ * abort we will stream an abort message to truncate the changes in the
+ * subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid has aborted; that will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying a change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse them when sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN such that it
+ * can be used to stream the remaining data of the transaction being
+ * processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true, the data will be sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1931,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1947,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2053,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2090,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2119,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2152,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2164,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2195,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2241,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2249,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, send the last message for this set of
+		 * changes depending upon streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2295,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2334,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
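
(As a quick illustration of the flow above -- not part of the patch -- a
minimal output-plugin consumer of the stream API might look as follows,
assuming the callback signatures match the wrappers added in logical.c;
the my_* names are hypothetical.)

/* Hypothetical plugin callbacks; one start/stop pair brackets each run. */
static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* called once per run, right before the first change is sent */
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "stream start, xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* called after the last change of the run (the stream_stop call above) */
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfoString(ctx->out, "stream stop");
	OutputPluginWrite(ctx, true);
}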
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1931,6 +2460,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could load
+		 * the cache as per the current transaction's view (consider DDLs that
+		 * happened in this transaction). We don't want the decoding of future
+		 * transactions to use those cache entries, so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2545,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2631,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2680,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is supported, we additionally maintain a total_size
+ * counter in the toplevel transaction - we can't stream subtransactions
+ * individually anyway, and we only pick toplevel transactions for
+ * eviction by streaming, so that is the size that matters there.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2702,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2715,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming supported, update the total size in top level as well. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
 }
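
(To make the accounting concrete -- a standalone toy model in plain C, not
PostgreSQL code: every change of size sz in a (sub)transaction bumps its own
size, the buffer-wide size, and the toplevel total_size.)

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Txn
{
	struct Txn *toptxn;			/* NULL for a toplevel transaction */
	size_t		size;			/* changes in this (sub)transaction only */
	size_t		total_size;		/* toplevel: includes all subtransactions */
} Txn;

static void
account(Txn *txn, size_t *rb_size, size_t sz, bool addition)
{
	/* route the total_size update to the toplevel, as in the patch */
	Txn		   *top = txn->toptxn ? txn->toptxn : txn;

	if (addition)
	{
		txn->size += sz;
		*rb_size += sz;
		top->total_size += sz;
	}
	else
	{
		assert(txn->size >= sz && *rb_size >= sz);
		txn->size -= sz;
		*rb_size -= sz;
		top->total_size -= sz;
	}
}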
 
 /*
@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2388,6 +2971,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming is supported, so their size
+ * is always 0), but here we can simply iterate over the limited number of
+ * toplevel transactions.
+ *
+ * Note that we skip transactions that contain incomplete changes. There
+ * is scope for optimization here: we could select the largest transaction
+ * that has complete changes, but that would make the code and design quite
+ * complex, and it might not be worth the benefit. If we plan to stream
+ * transactions that contain incomplete changes, then we need to find a way
+ * to partially stream/truncate the transaction changes in-memory and build
+ * a mechanism to partially truncate the spilled files. Additionally,
+ * whenever we partially stream a transaction, we need to remember the last
+ * streamed LSN so that next time we can restore from that segment and
+ * offset in the WAL. And as we stream the changes from the top transaction
+ * and restore them subtransaction-wise, we would also need to remember the
+ * subxact from which we streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +3047,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3151,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3363,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * We can't start streaming immediately even if streaming is enabled,
+	 * because we may have already decoded this WAL previously and now just
+	 * be restarting, in which case the snapshot builder tells us to skip it.
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all the subtransactions to
+	 * the snapshot's xip array via SnapBuildCommittedTxn, we can't do that
+	 * here; instead we add them to the subxip array via
+	 * ReorderBufferCopySnap. This allows the catalog changes made in
+	 * subtransactions decoded so far to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * Nah, we already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that as we may be
+		 * using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
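
(The snapshot handling above condenses to the following sketch -- illustrative
only, reusing the static helpers from this file; stream_run_snapshot is a
hypothetical name.)

static Snapshot
stream_run_snapshot(ReorderBuffer *rb, ReorderBufferTXN *txn, CommandId *cid)
{
	if (txn->snapshot_now == NULL)
	{
		/* first run: start from the (transferred) base snapshot */
		*cid = FirstCommandId;
		return ReorderBufferCopySnap(rb, txn->base_snapshot, txn, *cid);
	}

	/* later runs: re-copy so newly arrived subxacts are included */
	*cid = txn->command_id;
	return ReorderBufferCopySnap(rb, txn->snapshot_now, txn, *cid);
}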
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3593,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4302,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4592,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0d28f01..b188427 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5348011..c18554b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert without the main-table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

v40/v40-0004-Extend-the-BufFile-interface-for-the-streaming-o.patch

From ed38fb1ad5b22ab10b7c4cf3cb47fb991a1a5af2 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v40 4/8] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up
to a particular offset.  Extend the BufFileSeek API to support the
SEEK_END case.  Add an option to provide a mode while opening the shared
BufFiles, instead of always opening them in read-only mode.
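
(A rough usage sketch of the extended interface from a single backend, across
transactions; the identifiers are from this patch, but the flow and the file
name are made up for illustration.)

SharedFileSet fileset;
BufFile    *file;

/* NULL dsm segment => single-backend mode, files removed at proc exit */
SharedFileSetInit(&fileset, NULL);
file = BufFileCreateShared(&fileset, "xid-513-changes");
/* ... write some data; the file survives past BufFileClose ... */
BufFileClose(file);

/* later, possibly in another transaction: reopen read-write */
file = BufFileOpenShared(&fileset, "xid-513-changes", O_RDWR);

/* position at the end to append (the new SEEK_END support) */
if (BufFileSeek(file, 0, 0, SEEK_END) != 0)
	elog(ERROR, "could not seek to end of temporary file");

/* ... or discard everything past fileno 0, offset 0 ... */
BufFileTruncateShared(file, 0, 0);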
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2..6c97f68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 3907349..1140cf8 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as a
+ * member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY);
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the file size of the last file to get the last offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files up to the fileno to which we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files beyond the fileno can simply be deleted.  If the offset is 0
+		 * then the fileno file can be deleted as well, unless it is the
+		 * first file, which we always retain.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
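
(A worked example of the function above, assuming four MAX_PHYSICAL_FILESIZE
segments 0..3:)

/*
 * BufFileTruncateShared(file, 2, 100):
 *
 *   segment 3: i > fileno             -> closed and deleted
 *   segment 2: i == fileno, offset>0  -> truncated to offset 100
 *   segments 0 and 1:                 -> untouched
 *
 * Afterwards numFiles == 3 and curOffset == 100.  With offset == 0 the
 * fileno segment itself is deleted as well, unless it is segment 0,
 * which is always retained (and merely truncated).
 */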
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but need to be opened and closed multiple times and the
+ * underlying files need to survive across transactions.  For such cases, the
+ * dsm segment 'seg' should be passed as NULL.  We remove such files on proc
+ * exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering the
+			 * fileset cleanup.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on process exit.  This will
+ * process the list of all the registered sharedfilesets and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool		found PG_USED_FOR_ASSERTS_ONLY = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is following the DSM-based cleanup then we don't
+	 * maintain the filesetlist, so just return.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

v40/v40-0005-Add-support-for-streaming-to-built-in-replicatio.patch

From 93965bcb85789f745d0cadffcb0a7c6cad4b1c81 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v40 5/8] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere to
send the data anyway.
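
(For reference, a sketch of what the apply side has to do with the new XID
prefix; read_optional_xid is a hypothetical helper, but pq_getmsgint and the
message layout match the proto.c changes below.)

static TransactionId
read_optional_xid(StringInfo in, bool in_streamed_block)
{
	/*
	 * Inside a streamed block, every change message carries the XID of the
	 * (sub)transaction it belongs to, right after the action byte; outside
	 * of streaming the field is simply absent.
	 */
	if (in_streamed_block)
		return (TransactionId) pq_getmsgint(in, 4);

	return InvalidTransactionId;
}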
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 946 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 19 files changed, 2060 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68..479e3ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e905723..a6101ac 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
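[For reference, not part of the patch: with streaming enabled and a v14+
server, the constructed command ends up looking roughly like this (slot and
publication names are made up):

    START_REPLICATION SLOT "mysub" LOGICAL 0/0
        (proto_version '2', streaming 'on', publication_names '"mypub"')
]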
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..ff25924 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
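
[For reference, the wire format added above works out to the following
(derived directly from the write functions; integers are in network byte
order):

    'S'  int32 xid  uint8 first_segment                        - STREAM START
    'E'                                                        - STREAM STOP
    'c'  int32 xid  uint8 flags  int64 commit_lsn  int64 end_lsn
         int64 commit_time                                     - STREAM COMMIT
    'A'  int32 xid  int32 subxid                               - STREAM ABORT

and each regular message ('R', 'Y', 'I', 'U', 'D', 'T') gains an int32 xid
right after the action byte, present only while streaming.]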
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 2fcf2e6..98e7fd0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires handling aborts of both the toplevel transaction and its
+ * subtransactions. This is achieved by tracking the offset of each
+ * subtransaction's first change, which is then used to truncate the file
+ * with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files exceeding the OS file size limit,
+ * (b) it provides automatic cleanup on error, and (c) it allows the files to
+ * survive across local transactions, so they can be opened and closed at
+ * each stream start/stop.  We build this on the SharedFileSet infrastructure
+ * because a plain BufFile is deleted as soon as it is closed, and keeping
+ * the stream files open across start/stop would consume a lot of memory
+ * (more than 8kB per file).  Moreover, without SharedFileSet we would need
+ * to invent a new way of passing filenames to the BufFile APIs, so that the
+ * same file can be reopened across multiple stream-open calls for the same
+ * transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid, we create this entry
+ * in the xidhash, create the streaming file, and store the fileset handle.
+ * The subxact file is created iff there is any subxact info under this xid.
+ * The entry is used on subsequent streams for the xid to look up the
+ * corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table storing the streaming xid information along with the shared
+ * filesets for the streaming and subxact files.  On every stream start we
+ * need to open the xid's files, and for that we need the shared fileset
+ * handles, so keeping them in a hash keyed by xid makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Functions for handling information about subtransactions of a given
+ * toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first_segment);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the message to the spool file for the proper toplevel
+ * transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first field of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +752,323 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * An ORIGIN message can only come inside a remote transaction or a
+	 * streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the BufFile,
+	 * used for serializing the streamed data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* If this is not the first segment, read the existing subxact info. */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
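+/*
+ * To illustrate the two handlers above: a large transaction arrives at the
+ * subscriber as a sequence of messages like (a sketch of one possible
+ * interleaving)
+ *
+ *   S(xid, first) I I U ... E   S(xid) D I ... E   ...   c(xid)
+ *
+ * i.e. one or more start/stop blocks spooling changes into the file,
+ * followed by a single stream commit (or abort) that replays (or discards)
+ * the spooled changes.
+ */
+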
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty subtransaction, we will not find the subxid here,
+		 * so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
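+
+/*
+ * Worked example (hypothetical values): if the subxact file records entries
+ * {xid=601, offset=0}, {xid=602, offset=8192}, {xid=603, offset=20000} and
+ * an abort for subxid 602 arrives, we truncate the changes file at offset
+ * 8192 and keep only the entry for 601; everything spooled for 602 and 603
+ * (which must be nested inside 602, given that the changes arrive in WAL
+ * order) is discarded.
+ */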
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them to apply_dispatch, just as
+	 * if they had arrived from the upstream.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Apply the change in ApplyMessageContext, which is reset per change. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1081,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1099,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1138,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1256,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1408,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1781,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1922,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2050,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  It is reset at each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2162,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2435,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option was changed. The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2481,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* the entry for the top transaction must exist by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions, there is nothing to do, but if a
+	 * subxact file already exists, delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not exist yet, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * that memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need it for the whole duration of the stream, so that we can keep
+	 * adding subtransaction info to it.  On stream stop we flush the
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're handling the same subxact as in the previous call,
+	 * so make sure to ignore it (its start offset was recorded already).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - don't report an error for a missing file if the flag is true.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * BufFile, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context so that
+	 * they stay open until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. when the
+ * stream_stop message arrives from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (not including the
+ * length field itself), an action code (identifying the message type) and
+ * the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
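+
+/*
+ * The resulting on-disk record layout is therefore
+ *
+ *   int32 len | char action | payload (len - 1 bytes)
+ *
+ * where len counts the action byte but not the length field itself, which
+ * matches what apply_handle_stream_commit() expects when replaying the file.
+ */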
+
+/*
+ * Free the memory allocated for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2145,6 +3080,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this, we
+ * maintain a list of xids (streamed_txns) for which we have already sent the
+ * schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's the toplevel transaction or not (the toplevel XID was
+	 * already sent at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may not be applied later at all (and regular
+	 * transactions won't see their effects until then), and may be applied
+	 * in an order we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called in both streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify the downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify the downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions, so a simple
+ * linear search of the list is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record the xid in the rel sync entry, marking that we have already sent
+ * the schema of the relation within this streamed transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;                 /* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..655144d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v40/v40-0006-Enable-streaming-for-all-subscription-TAP-tests.patch

From af2c660e40c3b2cd59a5bcc84463808b18f0258c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v40 6/8] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b5..21410fa 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v40/v40-0007-Add-TAP-test-for-streaming-vs.-DDL.patch

From 2bdca16fe396938be82edcdcc21aeedecccf83e4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v40 7/8] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v40/v40-0008-Add-streaming-option-in-pg_dump.patch

From ed6477ee711147d6e76d0fdde9c73f05dd43acf2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v40 8/8] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 94459b3..f69d64c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char       *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1
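
As a usage sketch of the pg_dump change above (object names
hypothetical), a subscription created with streaming enabled would now
be dumped roughly as:

    CREATE SUBSCRIPTION sub1 CONNECTION 'host=... dbname=postgres'
        PUBLICATION pub1
        WITH (connect = false, slot_name = 'sub1', streaming = on);

so the streaming setting survives a dump/restore cycle.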

#451Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#450)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

You are right. I have changed it.

Thanks, I have pushed the second patch in this series which is
0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest
patch. I will continue working on remaining patches.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#452Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#451)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jul 23, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

You are right. I have changed it.

Thanks, I have pushed the second patch in this series which is
0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest
patch. I will continue working on remaining patches.

I have reviewed and made a number of changes to the next patch, which
extends the logical decoding output plugin API with stream methods
(v41-0001-Extend-the-logical-decoding-output-plugin-API-wi).

1. I think we need handling of include_xids and include_timestamp, but
not skip_empty_xacts, in the new APIs; as of now, none of these options
is respected. We need 'include_xids' handling because we need to
include the xid with stream messages (see the sketch after this list),
and similarly 'include_timestamp' for stream commit messages. OTOH, we
never use streaming mode for empty xacts, so we don't need to bother
about skip_empty_xacts in the streaming APIs.
2. I also made a number of changes to the documentation and comments,
plus other cosmetic adjustments.
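
For illustration (hypothetical xids; the message text follows the
strings in the attached test_decoding changes), the streamed output
with include_xids enabled might look like:

    opening a streamed block for transaction TXN 508
    streaming change for TXN 508
    streaming change for TXN 508
    closing a streamed block for transaction TXN 508
    opening a streamed block for transaction TXN 509
    streaming change for TXN 509
    closing a streamed block for transaction TXN 509
    committing streamed transaction TXN 508

With include_timestamp, the commit line additionally carries the commit
time. The xid tag is what lets a consumer attribute interleaved stream
blocks from different large transactions.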

Kindly review/test and let me know if you see any problems with the
above changes.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v41.tar (application/x-tar)
v41-0001-Extend-the-logical-decoding-output-plugin-API-wi.patch

From 7be6e2e7bc36b50e1bfc07b89dcb80b22d3c9149 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v41 1/7] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, adding support for
streaming changes for large in-progress transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk of changes
streamed for a particular toplevel transaction.

This commit simply adds these new APIs, and the upcoming patch to "allow
the streaming mode in ReorderBuffer" will make use of them.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 176 +++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 +++++++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 ++++++
 src/include/replication/reorderbuffer.h   |  59 +++++
 6 files changed, 878 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c9488..dbef52a 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,150 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+/*
+ * We never try to stream any empty xact so we don't need any special handling
+ * for skip_empty_xacts in streaming mode APIs.
+ */
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "opening a streamed block for transaction");
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * We never try to stream any empty xact so we don't need any special handling
+ * for skip_empty_xacts in streaming mode APIs.
+ */
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "closing a streamed block for transaction");
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * We never try to stream any empty xact so we don't need any special handling
+ * for skip_empty_xacts in streaming mode APIs.
+ */
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "aborting streamed (sub)transaction");
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * We never try to stream any empty xact so we don't need any special handling
+ * for skip_empty_xacts in streaming mode APIs.
+ */
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "committing streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "streaming change for transaction");
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "streaming truncate for transaction");
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93c..791a62b 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_start_cb</function>,
+     <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                           ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                             ReorderBufferTXN *txn,
+                                             XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                             ReorderBufferTXN *txn,
+                                             Relation relation,
+                                             ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr message_lsn,
+                                              bool transactional,
+                                              const char *prefix,
+                                              Size message_size,
+                                              const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               int nrelations,
+                                               Relation relations[],
+                                               ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_start_cb</function>, <function>stream_stop_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_commit_cb</function>
+    and <function>stream_change_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may cross the memory limit before having
+    decoded the complete tuple, e.g. having decoded only the toast table
+    insert but not yet the main table insert.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be..05d24b9 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. We, however, enable streaming when at least one
+	 * of the methods is defined so that we can easily identify missing
+	 * methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this message's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this message's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475..deef318 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -80,6 +80,11 @@ typedef struct LogicalDecodingContext
 	void	   *output_writer_private;
 
 	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236..b78c796 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called when starting to stream a block of changes from in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping to stream a block of changes from in-progress
+ * transaction to a remote node (may be called repeatedly, if it's streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to remote node from in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
+/*
  * Output plugin callbacks
  */
 typedef struct OutputPluginCallbacks
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1055e99..42bc817 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -387,6 +435,17 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
+	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
 	void	   *private_data;
-- 
1.8.3.1
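
To make the 0001 API above concrete, here is a minimal sketch (not part
of the patch set) of an output plugin that opts into streaming. The
callback signatures and the OutputPluginCallbacks fields are the ones
added by the patch; the module name "demo_streaming" and the output
strings are invented for illustration, and the regular
begin/change/commit callbacks (still mandatory for any plugin) are
omitted for brevity:

#include "postgres.h"

#include "fmgr.h"
#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);

/* Open a block of streamed changes of one in-progress transaction. */
static void
demo_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "stream start: xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* Close the currently open block of streamed changes. */
static void
demo_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "stream stop: xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* Discard everything streamed so far for the aborted (sub)transaction. */
static void
demo_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				  XLogRecPtr abort_lsn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "stream abort: xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/* All blocks were streamed; the transaction has committed. */
static void
demo_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				   XLogRecPtr commit_lsn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "stream commit: xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

/*
 * One change inside a streamed block.  The transaction may still abort
 * later, so the contents must not be presented as committed data yet.
 */
static void
demo_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				   Relation relation, ReorderBufferChange *change)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "stream change: xid %u", txn->xid);
	OutputPluginWrite(ctx, true);
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* ... begin_cb / change_cb / commit_cb etc. set here as usual ... */
	cb->stream_start_cb = demo_stream_start;
	cb->stream_stop_cb = demo_stream_stop;
	cb->stream_abort_cb = demo_stream_abort;
	cb->stream_commit_cb = demo_stream_commit;
	cb->stream_change_cb = demo_stream_change;
	/* stream_message_cb and stream_truncate_cb stay NULL (optional) */
}

With the required callbacks set like this, the wrappers in logical.c
turn ctx->streaming on, and a large transaction arrives downstream as
one or more stream_start / stream_change / stream_stop blocks followed
eventually by stream_commit (or stream_abort).
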

v41-0002-Implement-streaming-mode-in-ReorderBuffer.patch

From 0a917eb6fe5343828eeccbebf61d2855aec2d5e8 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v41 2/7] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast or speculative insert, we spill to disk because we
cannot generate (and hence stream) the complete tuple.  As soon as we
get the complete tuple, we stream the transaction including the
serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. The decoding logic on the
receipt of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.

We have ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/stream.out       |  40 +
 contrib/test_decoding/expected/truncate.out     |   6 +
 contrib/test_decoding/sql/stream.sql            |  21 +
 contrib/test_decoding/sql/truncate.sql          |   1 +
 contrib/test_decoding/test_decoding.c           |  13 +
 doc/src/sgml/logicaldecoding.sgml               |   9 +-
 doc/src/sgml/test-decoding.sgml                 |  22 +
 src/backend/access/heap/heapam.c                |  13 +
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/access/index/genam.c                |  53 ++
 src/backend/access/table/tableam.c              |   8 +
 src/backend/access/transam/xact.c               |  19 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/logical.c       |  10 +
 src/backend/replication/logical/reorderbuffer.c | 969 +++++++++++++++++++++---
 src/include/access/heapam_xlog.h                |   1 +
 src/include/access/tableam.h                    |  55 ++
 src/include/access/xact.h                       |   4 +
 src/include/replication/logical.h               |   1 +
 src/include/replication/reorderbuffer.h         |  56 +-
 21 files changed, 1256 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..bdf7002
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,40 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'stream-changes', '1');
+ count 
+-------
+   157
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..24d41b1
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index dbef52a..d8e2b41 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 791a62b..1571d71 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4c..33a4580 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at the
+	 * tableam API level, but this is called from many places so we need to
+	 * ensure it here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1945,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out, if CheckXidAlive is aborted. We can't directly use
+ * TransactionIdDidAbort as after crash such transaction might not have been
+ * marked as aborted.  See detailed comments in xact.c where the variable
+ * is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29..a61e279 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -235,6 +235,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..99722ee 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive is aborted after fetching the tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs because we are checking for the
+ * concurrent aborts only in systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f3a1c31..f21f61d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 05d24b9..42f284b 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ce6e621..c469536 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.
+ * We can stream only complete changes, so if we have a partial change like a
+ * toast table insert or a speculative insert then we mark such a 'txn' so
+ * that it can't be streamed.  We also ensure that if the changes in such a
+ * 'txn' exceed the logical_decoding_work_mem threshold, we stream them as
+ * soon as we have a complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get the insert or update on the
+	 * main table (both updates and inserts perform the toast table inserts).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * Speculative confirm change must be preceded by speculative insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it was serialized before and the changes in
+	 * the top-level transaction are now complete.
+	 *
+	 * The reason for streaming such a transaction as soon as we get the
+	 * complete change is that it has previously reached the memory threshold
+	 * but couldn't be streamed because of the incomplete changes.  Delaying
+	 * such transactions would increase their apply lag.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
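
To make the partial-change tracking concrete, a single INSERT with one
toasted column produces a change sequence like this (an illustrative trace,
not code from the patch):

	/* change 1: INSERT into the toast relation  -> set RBTXN_HAS_TOAST_INSERT   */
	/* change 2: INSERT into the toast relation  -> flag stays set               */
	/* change 3: INSERT into the main table      -> clear RBTXN_HAS_TOAST_INSERT */

Only after the third change is the tuple complete, so only then may the
top-level transaction be picked for streaming again.
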
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed once we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * If, while streaming the previous changes, we have detected that the
+	 * transaction was aborted, there is no point in collecting further
+	 * changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -764,6 +884,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1310,6 +1468,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1502,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is,
+	 * when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1737,178 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, there is a possibility that
+ * the (sub)transaction might get aborted concurrently.  In such a case, if
+ * the (sub)transaction has made catalog updates, we might decode a tuple
+ * using the wrong catalog version.  So, to detect a concurrent abort, we set
+ * CheckXidAlive to the xid of the (sub)transaction the current change
+ * belongs to.  During a catalog scan we can then check the status of that
+ * xid, and if it has aborted we report a specific error so that we can stop
+ * streaming the current transaction and discard the already streamed
+ * changes.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine, because when we decode the
+ * abort we will stream an abort message to truncate the changes in the
+ * subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the transaction is not committed yet.  We don't
+	 * check whether the xid has aborted; that will happen during catalog
+	 * access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
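
An illustrative timeline of how this plays out while streaming (the step
numbering here is for exposition only):

	/* 1. decode a change belonging to subxact X -> SetupCheckXidLive(X)    */
	/* 2. a systable_* lookup is done while processing the change           */
	/* 3. xid X aborts concurrently                                         */
	/* 4. the next systable fetch sees the abort and raises                 */
	/*    ERRCODE_TRANSACTION_ROLLBACK, caught by ReorderBufferProcessTXN   */
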
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream, so that we can reuse them while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN such that it
+ * can be used to stream the remaining data of the transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1931,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1947,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream_start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2053,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2090,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2119,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2152,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2164,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2195,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2241,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2249,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes; send the last message for this set
+		 * of changes depending upon the streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2295,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2334,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error; see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1931,6 +2460,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could load
+		 * the cache as per the current transaction's view (consider DDLs that
+		 * happened in this transaction). We don't want the decoding of future
+		 * transactions to use those cache entries, so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2545,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2631,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2680,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we have reached the memory limit; the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2702,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2715,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming supported, update the total size in top level as well. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
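
A quick worked example of the accounting, assuming streaming is enabled: if
a subtransaction queues a change of 100 bytes, then

	subtxn->size       += 100;	/* per-(sub)xact counter, used for spilling */
	rb->size           += 100;	/* buffer-wide counter, checked vs the GUC  */
	toptxn->total_size += 100;	/* toplevel counter, used to pick the       */
								/* transaction to stream                    */
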
 
 /*
@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2388,6 +2971,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting of subtransactions when streaming, so it's always 0).  But we
+ * can simply iterate over the limited number of toplevel transactions.
+ *
+ * Note that we skip transactions that contain incomplete changes.  There is
+ * scope for optimization here: we could select the largest transaction that
+ * has complete changes.  But that would make the code and design quite
+ * complex, and it might not be worth the benefit.  If we wanted to stream
+ * transactions that contain incomplete changes, we would need a way to
+ * partially stream/truncate the transaction changes in memory, and a
+ * mechanism to partially truncate the spilled files.  Additionally, whenever
+ * we partially streamed a transaction, we would need to remember the last
+ * streamed lsn, so that next time we could restore from that segment and
+ * offset in the WAL.  And as we stream the changes from the top transaction
+ * and restore them subtransaction-wise, we would even need to remember the
+ * subxact from which we streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +3047,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3151,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3363,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * Even if streaming is enabled, we can't start streaming immediately if
+	 * we previously decoded this transaction and are now just restarting.
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a sub transaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about the base snapshot here, similar to
+	 * what ReorderBufferCommit() does.  That relies on base_snapshot getting
+	 * transferred from the subxact in ReorderBufferCommitChild(), but that
+	 * has not yet been called, as the transaction is still in progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all the subtransactions to
+	 * the snapshot's xip array via SnapBuildCommittedTxn, we can't do that
+	 * here; instead we add them to the subxip array via
+	 * ReorderBufferCopySnap.  This allows the catalog changes made in
+	 * subtransactions decoded so far to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run.  We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through the subxacts again).  In fact, we must not do that, as
+		 * we may be using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because after the last
+		 * streaming run we might have gotten some new subtransactions.  So
+		 * we need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
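
For orientation, the output plugin callback sequence for a large transaction
streamed in two runs looks roughly like this (a sketch, not normative
protocol documentation):

	stream_start -> stream_change ... stream_change -> stream_stop    /* run 1 */
	stream_start -> stream_change ... stream_change -> stream_stop    /* run 2 */
	stream_commit                  /* on commit; or stream_abort on an abort */
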
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3593,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4302,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4592,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0d28f01..b188427 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5348011..c18554b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert, without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

v41-0003-Extend-the-BufFile-interface-for-the-streaming-o.patch

From 5b5756704071462414335480c3463f965b4dfc81 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v41 3/7] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement the BufFileTruncate interface to allow files to be truncated up
to a particular offset.  Extend the BufFileSeek API to support the SEEK_END
case.  Add an option to provide a mode while opening shared BufFiles,
instead of always opening them in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2..6c97f68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 2d7a082..a9ca5d9 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile supports temporary files that can be used by a single backend when
+ * the corresponding files need to survive across transactions and need to be
+ * opened and closed multiple times.  Such files need to be created as a
+ * member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The file size of the last file gives us the end offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop backwards over the files, down to the fileno we truncate to. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files other than the fileno file can be deleted outright.  The
+		 * fileno file itself can be deleted too if the offset is 0, unless
+		 * it is the first file of the BufFile.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
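
As a concrete example of these semantics (hypothetical segment layout for
illustration; BufFile splits its data at MAX_PHYSICAL_FILESIZE boundaries):
suppose a shared BufFile spans segment files 0, 1 and 2.

	/* Deletes segments 2 and 1 outright (offset 0 means nothing of
	 * segment 1 survives); segment 0 is never deleted, only truncated. */
	BufFileTruncateShared(file, 1, 0);

	/* Deletes segment 2 and shrinks segment 1 to 100 bytes. */
	BufFileTruncateShared(file, 1, 100);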
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * This interface can also be used when the temporary files are used by a
+ * single backend but need to be opened and closed multiple times and the
+ * underlying files need to survive across transactions.  For such cases,
+ * the dsm segment 'seg' should be passed as NULL; we remove such files on
+ * proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * No fileset can have been registered before we register the
+			 * cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on the process exit.  This will
+ * process the list of all the registered sharedfilesets and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is using the dsm-based cleanup then we don't maintain
+	 * the filesetlist, so there is nothing to do.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1
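
Putting the sharedfileset.c changes above together, a backend-local caller
would do something like this (make_backend_local_fileset is a hypothetical
helper, not part of the patch):

#include "postgres.h"

#include "storage/sharedfileset.h"

/*
 * Create a fileset usable by a single backend whose files survive across
 * transactions.  Passing seg = NULL registers it for cleanup at process
 * exit (via SharedFileSetDeleteOnProcExit) rather than at DSM detach.
 */
static SharedFileSet *
make_backend_local_fileset(void)
{
	SharedFileSet *fileset;

	/* allocate in a long-lived memory context in real code */
	fileset = palloc(sizeof(SharedFileSet));
	SharedFileSetInit(fileset, NULL);

	return fileset;
}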

v41-0004-Add-support-for-streaming-to-built-in-replicatio.patch

From e07bb7ed2f57373c52499b778c4d5a0702bba1d5 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v41 4/7] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming during replication slot
creation, even if the plugin supports it. We don't need to replicate
the changes accumulated during this phase, and moreover we don't have a
replication connection open, so we would have nowhere to send the data
anyway.
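
For orientation, the new top-level protocol message bytes map to stream
handling as follows (an illustrative helper only, not the patch's code;
the real dispatch is in apply_dispatch in worker.c below):

/* Names the new stream-related protocol messages. */
static const char *
stream_action_name(char action)
{
	switch (action)
	{
		case 'S':
			return "STREAM START";	/* top-level XID + first-segment flag */
		case 'E':
			return "STREAM END";	/* closes the current block of changes */
		case 'A':
			return "STREAM ABORT";	/* top-level XID + aborted subxact XID */
		case 'c':
			return "STREAM COMMIT"; /* XID + flags + commit/end LSN + time */
		default:
			return "regular message";
	}
}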
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 946 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 19 files changed, 2060 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68..479e3ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e905723..a6101ac 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..ff25924 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
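
To make the subxact handling in the worker.c changes below easier to
follow, here is a standalone model (simplified types, no PostgreSQL
headers, error handling omitted, and BufFile's (fileno, offset) pair
collapsed into one offset for brevity) of the bookkeeping the apply worker
performs: record the spool-file offset at each subxact's first change, and
on subxact abort truncate back to that offset and forget later subxacts.

#include <stdint.h>

typedef struct SubXactOff
{
	uint32_t	xid;			/* subtransaction XID */
	long		offset;			/* spool-file offset of its first change */
} SubXactOff;

/*
 * Handle abort of subxact 'xid': scan from the tail (recent subxacts abort
 * most often), drop it and everything recorded after it, and return the
 * offset the spool file should be truncated to; -1 if the subxact spooled
 * nothing (an empty subtransaction).
 */
static long
abort_subxact(SubXactOff *subxacts, uint32_t *nsubxacts, uint32_t xid)
{
	for (uint32_t i = *nsubxacts; i > 0; i--)
	{
		if (subxacts[i - 1].xid == xid)
		{
			long		off = subxacts[i - 1].offset;

			*nsubxacts = i - 1; /* discard this subxact and later ones */
			return off;
		}
	}
	return -1;
}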
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 2fcf2e6..98e7fd0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, streamed transactions require us
+ * to handle aborts of both the toplevel transaction and subtransactions.
+ * This is achieved by tracking offsets for subtransactions, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * subscription.  This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size limit,
+ * (b) it provides automatic cleanup on error, and (c) it allows the files to
+ * survive across local transactions and to be opened and closed at each
+ * stream start and stop.  We use the SharedFileSet infrastructure because
+ * without it the files would be deleted as soon as they are closed, and
+ * keeping the stream files open across stream start/stop would consume a
+ * lot of memory (more than 8K each).  Moreover, without SharedFileSet we
+ * would need to invent a new way to pass filenames to the BufFile APIs so
+ * that the same file could be reopened across multiple stream-open calls
+ * for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, and along with it create the streaming file and store the
+ * fileset handle.  The subxact file is created iff there is any subxact info
+ * under this xid.  This entry is used on subsequent streams for the xid to
+ * get the corresponding
+ * fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with the shared
+ * file sets for the streaming and subxact files.  On every stream start we
+ * need to open the xid's files, and for that we need the shared file set
+ * handle, so storing it in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the change to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +752,323 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * An ORIGIN message can only come inside a remote transaction or a
+	 * streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the buffile,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		/* the TransactionId key is binary data, hence HASH_BLOBS */
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware that we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1081,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1099,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1138,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1256,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1408,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1781,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1922,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2050,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  It is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2162,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2435,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option is changed.  The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because subscription's streaming option were changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2481,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it is not already created, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info.  There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context.  We need
+	 * this information for the whole stream so that we can add new
+	 * subtransaction info to it.  On stream stop we will flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context so that
+	 * we have those files until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2145,6 +3080,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
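
To make the on-disk formats above concrete, here is a minimal standalone sketch of a reader for the per-transaction "<subid>-<xid>.changes" file. This is an illustration only, under stated assumptions: it uses plain stdio instead of BufFile (so it ignores BufFile's segmenting), it assumes the native-endian int lengths that BufFileWrite() produces, and it merely skips each message payload instead of decoding it. The companion subxact file is simpler still: a count followed by an array of fixed-size SubXactInfo entries, as subxact_info_write() shows.

#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch: walk a changes file as framed by stream_write_change() above:
 * an int length (counting the action byte but not the length field
 * itself), one action character, then the message payload.
 */
static int
read_one_change(FILE *fp)
{
	int		len;
	char	action;

	if (fread(&len, sizeof(len), 1, fp) != 1)
		return 0;				/* clean EOF between records */
	if (fread(&action, sizeof(action), 1, fp) != 1)
		return -1;				/* truncated record */

	/* skip the payload; 'len' includes the action byte already read */
	if (fseek(fp, len - (long) sizeof(char), SEEK_CUR) != 0)
		return -1;

	printf("action '%c', %d payload bytes\n", action, len - 1);
	return 1;
}

int
main(int argc, char **argv)
{
	FILE	   *fp;
	int			rc = 0;

	if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL)
		return EXIT_FAILURE;

	while ((rc = read_one_change(fp)) > 0)
		;
	fclose(fp);
	return (rc < 0) ? EXIT_FAILURE : EXIT_SUCCESS;
}
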
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order the transactions are sent in.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each
+ * (sub)transaction so that we don't lose the schema information on abort.
+ * To handle this, we maintain a list of xids (streamed_txns) for which we
+ * have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on or off */
+			if (strcmp(strVal(defel->arg), "on") != 0 &&
+				strcmp(strVal(defel->arg), "off") != 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value \"%s\"",
+								strVal(defel->arg))));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable the streaming during the slot initialization mode. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may not be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema of this relation was already sent in the given
+ * streamed transaction.  We expect a relatively small number of streamed
+ * transactions, so the linear search over the list is acceptable.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record the xid in the rel sync entry, to remember that we have already
+ * sent the schema of the relation within this streamed transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
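
For clarity, the three-way decision in pgoutput_startup() above can be distilled into a single predicate. The sketch below is illustrative rather than the patch's code; decide_streaming() and PROTO_STREAM_MIN are made-up names standing in for the inline checks and LOGICALREP_PROTO_STREAM_VERSION_NUM.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define PROTO_STREAM_MIN	2	/* stands in for LOGICALREP_PROTO_STREAM_VERSION_NUM */

/*
 * Streaming is used only when the subscriber requested it, the
 * negotiated protocol is new enough, and the decoding context says the
 * plugin wired up all the stream callbacks (ctx->streaming).
 */
static bool
decide_streaming(bool requested, int proto_version, bool plugin_supports)
{
	if (!requested)
		return false;			/* disabled by default */
	if (proto_version < PROTO_STREAM_MIN)
	{
		fprintf(stderr, "proto_version=%d does not support streaming\n",
				proto_version);
		exit(EXIT_FAILURE);
	}
	if (!plugin_supports)
	{
		fprintf(stderr, "streaming requested but not supported\n");
		exit(EXIT_FAILURE);
	}
	return true;
}

int
main(void)
{
	printf("requested, v2, supported -> %d\n", decide_streaming(true, 2, true));
	printf("not requested            -> %d\n", decide_streaming(false, 1, false));
	return 0;
}

Once streaming is enabled, each chunk of a large transaction arrives as stream_start, then the changes, then stream_stop; only the final stream_commit or stream_abort arrives outside a chunk, which is what the Assert(!in_streaming) checks in those callbacks enforce.
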
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;                 /* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..655144d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
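
As a rough illustration of how these write/read pairs mirror each other, here is a toy round-trip of a stream-start message body. The real wire encoding is whatever logicalrep_write_stream_start() emits; this standalone sketch merely assumes a 4-byte XID followed by a one-byte first-segment flag, matching the (xid, first_segment) parameters declared above.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy encoder: 4-byte XID, then a one-byte first-segment flag. */
static size_t
write_stream_start(uint8_t *buf, uint32_t xid, int first_segment)
{
	memcpy(buf, &xid, sizeof(xid));
	buf[sizeof(xid)] = first_segment ? 1 : 0;
	return sizeof(xid) + 1;
}

/* Matching decoder, returning the XID and the flag. */
static uint32_t
read_stream_start(const uint8_t *buf, int *first_segment)
{
	uint32_t	xid;

	memcpy(&xid, buf, sizeof(xid));
	*first_segment = (buf[sizeof(xid)] != 0);
	return xid;
}

int
main(void)
{
	uint8_t		buf[8];
	int			first = 0;

	write_stream_start(buf, 1234, 1);
	assert(read_stream_start(buf, &first) == 1234);
	assert(first == 1);
	puts("stream-start round-trip ok");
	return 0;
}
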
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
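+# 2 preexisting rows + 4998 inserted = 5000, and the DELETE removes the
+# 1666 rows where a is divisible by 3, leaving 3334.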
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of a large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of a large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
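+# ROLLBACK TO s1 discards rows 501..2500, so the 2 preexisting rows plus
+# rows 3..500 and 2501..3000 survive: 1000 rows, none with column c set.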
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction with binary mode enabled
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v41-0005-Enable-streaming-for-all-subscription-TAP-tests.patch

From e8e883a3f0b2e075fbcd1d744aa5826f6a313492 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v41 5/7] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b5..21410fa 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v41-0006-Add-TAP-test-for-streaming-vs.-DDL.patch

From 0b9712a5875dfa79449534a27826752963eb24f9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v41 6/7] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check data replicated through the large streamed transaction with DDL');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v41-0007-Add-streaming-option-in-pg_dump.patch

From fbb60a7eb79422f62d03b010535383adcf276b65 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v41 7/7] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 94459b3..f69d64c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char       *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1
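
For context, a sketch of the statement pg_dump would emit for a
subscription with substream = 't' after this patch (the object names and
connection string below are placeholders, not taken from the patch):

  CREATE SUBSCRIPTION sub1 CONNECTION 'host=publisher dbname=postgres'
      PUBLICATION pub1 WITH (connect = false, slot_name = 'sub1', streaming = on);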

#453Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#452)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jul 24, 2020 at 5:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jul 23, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > You are right. I have changed it.
> >
> > Thanks, I have pushed the second patch in this series, which is
> > 0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest
> > patch. I will continue working on the remaining patches.
>
> I have reviewed and made a number of changes in the next patch, which
> extends the logical decoding output plugin API with stream methods
> (v41-0001-Extend-the-logical-decoding-output-plugin-API-wi).
>
> 1. I think we need handling of include_xids and include_timestamp, but
> not skip_empty_xacts, in the new APIs; as of now, none of these options
> is respected. We need 'include_xids' handling because we need to
> include the xid with stream messages, and similarly 'include_timestamp'
> for stream commit messages. OTOH, I think we never use streaming mode
> for empty xacts, so we don't need to bother about skip_empty_xacts in
> the streaming APIs.
> 2. Then I made a number of changes to the documentation and comments,
> plus other cosmetic cleanups.
>
> Kindly review/test and let me know if you see any problems with the
> above changes.

Your changes look fine to me. Additionally, I have changed the test case
for getting the streamed changes in 0002: instead of just showing the
count, it now shows that the transaction is actually being streamed.
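
For reference, a minimal sketch of that check with test_decoding, as
exercised by the stream test added in 0002 (output abbreviated):

  SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
                                               'include-xids', '0',
                                               'stream-changes', '1');
  -- opening a streamed block for transaction
  -- streaming change for transaction
  -- ...
  -- closing a streamed block for transaction
  -- committing streamed transaction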

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v42.tar (application/x-tar)
v42/v42-0002-Implement-streaming-mode-in-ReorderBuffer.patch

From 6bfc0bc86f31e52ac0f6b52c5110c1f684df4549 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v42 2/7] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast or speculative insert, we spill to disk because we cannot
generate the complete tuple and stream it. As soon as we get the complete
tuple, we stream the transaction, including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with their top-level xact) in WAL right away, and
thanks to logging the invalidation messages at each command end.

Now that we can stream in-progress transactions, concurrent aborts may
cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from the system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. On receipt of this
sqlerrcode, the decoding logic aborts the ongoing decoding and returns
gracefully.

We have a ReorderBufferTXN pointer in each ReorderBufferChange, by which
we know which xact it belongs to.  The output plugin can use this to
decide which changes to discard in case of stream_abort_cb (e.g. when a
subxact gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.
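
For example, with the test_decoding plugin this looks as follows (a
sketch; the full example, with output, is in the test-decoding.sgml
change below):

  SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
                                            'stream-changes', '1');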

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/Makefile                |   2 +-
 contrib/test_decoding/expected/stream.out     | 196 ++++
 contrib/test_decoding/expected/truncate.out   |   6 +
 contrib/test_decoding/sql/stream.sql          |  21 +
 contrib/test_decoding/sql/truncate.sql        |   1 +
 contrib/test_decoding/test_decoding.c         |  13 +
 doc/src/sgml/logicaldecoding.sgml             |   9 +-
 doc/src/sgml/test-decoding.sgml               |  22 +
 src/backend/access/heap/heapam.c              |  13 +
 src/backend/access/heap/heapam_visibility.c   |  42 +-
 src/backend/access/index/genam.c              |  53 +
 src/backend/access/table/tableam.c            |   8 +
 src/backend/access/transam/xact.c             |  19 +
 src/backend/replication/logical/decode.c      |  17 +-
 src/backend/replication/logical/logical.c     |  10 +
 .../replication/logical/reorderbuffer.c       | 969 ++++++++++++++++--
 src/include/access/heapam_xlog.h              |   1 +
 src/include/access/tableam.h                  |  55 +
 src/include/access/xact.h                     |   4 +
 src/include/replication/logical.h             |   1 +
 src/include/replication/reorderbuffer.h       |  56 +-
 21 files changed, 1412 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c582a5..ed9a3d6c0e 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000000..272aab2de8
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,196 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ opening a streamed block for transaction
+ streaming message: transactional: 1 prefix: test, sz: 50
+ closing a streamed block for transaction
+ aborting streamed (sub)transaction
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ committing streamed transaction
+(157 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae835c..e64d377214 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000000..73c5c987da
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 500) || g.i FROM generate_series(1, 150) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 150) g(i);
+COMMIT;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0881..5633854e0d 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index dbef52a3af..d8e2b416c6 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 791a62b57c..1571d71a5b 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..fe7c9783fa 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4cd46..33a45800b3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at the
+	 * tableam API level, but this function is called from many places, so we
+	 * need to ensure the check here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1945,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..c77128087c 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..9d9a70a354 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out, if CheckXidAlive is aborted. We can't directly use
+ * TransactionIdDidAbort as after crash such transaction might not have been
+ * marked as aborted.  See detailed comments in xact.c where the variable
+ * is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29559..a61e279d68 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -234,6 +234,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29847..99722eea4b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -82,6 +82,19 @@ bool		XactDeferrable;
 
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure this,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for the
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
 /*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f3a1c31a29..f21f61d5e1 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 05d24b93da..42f284b33f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ce6e62152f..c469536b5f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes, so if we have a partial change like a
+ * toast table insert or a speculative insert then we mark such a 'txn' so that
+ * it can't be streamed.  We also ensure that if the changes in such a 'txn'
+ * exceed the logical_decoding_work_mem threshold then we stream them as soon
+ * as we have a complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get the insert or update on the
+	 * main table (both update and insert will do the insert in the toast table).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * Speculative confirm change must be preceded by speculative insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it was serialized before and the changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for streaming such a transaction as soon as we get the
+	 * complete change for it is that it would previously have reached the
+	 * memory threshold and wouldn't get streamed because of the incomplete
+	 * changes.  Delaying such transactions would increase the apply lag
+	 * for them.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed when we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes we detected that the transaction
+	 * was aborted, so there is no point in collecting further changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -763,6 +883,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1309,6 +1467,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Clean up the snapshot from the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1334,6 +1501,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is,
+	 * when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1485,57 +1737,178 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such case if the
+ * (sub)transaction has catalog update then we might decode the tuple using
+ * wrong catalog version.  So for detecting the concurrent abort we set
+ * CheckXidAlive to the current (sub)transaction's xid for which this change
+ * belongs to.  And, during catalog scan we can check the status of the xid and
+ * if it is aborted we will report a specific error so that we can stop
+ * streaming current transaction and discard the already streamed changes on
+ * such an error.  We might have already streamed some of the changes for the
+ * aborted (sub)transaction, but that is fine because when we decode the abort
+ * we will stream abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive then there
+	 * is nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet. We don't check
+	 * whether the xid has aborted; that will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Store the command id and snapshot at the end of the current stream so
+ * that we can reuse them while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN such that it
+ * can be used to stream the remaining data of the transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using the stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1931,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1947,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream_start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2053,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2090,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2119,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2152,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2164,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2195,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2241,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2249,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, send the last message for this set of
+		 * changes depending upon streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2295,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2334,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read, for both streamed
+ * and non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1931,6 +2460,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that loaded the
+		 * cache as per this transaction's view (consider DDLs that happened
+		 * in this transaction). We don't want the decoding of future
+		 * transactions to use those cache entries, so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2545,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2631,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2680,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change.  The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit; the transaction counter allows us
+ * to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2702,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2715,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming supported, update the total size in top level as well. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2387,6 +2970,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because with streaming we don't
+ * update the memory accounting for subtransactions, so their size is always
+ * 0).  But here we can simply iterate over the limited number of toplevel
+ * transactions.
+ *
+ * Note that we skip transactions that contain incomplete changes.  There is
+ * scope for optimization here such that we could select the largest
+ * transaction which has complete changes.  But that would make the code and
+ * design quite complex, and it might not be worth the benefit.  If we plan
+ * to stream transactions that contain incomplete changes then we need to
+ * find a way to partially stream/truncate the transaction changes in memory
+ * and build a mechanism to partially truncate the spilled files.
+ * Additionally, whenever we partially stream a transaction we need to
+ * maintain the last streamed lsn, and next time restore from that segment
+ * and offset in WAL.  As we stream the changes from the top transaction and
+ * restore them subtransaction-wise, we even need to remember the subxact
+ * from which we streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
@@ -2419,11 +3047,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3151,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3363,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * We can't start streaming immediately even if streaming is enabled,
+	 * because we might have previously decoded this transaction and now
+	 * just be restarting.
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is still in progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all subtransactions to the
+	 * snapshot's xip array via SnapBuildCommittedTxn, we can't do that here;
+	 * instead we add them to the subxip array via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in subtransactions decoded so
+	 * far to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run.  We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there's no need to
+		 * walk through the subxacts again).  In fact, we must not do that,
+		 * as we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because after the last
+		 * streaming run we might have gotten some new subtransactions, and
+		 * we need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3593,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4302,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4592,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0d28f01ca9..b188427563 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 53480116a4..c18554bae2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef31825d..b0fae9808b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817648..1ae17d5f11 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -248,6 +285,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of the top transaction, including subtransactions. */
+	Size		total_size;
+
+	/* If we have detected a concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0
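
Before moving on, it may help to see the eviction decision from
ReorderBufferCheckMemoryLimit in isolation. Below is a minimal standalone C
sketch of the policy implemented above, once logical_decoding_work_mem is
exceeded: prefer streaming the largest toplevel transaction that has no
incomplete (toast or speculative) changes, and fall back to spilling to disk
when no such transaction exists. All types and helpers here (Txn,
largest_streamable, the hard-coded sizes) are simplified stand-ins for
illustration, not the actual PostgreSQL structures.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct Txn
{
	const char *name;
	size_t		total_size;		/* bytes buffered, including subxacts */
	bool		has_incomplete; /* pending toast/speculative change? */
} Txn;

/* Mimics ReorderBufferLargestTopTXN: the largest complete toplevel txn. */
static Txn *
largest_streamable(Txn *txns, int n)
{
	Txn		   *largest = NULL;

	for (int i = 0; i < n; i++)
	{
		if (txns[i].has_incomplete || txns[i].total_size == 0)
			continue;
		if (largest == NULL || txns[i].total_size > largest->total_size)
			largest = &txns[i];
	}
	return largest;
}

int
main(void)
{
	Txn			txns[] = {
		{"txn1", 900, false},
		{"txn2", 1500, true},	/* incomplete, so never streamed */
		{"txn3", 1200, false},
	};
	size_t		used = 3600;	/* ~ rb->size */
	size_t		limit = 1000;	/* ~ logical_decoding_work_mem */

	while (used >= limit)
	{
		Txn		   *victim = largest_streamable(txns, 3);

		if (victim != NULL)
		{
			/* ~ ReorderBufferStreamTXN: send downstream, then truncate */
			printf("stream %s (%zu bytes)\n", victim->name, victim->total_size);
			used -= victim->total_size;
			victim->total_size = 0;
		}
		else
		{
			/* ~ ReorderBufferSerializeTXN on the largest (sub)txn;
			 * in this toy setup that is simply txn2, the only one left */
			printf("spill %s (%zu bytes) to disk\n", txns[1].name,
				   txns[1].total_size);
			used -= txns[1].total_size;
			txns[1].total_size = 0;
		}
	}
	return 0;
}

Streaming is preferred because it both frees memory and ships the data
downstream immediately, whereas spilling merely trades memory for disk and
still sends everything at commit time.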

v42/v42-0006-Add-TAP-test-for-streaming-vs.-DDL.patch
From ad78f92f84ccb4e3a0557cf118d088cb7ece022a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v42 6/7] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0
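
The concurrent-abort machinery above (SetupCheckXidLive plus the
ERRCODE_TRANSACTION_ROLLBACK branch in ReorderBufferProcessTXN's PG_CATCH
block) is easier to follow outside the diff. Here is a minimal standalone C
sketch of that control flow, with setjmp/longjmp standing in for
PG_TRY/PG_CATCH; every name in it (catalog_access, xid_aborted_after, the
xid value) is a made-up stand-in rather than a PostgreSQL API.

#include <setjmp.h>
#include <stdbool.h>
#include <stdio.h>

static jmp_buf catch_buf;		/* ~ PG_TRY/PG_CATCH */
static unsigned CheckXidAlive;	/* xid being streamed, 0 = invalid */

/* Simulate a concurrent ROLLBACK arriving after two changes. */
static bool
xid_aborted_after(unsigned xid)
{
	static int	calls = 0;

	(void) xid;
	return ++calls > 2;
}

/* ~ a catalog scan: verify CheckXidAlive before trusting the catalog. */
static void
catalog_access(void)
{
	if (CheckXidAlive != 0 && xid_aborted_after(CheckXidAlive))
		longjmp(catch_buf, 1);	/* ~ ereport with ERRCODE_TRANSACTION_ROLLBACK */
}

int
main(void)
{
	CheckXidAlive = 724;		/* ~ SetupCheckXidLive(change->txn->xid) */

	if (setjmp(catch_buf) == 0)
	{
		for (int change = 1; change <= 5; change++)
		{
			catalog_access();
			printf("streamed change %d\n", change);
		}
		printf("stream_stop: end of this run\n");
	}
	else
	{
		/*
		 * ~ ReorderBufferResetTXN: truncate the changes streamed so far,
		 * send stream_stop, and remember the snapshot/command id.  The
		 * later DecodeAbort sends the stream_abort message.
		 */
		printf("concurrent abort: truncate + stream_stop\n");
	}

	CheckXidAlive = 0;			/* reset, as on the normal exit path */
	return 0;
}

The important property is that the error is recoverable: the already-streamed
prefix is discarded locally, and the subscriber's copy is truncated later by
the stream_abort message, so tuples are never decoded against a catalog
version they should not see.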

v42/v42-0004-Add-support-for-streaming-to-built-in-replicatio.patch
From e4eeebf36fd51b3c3d2a110eb8a219879d1f3c3e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v42 4/7] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information
(e.g. XIDs of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying it on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere to
send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml      |   5 +-
 doc/src/sgml/ref/create_subscription.sgml     |  11 +
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/commands/subscriptioncmds.c       |  49 +-
 src/backend/postmaster/pgstat.c               |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |   4 +
 src/backend/replication/logical/proto.c       | 140 ++-
 src/backend/replication/logical/worker.c      | 946 +++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c   | 348 ++++++-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/pgstat.h                          |   6 +-
 src/include/replication/logicalproto.h        |  46 +-
 src/include/replication/walreceiver.h         |   1 +
 src/test/subscription/t/009_stream_simple.pl  |  86 ++
 src/test/subscription/t/010_stream_subxact.pl | 102 ++
 src/test/subscription/t/011_stream_ddl.pl     |  95 ++
 .../t/012_stream_subxact_abort.pl             |  82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |  84 ++
 src/test/subscription/t/015_stream_binary.pl  |  86 ++
 19 files changed, 2060 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70cdf..a81bd54efc 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c54fe..b7d7457d00 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf0c6..311d46225a 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377a85..4c58ad8b07 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68671..479e3cadf9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
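
These wait events cover reads and writes of the serialized changes and
subxact files. As a hedged sketch only (assuming the usual reporting pattern
around raw file I/O; whether this patch brackets every call exactly this way
is not shown in the hunk above), the read side would look roughly like:

    int		nread;

    pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_READ);
    nread = read(fd, &nsubxacts, sizeof(nsubxacts));
    pgstat_report_wait_end();

    if (nread != sizeof(nsubxacts))
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not read from streaming transaction's subxact file: %m")));

so a stalled apply worker shows up under these names in
pg_stat_activity.wait_event.
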
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e9057230e4..a6101ace30 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097bf5..ff25924e68 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
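
One thing worth calling out about the per-change messages above: the XID
prefix is optional and the messages carry no flag announcing it, so the
reader must know from its own state whether it is inside a stream block. A
minimal sketch of the apply-side convention (the helper name is mine, not
the patch's; the patch does the equivalent inline in
handle_streamed_transaction below):

    static TransactionId
    maybe_read_stream_xid(StringInfo s, bool in_stream_block)
    {
        /* Outside a stream block the messages carry no XID prefix. */
        if (!in_stream_block)
            return InvalidTransactionId;

        /* First field of every streamed change: XID of the (sub)xact. */
        return pq_getmsgint(s, 4);
    }
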
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 2fcf2e61bc..98e7fd0576 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions has
+ * to cope with aborts of both the toplevel transaction and individual
+ * subtransactions.  This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size limit,
+ * (b) it provides automatic cleanup on error, and (c) it allows the files to
+ * survive across local transactions, so they can be opened and closed at each
+ * stream start and stop.  We use the SharedFileSet infrastructure because
+ * without it the file would be deleted as soon as the BufFile is closed, and
+ * keeping the stream files open across start/stop would consume a lot of
+ * memory (more than 8kB per file).  Moreover, without SharedFileSet we would
+ * need to invent a new way to pass filenames to the BufFile APIs so that the
+ * desired file could be reopened across multiple stream-open calls for the
+ * same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, create the streaming file, and store the fileset handle.  The
+ * subxact file is created iff there is any subxact info under this xid.  On
+ * subsequent streams for the same xid, this entry is used to look up the
+ * corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table storing the streaming xid information along with the shared
+ * filesets for the streaming and subxact files.  On every stream start we
+ * need to open the xid's files, and for that we need the shared fileset
+ * handles, so storing them in a hash keyed by xid makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,16 +752,322 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * The ORIGIN message can only come inside a remote transaction or inside
+	 * a streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the BufFile,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty subtransaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply handlers (invoked via apply_dispatch) are aware
+	 * we're in a remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from the correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -635,6 +1081,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1099,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1138,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1256,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1408,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1781,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1922,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2050,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  It is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2162,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2435,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option was changed.  The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2481,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for the top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not exist yet, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info.  There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context.  We need
+	 * this information for the whole stream so that we can keep adding new
+	 * subtransaction info to it.  On stream stop we flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts.  We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as the offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * BufFile, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context so that
+	 * they remain available until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: the length (not including
+ * the length field itself), the action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Clean up the memory for the subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2145,6 +3080,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
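
To summarize the spool-file format implemented above, here is the layout of
one record in the <subid>-<xid>.changes file, as written by
stream_write_change() and read back by apply_handle_stream_commit() (a
descriptive sketch only; the file is written field by field, there is no
such struct in the patch):

    /*
     * int32  len;              1 (action byte) + payload size; the len
     *                          field itself is not counted
     * char   action;           'I', 'U', 'D', 'T', 'R' or 'Y'
     * char   payload[len - 1]; the original message, minus the XID prefix
     *
     * Records are simply concatenated; a zero-byte read of len means EOF,
     * i.e. the end of the transaction's changes.
     */
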
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc4c1..3360bd5dd0 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order the transactions are sent in.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this,
+ * we maintain a list of xids (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming.  It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficiently recent protocol
+		 * version, and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change.  We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema?  We track streamed transactions
+	 * separately, because those may be applied at a later time (and the
+	 * regular transactions won't see their effects until then) and in an
+	 * order that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -605,6 +743,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
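+/*
+ * Open a block of streamed changes for the transaction.  For the first
+ * block of a transaction this also sends the replication origin, if any.
+ */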
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
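+/*
+ * Close the currently open block of streamed changes.
+ */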
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -641,6 +886,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions, so a simple
+ * linear search of the list is cheap enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Remember that we have already sent the schema of the relation in this
+ * streamed (toplevel) transaction, by adding its xid to the entry's list.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -771,11 +1048,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35000..1d091546bf 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming in-progress transactions */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1edf..0dfbac46b4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc85c..655144d03a 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbee54..6c0a4e30a8 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
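+# Use a low logical_decoding_work_mem so that the large transaction below
+# exceeds the limit and is streamed to the subscriber.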
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check data replicated after interleaved DDL and DML');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back subtransaction changes (including DDL) are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000000..fa2362e32b
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v42/v42-0007-Add-streaming-option-in-pg_dump.patch:

From 202716074e871612c1471f6fd7ee41cde25be7d7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v42 7/7] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 94459b3539..f69d64cd16 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b731b1..cc10c7c1cc 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char	   *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
2.23.0

v42/v42-0001-Extend-the-logical-decoding-output-plugin-API-wi.patch:

From f311ae1ba840d2540a39d7cf6e18c5fe61d9d2b3 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v42 1/7] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, adding support for
streaming changes for large in-progress transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk of changes
streamed for a particular toplevel transaction.

This commit simply adds these new APIs and the upcoming patch to "allow
the streaming mode in ReorderBuffer" will use these APIs.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 176 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 ++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 +++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 878 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..dbef52a3af 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,150 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+/*
+ * We never try to stream any empty xact so we don't need any special handling
+ * for skip_empty_xacts in streaming mode APIs.
+ */
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "opening a streamed block for transaction");
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * We never try to stream any empty xact so we don't need any special handling
+ * for skip_empty_xacts in streaming mode APIs.
+ */
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "closing a streamed block for transaction");
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * We never try to stream any empty xact so we don't need any special handling
+ * for skip_empty_xacts in streaming mode APIs.
+ */
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "aborting streamed (sub)transaction");
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * We never try to stream any empty xact so we don't need any special handling
+ * for skip_empty_xacts in streaming mode APIs.
+ */
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "committing streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "streaming change for transaction");
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "streaming truncate for transaction");
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93cf6b..791a62b57c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_start_cb</function>,
+     <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                           ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                             ReorderBufferTXN *txn,
+                                             XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                             ReorderBufferTXN *txn,
+                                             Relation relation,
+                                             ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr message_lsn,
+                                              bool transactional,
+                                              const char *prefix,
+                                              Size message_size,
+                                              const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               int nrelations,
+                                               Relation relations[],
+                                               ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_start_cb</function>, <function>stream_stop_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_commit_cb</function>
+    and <function>stream_change_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
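+
+   <para>
+    For illustration, a plugin could register the streaming callbacks in its
+    <function>_PG_output_plugin_init</function> function along the following
+    lines (a sketch only; the <literal>my_stream_*</literal> names are
+    placeholders for the plugin's own functions):
+<programlisting>
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+    cb->stream_start_cb = my_stream_start;        /* required */
+    cb->stream_stop_cb = my_stream_stop;          /* required */
+    cb->stream_abort_cb = my_stream_abort;        /* required */
+    cb->stream_commit_cb = my_stream_commit;      /* required */
+    cb->stream_change_cb = my_stream_change;      /* required */
+    cb->stream_message_cb = my_stream_message;    /* optional */
+    cb->stream_truncate_cb = my_stream_truncate;  /* optional */
+}
+</programlisting>
+   </para>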
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting. At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may exceed the memory limit before having
+    decoded a complete tuple, e.g. when we have decoded an insert into a TOAST
+    table but not yet the corresponding main table insert.
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be3b0..05d24b93da 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. We however enable streaming when at least one
+	 * of the methods is enabled so that we can easily identify missing
+	 * methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this message's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this message's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..b78c796450 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if the transaction is streamed
+ * in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when done streaming a block of changes from an in-progress
+ * transaction to a remote node (may be called repeatedly, if the
+ * transaction is streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1055e99e2e..42bc817648 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -386,6 +434,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0
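
To make the new API concrete, here is a minimal sketch (not part of the patch
series) of an output plugin wiring up the streaming callbacks added above. The
my_stream_* names are hypothetical and the regular non-streaming callbacks are
omitted; only the OutputPluginCallbacks fields and callback signatures come
from the patch itself.

/*
 * Minimal sketch of an output plugin registering the streaming callbacks.
 * The my_stream_* function names are hypothetical.
 */
#include "postgres.h"

#include "fmgr.h"
#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* demarcate the start of one block of streamed changes */
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "opening a streamed block for transaction %u",
					 txn->xid);
	OutputPluginWrite(ctx, true);
}

static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* demarcate the end of one block of streamed changes */
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "closing a streamed block for transaction %u",
					 txn->xid);
	OutputPluginWrite(ctx, true);
}

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	/* the streamed (sub)transaction rolled back; discard its changes */
}

static void
my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 XLogRecPtr commit_lsn)
{
	/* the streamed transaction committed; its changes can now be applied */
}

static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)
{
	/* emit a single change of the in-progress transaction */
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* the regular (non-streaming) callbacks are omitted for brevity */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;
	/* stream_message_cb and stream_truncate_cb may be left NULL */
}

With these five callbacks set, StartupDecodingContext flags the context as
streaming-capable; if a required callback were left NULL while any other
stream callback is provided, the corresponding wrapper raises the "logical
streaming requires a ..." error at the first streamed block.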

v42/v42-0003-Extend-the-BufFile-interface-for-the-streaming-o.patch

From 18afc9fdf67f173cdc73ccdc8ad5d1f1e4c1e3f8 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v42 3/7] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up
to a particular offset.  Extend the BufFileSeek API to support the
SEEK_END case.  Add an option to provide a mode while opening the shared
BufFiles instead of always opening in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++---
 src/backend/storage/file/fd.c             |  9 +--
 src/backend/storage/file/sharedfileset.c  | 98 +++++++++++++++++++++--
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2da2..6c97f68671 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 2d7a082320..a9ca5d929c 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as
+ * a member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The file size of the last file gives us the end offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files up to the fileno to which we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files other than the fileno can be deleted directly.  If the
+		 * offset is 0 then the fileno file can be deleted as well, unless
+		 * it is the first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420efb2..f376a97ed6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594756..9a3dc102f5 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but need to be opened and closed multiple times, and the
+ * underlying files need to survive across transactions.  For such cases,
+ * the dsm segment 'seg' should be passed as NULL.  Such files are removed
+ * on proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering
+			 * the cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -222,6 +254,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 		SharedFileSetDeleteAll(fileset);
 }
 
+/*
+ * Callback function that will be invoked on the process exit.  This will
+ * process the list of all the registered sharedfilesets and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries. */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is using the dsm-based cleanup, we don't maintain the
+	 * filesetlist, so there is nothing to unregister.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
 /*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59c50..788815cdab 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a4303b..b83fb50dac 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..807a9c1edf 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752bab0d..fc34c49522 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d7df..e209f047e8 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf077e5..d5edb600af 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
2.23.0
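
To illustrate the extended interface, here is a sketch (not part of the
patch; the fileset and file names are made up) of how a single backend could
combine the new mode argument, SEEK_END support, and truncation:

/*
 * Sketch of single-backend use of the extended BufFile API.  The function
 * signatures are those from the patch above; the names are hypothetical.
 */
#include "postgres.h"

#include <fcntl.h>

#include "storage/buffile.h"
#include "storage/sharedfileset.h"

static SharedFileSet fileset;

static void
buffile_truncate_example(void)
{
	BufFile    *file;
	int			fileno;
	off_t		offset;
	char		data[] = "some decoded changes";

	/* pass seg = NULL for single-backend, cross-transaction use */
	SharedFileSetInit(&fileset, NULL);

	file = BufFileCreateShared(&fileset, "xid-1234-changes");
	BufFileWrite(file, data, sizeof(data));
	BufFileClose(file);

	/* later, possibly in another transaction: reopen read-write */
	file = BufFileOpenShared(&fileset, "xid-1234-changes", O_RDWR);

	/* new SEEK_END support: position at the current end of the file */
	if (BufFileSeek(file, 0, 0, SEEK_END) != 0)
		elog(ERROR, "could not seek to end of temporary file");

	/* remember this point before appending more data */
	BufFileTell(file, &fileno, &offset);

	/* ... append changes for a subtransaction that later aborts ... */

	/* new truncate support: discard everything past the saved point */
	BufFileTruncateShared(file, fileno, offset);

	BufFileClose(file);
}

This is the pattern the streaming of in-progress transactions relies on:
changes for each streamed transaction are kept in such a file across stream
blocks, and rolling back a subtransaction only requires truncating the file
back to the offset recorded when that subtransaction started.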

v42/v42-0005-Enable-streaming-for-all-subscription-TAP-tests.patch

From 2b39a173ebfe7f43e1b8cc592d1ca3af193741d1 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v42 5/7] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318fc7c..6f7bedc130 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b552b..21410fac1c 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

#454Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#453)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Your changes look fine to me. Additionally, I have changed a test
case of getting the streaming changes in 0002. Instead of just
showing the count, I am showing that the transaction is actually
streaming.

If you want to show the changes, then there is no need to display 157
rows; a few (10-15) should be sufficient. If we can do that by
increasing the size of the rows, then good; otherwise, I think it is
better to retain the test that displays the count.

Today, I have again looked at the first patch
(v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
find any more problems with it, so I am planning to commit it unless
you or someone else wants to add more to it. Just for the ease of others,
"the next patch extends the logical decoding output plugin API with
stream methods". It adds seven methods to the output plugin API,
adding support for streaming changes for large in-progress
transactions. The methods are stream_start, stream_stop, stream_abort,
stream_commit, stream_change, stream_message, and stream_truncate.
Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction (or subtransaction) is
incomplete and may be aborted later (which is something the regular
API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these new
stream methods. The stream_start/stream_stop callbacks are used to demarcate a
chunk of changes streamed for a particular toplevel transaction.

This commit simply adds these new APIs and the upcoming patch to
"allow the streaming mode in ReorderBuffer" will use these APIs.

--
With Regards,
Amit Kapila.

#455Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#454)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Jul 25, 2020 at 5:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Your changes look fine to me. Additionally, I have changed a test
case of getting the streaming changes in 0002. Instead of just
showing the count, I am showing that the transaction is actually
streaming.

If you want to show the changes, then there is no need to display 157
rows; a few (10-15) should be sufficient. If we can do that by
increasing the size of the rows, then good; otherwise, I think it is
better to retain the test that displays the count.

I think the existing test cases also display multiple lines,
e.g. toast.out shows 235 rows. But maybe I will try to reduce it
to a smaller number of rows.

Today, I have again looked at the first patch
(v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
find any more problems with it, so I am planning to commit it unless
you or someone else wants to add more to it. Just for the ease of others,
"the next patch extends the logical decoding output plugin API with
stream methods". It adds seven methods to the output plugin API,
adding support for streaming changes for large in-progress
transactions. The methods are stream_start, stream_stop, stream_abort,
stream_commit, stream_change, stream_message, and stream_truncate.
Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction (or subtransaction) is
incomplete and may be aborted later (which is something the regular
API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these new
stream methods. The stream_start/stream_stop callbacks are used to demarcate a
chunk of changes streamed for a particular toplevel transaction.

This commit simply adds these new APIs and the upcoming patch to
"allow the streaming mode in ReorderBuffer" will use these APIs.

LGTM

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#456Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#455)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Jul 25, 2020 at 5:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Your changes look fine to me. Additionally, I have changed a test
case of getting the streaming changes in 0002. Instead of just
showing the count, I am showing that the transaction is actually
streaming.

If you want to show the changes, then there is no need to display 157
rows; a few (10-15) should be sufficient. If we can do that by
increasing the size of the rows, then good; otherwise, I think it is
better to retain the test that displays the count.

I think the existing test cases also display multiple lines,
e.g. toast.out shows 235 rows. But maybe I will try to reduce it
to a smaller number of rows.

Changed, now only 27 rows.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v43.tar (application/x-tar)

v43/v43-0006-Add-TAP-test-for-streaming-vs.-DDL.patch

From d72e41ca0e9c370c38e02c958662dd5c14cbcd22 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v43 6/7] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v43/v43-0005-Enable-streaming-for-all-subscription-TAP-tests.patch:

From 13cec6ce6ac04f90b56314c3c07e07c30da77d1b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v43 5/7] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318fc7c..6f7bedc130 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b552b..21410fac1c 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.23.0

v43/v43-0007-Add-streaming-option-in-pg_dump.patch:

From 543df3a3004bdc3f2a7552b9d2be30da706d70c2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v43 7/7] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 94459b3539..f69d64cd16 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b731b1..cc10c7c1cc 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char       *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
2.23.0

v43/v43-0004-Add-support-for-streaming-to-built-in-replicatio.patch:

From 9356dc67f2de994fdce351720fc38cf2c25cf665 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v43 4/7] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying it on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml      |   5 +-
 doc/src/sgml/ref/create_subscription.sgml     |  11 +
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/commands/subscriptioncmds.c       |  49 +-
 src/backend/postmaster/pgstat.c               |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |   4 +
 src/backend/replication/logical/proto.c       | 140 ++-
 src/backend/replication/logical/worker.c      | 946 +++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c   | 348 ++++++-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/pgstat.h                          |   6 +-
 src/include/replication/logicalproto.h        |  46 +-
 src/include/replication/walreceiver.h         |   1 +
 src/test/subscription/t/009_stream_simple.pl  |  86 ++
 src/test/subscription/t/010_stream_subxact.pl | 102 ++
 src/test/subscription/t/011_stream_ddl.pl     |  95 ++
 .../t/012_stream_subxact_abort.pl             |  82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |  84 ++
 src/test/subscription/t/015_stream_binary.pl  |  86 ++
 19 files changed, 2060 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70cdf..a81bd54efc 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal> and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c54fe..b7d7457d00 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf0c6..311d46225a 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377a85..4c58ad8b07 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68671..479e3cadf9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e9057230e4..a6101ace30 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097bf5..ff25924e68 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 2fcf2e61bc..98e7fd0576 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, applying streamed transactions
+ * has to handle aborts of both the toplevel transaction and subtransactions.
+ * This is achieved by tracking offsets for subtransactions, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the filenames
+ * include both the XID of the toplevel transaction and OID of the
+ * subscription.  This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size limit,
+ * (b) it provides automatic cleanup on error, and (c) it allows these files
+ * to survive across local transactions, so they can be opened at stream start
+ * and closed at stream stop.  We use the SharedFileSet infrastructure because
+ * without it the files are deleted as soon as they are closed, and keeping
+ * the stream files open across start/stop would consume a lot of memory
+ * (more than 8K per file).  Moreover, without SharedFileSet we would need to
+ * invent a new way to pass filenames to the BufFile APIs so that the desired
+ * file can be reopened across multiple stream-open calls for the same
+ * transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in the
+ * xidhash and along with it create the streaming file and store the fileset handle.
+ * The subxact file is created iff there is any subxact info under this xid. This
+ * entry is used on the subsequent streams for the xid to get the corresponding
+ * fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with shared file
+ * set for streaming and subxact files.  On every stream start we need to open
+ * the xid's files, and for that we need the shared file set handle.  Storing
+ * it in the xid hash makes it faster to search.
+ */
+static HTAB *xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,16 +752,322 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside remote transaction or
+	 * inside streaming transaction and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the buffile,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
+	 * just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here so just cleanup the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the handle apply_dispatch methods are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -635,6 +1081,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1099,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1138,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1256,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1408,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1781,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1922,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2050,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  It is reset at each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2162,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2435,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option has changed. The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because subscription's streaming option were changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2481,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for the top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions, there is nothing to do, but if a
+	 * subxact file already exists, delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
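+	/*
+	 * On-disk layout of the subxact file: the number of subxacts, followed
+	 * by an array of SubXactInfo entries (each holding the subxact XID and
+	 * the fileno/offset of its first change in the changes file).
+	 */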
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * that memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2, so it can simply be doubled later */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need it for the whole duration of the stream, so that we can keep
+	 * adding subtransaction info to it.  On stream stop we flush the
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (we only record the offset
+	 * of the subxact's first change).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
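+
+/*
+ * For example, a subscription with OID 16384 streaming transaction 512 uses
+ * files "16384-512.subxacts" and "16384-512.changes" (example values).
+ */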
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context, so that
+	 * we have those files until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with the length (not counting
+ * the length field itself), the action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
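+/*
+ * A sketch of the resulting record layout:
+ *
+ *	int  len     - size of the action byte plus message contents, not
+ *				   counting this length field
+ *	char action  - message type
+ *	...          - message contents, starting after the subxact XID
+ */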
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2145,6 +3080,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
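+	/* Request streaming of large in-progress transactions, if enabled. */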
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc4c1..3360bd5dd0 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
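+
+/*
+ * True while we're inside a streaming block, i.e. between the stream_start
+ * and stream_stop callbacks for an in-progress transaction.
+ */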
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this, we
+ * maintain a list of xids (streamed_txns) for which we have already sent the
+ * schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient protocol version, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may not be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -605,6 +743,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
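+/*
+ * Send the stream start message, followed by the replication origin message
+ * on the first stream of the transaction.
+ */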
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
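+/*
+ * Send the stream stop message, closing the current block of streamed
+ * changes.
+ */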
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
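+
+/*
+ * A large transaction is thus sent as a sequence of streaming blocks, each
+ * delimited by stream start/stop messages, followed by a final stream commit
+ * (or stream abort) once the transaction completes upstream.
+ */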
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -641,6 +886,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions, so a simple
+ * linear search of the list is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid to the rel sync entry, to record that we have already sent the
+ * schema of the relation in this streamed transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -771,11 +1048,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35000..1d091546bf 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1edf..0dfbac46b4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc85c..655144d03a 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbee54..6c0a4e30a8 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction containing DDL, subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
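+# Expected: 2 initial rows, plus 498 rows (ids 3..500) from before s1, plus
+# 500 rows (ids 501..1000) re-inserted after ROLLBACK TO s1.  Only the
+# re-inserted rows populate column c, which was added before s1 and
+# therefore survives the rollback, hence count(c) = 500.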
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000000..fa2362e32b
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction with a binary-mode subscription
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
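+# Expected: 5000 rows after the inserts (2 initial + 4998 new), minus the
+# 1666 rows whose key is divisible by 3; columns c and d keep their local
+# defaults on the subscriber, so all three counts are 3334.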
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.23.0

v43/v43-0001-Extend-the-logical-decoding-output-plugin-API-wi.patch:

From f311ae1ba840d2540a39d7cf6e18c5fe61d9d2b3 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 30 Jun 2020 11:01:24 +0530
Subject: [PATCH v43 1/7] Extend the logical decoding output plugin API with
 stream methods.

This adds seven methods to the output plugin API, adding support for
streaming changes for large in-progress transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop callbacks are used to demarcate a chunk of
changes streamed for a particular toplevel transaction.

This commit simply adds these new APIs; the upcoming patch to "allow
the streaming mode in ReorderBuffer" will use them.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/test_decoding.c     | 176 +++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 218 ++++++++++++++
 src/backend/replication/logical/logical.c | 351 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  69 +++++
 src/include/replication/reorderbuffer.h   |  59 ++++
 6 files changed, 878 insertions(+)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..dbef52a3af 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -62,6 +62,28 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static void pg_decode_stream_start(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn);
+static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_change(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									Relation relation,
+									ReorderBufferChange *change);
+static void pg_decode_stream_message(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn, XLogRecPtr message_lsn,
+									 bool transactional, const char *prefix,
+									 Size sz, const char *message);
+static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  int nrelations, Relation relations[],
+									  ReorderBufferChange *change);
 
 void
 _PG_init(void)
@@ -83,6 +105,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->stream_start_cb = pg_decode_stream_start;
+	cb->stream_stop_cb = pg_decode_stream_stop;
+	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_commit_cb = pg_decode_stream_commit;
+	cb->stream_change_cb = pg_decode_stream_change;
+	cb->stream_message_cb = pg_decode_stream_message;
+	cb->stream_truncate_cb = pg_decode_stream_truncate;
 }
 
 
@@ -540,3 +569,150 @@ pg_decode_message(LogicalDecodingContext *ctx,
 	appendBinaryStringInfo(ctx->out, message, sz);
 	OutputPluginWrite(ctx, true);
 }
+
+/*
+ * We never try to stream any empty xact so we don't need any special handling
+ * for skip_empty_xacts in streaming mode APIs.
+ */
+static void
+pg_decode_stream_start(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "opening a streamed block for transaction");
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * We never try to stream any empty xact so we don't need any special handling
+ * for skip_empty_xacts in streaming mode APIs.
+ */
+static void
+pg_decode_stream_stop(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "closing a streamed block for transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "closing a streamed block for transaction");
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * We never try to stream any empty xact so we don't need any special handling
+ * for skip_empty_xacts in streaming mode APIs.
+ */
+static void
+pg_decode_stream_abort(LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "aborting streamed (sub)transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "aborting streamed (sub)transaction");
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * We never try to stream any empty xact so we don't need any special handling
+ * for skip_empty_xacts in streaming mode APIs.
+ */
+static void
+pg_decode_stream_commit(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "committing streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "committing streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the changes as the transaction can abort
+ * at a later point in time.  We don't want users to see the changes until the
+ * transaction is committed.
+ */
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						Relation relation,
+						ReorderBufferChange *change)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "streaming change for transaction");
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the contents for transactional messages
+ * as the transaction can abort at a later point in time.  We don't want users to
+ * see the message contents until the transaction is committed.
+ */
+static void
+pg_decode_stream_message(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional,
+						 const char *prefix, Size sz, const char *message)
+{
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (transactional)
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu",
+						 transactional, prefix, sz);
+	}
+	else
+	{
+		appendStringInfo(ctx->out, "streaming message: transactional: %d prefix: %s, sz: %zu content:",
+						 transactional, prefix, sz);
+		appendBinaryStringInfo(ctx->out, message, sz);
+	}
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * In streaming mode, we don't display the detailed information of Truncate.
+ * See pg_decode_stream_change.
+ */
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  int nrelations, Relation relations[],
+						  ReorderBufferChange *change)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "streaming truncate for transaction");
+	OutputPluginWrite(ctx, true);
+}
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index c89f93cf6b..791a62b57c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,6 +389,13 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeStreamStartCB stream_start_cb;
+    LogicalDecodeStreamStopCB stream_stop_cb;
+    LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamCommitCB stream_commit_cb;
+    LogicalDecodeStreamChangeCB stream_change_cb;
+    LogicalDecodeStreamMessageCB stream_message_cb;
+    LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
@@ -401,6 +408,15 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      If <function>truncate_cb</function> is not set but a
      <command>TRUNCATE</command> is to be decoded, the action will be ignored.
     </para>
+
+    <para>
+     An output plugin may also define functions to support streaming of large,
+     in-progress transactions. The <function>stream_start_cb</function>,
+     <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
+     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_truncate_cb</function> are optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -679,6 +695,117 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-start">
+     <title>Stream Start Callback</title>
+     <para>
+      The <function>stream_start_cb</function> callback is called when opening
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-stop">
+     <title>Stream Stop Callback</title>
+     <para>
+      The <function>stream_stop_cb</function> callback is called when closing
+      a block of streamed changes from an in-progress transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+                                           ReorderBufferTXN *txn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort">
+     <title>Stream Abort Callback</title>
+     <para>
+      The <function>stream_abort_cb</function> callback is called to abort
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit">
+     <title>Stream Commit Callback</title>
+     <para>
+      The <function>stream_commit_cb</function> callback is called to commit
+      a previously streamed transaction.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+                                             ReorderBufferTXN *txn,
+                                             XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-change">
+     <title>Stream Change Callback</title>
+     <para>
+      The <function>stream_change_cb</function> callback is called when sending
+      a change in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The actual changes are not displayed as the transaction can abort at a later
+      point in time and we don't decode changes for aborted transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+                                             ReorderBufferTXN *txn,
+                                             Relation relation,
+                                             ReorderBufferChange *change);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-message">
+     <title>Stream Message Callback</title>
+     <para>
+      The <function>stream_message_cb</function> callback is called when sending
+      a generic message in a block of streamed changes (demarcated by
+      <function>stream_start_cb</function> and <function>stream_stop_cb</function> calls).
+      The message contents for transactional messages are not displayed as the transaction
+      can abort at a later point in time and we don't decode changes for aborted
+      transactions.
+<programlisting>
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr message_lsn,
+                                              bool transactional,
+                                              const char *prefix,
+                                              Size message_size,
+                                              const char *message);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-truncate">
+     <title>Stream Truncate Callback</title>
+     <para>
+      The <function>stream_truncate_cb</function> callback is called for a
+      <command>TRUNCATE</command> command in a block of streamed changes
+      (demarcated by <function>stream_start_cb</function> and
+      <function>stream_stop_cb</function> calls).
+<programlisting>
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               int nrelations,
+                                               Relation relations[],
+                                               ReorderBufferChange *change);
+</programlisting>
+      The parameters are analogous to the <function>stream_change_cb</function>
+      callback.  However, because <command>TRUNCATE</command> actions on
+      tables connected by foreign keys need to be executed together, this
+      callback receives an array of relations instead of just a single one.
+      See the description of the <xref linkend="sql-truncate"/> statement for
+      details.
+     </para>
+    </sect3>
+
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
@@ -747,4 +874,95 @@ OutputPluginWrite(ctx, true);
      </para>
    </note>
   </sect1>
+
+  <sect1 id="logicaldecoding-streaming">
+   <title>Streaming of Large Transactions for Logical Decoding</title>
+
+   <para>
+    The basic output plugin callbacks (e.g. <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) are only invoked when the transaction
+    actually commits. The changes are still decoded from the transaction
+    log, but are only passed to the output plugin at commit (and discarded
+    if the transaction aborts).
+   </para>
+
+   <para>
+    This means that while the decoding happens incrementally, and may spill
+    to disk to keep memory usage under control, all the decoded changes have
+    to be transmitted when the transaction finally commits (or more precisely,
+    when the commit is decoded from the transaction log). Depending on the
+    size of the transaction and network bandwidth, the transfer time may
+    significantly increase the apply lag.
+   </para>
+
+   <para>
+    To reduce the apply lag caused by large transactions, an output plugin
+    may provide additional callbacks to support incremental streaming of
+    in-progress transactions. There are multiple required streaming callbacks
+    (<function>stream_start_cb</function>, <function>stream_stop_cb</function>,
+    <function>stream_abort_cb</function>, <function>stream_commit_cb</function>
+    and <function>stream_change_cb</function>) and two optional callbacks
+    (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+   </para>
+
+   <para>
+    When streaming an in-progress transaction, the changes (and messages) are
+    streamed in blocks demarcated by <function>stream_start_cb</function>
+    and <function>stream_stop_cb</function> callbacks. Once all the decoded
+    changes are transmitted, the transaction is committed using the
+    <function>stream_commit_cb</function> callback (or possibly aborted using
+    the <function>stream_abort_cb</function> callback).
+   </para>
+
+   <para>
+    One example sequence of streaming callback calls for one transaction may
+    look like this:
+<programlisting>
+stream_start_cb(...);   &lt;-- start of first block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_message_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of first block of changes
+
+stream_start_cb(...);   &lt;-- start of second block of changes
+  stream_change_cb(...);
+  stream_change_cb(...);
+  stream_change_cb(...);
+  ...
+  stream_message_cb(...);
+  stream_change_cb(...);
+stream_stop_cb(...);    &lt;-- end of second block of changes
+
+stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+</programlisting>
+   </para>
+
+   <para>
+    The actual sequence of callback calls may be more complicated, of course.
+    There may be blocks for multiple streamed transactions, some of the
+    transactions may get aborted, etc.
+   </para>
+
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds the limit defined by the <varname>logical_decoding_work_mem</varname>
+    setting.  At that point the largest toplevel transaction (measured by the
+    amount of memory currently used for decoded changes) is selected and
+    streamed.  However, in some cases we still have to spill to disk even if
+    streaming is enabled, because we may exceed the memory limit before the
+    complete tuple has been decoded (e.g. we have decoded the TOAST table
+    insert but not yet the corresponding main-table insert).
+   </para>
+
+   <para>
+    Even when streaming large transactions, the changes are still applied in
+    commit order, preserving the same guarantees as the non-streaming mode.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be3b0..05d24b93da 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -65,6 +65,23 @@ static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr message_lsn, bool transactional,
 							   const char *prefix, Size message_size, const char *message);
 
+/* streaming callbacks */
+static void stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr first_lsn);
+static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								   XLogRecPtr last_lsn);
+static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr abort_lsn);
+static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 Relation relation, ReorderBufferChange *change);
+static void stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr message_lsn, bool transactional,
+									  const char *prefix, Size message_size, const char *message);
+static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   int nrelations, Relation relations[], ReorderBufferChange *change);
+
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
 
 /*
@@ -189,6 +206,39 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->commit = commit_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
+	/*
+	 * To support streaming, we require start/stop/abort/commit/change
+	 * callbacks. The message and truncate callbacks are optional, similar to
+	 * regular output plugins. We however consider streaming enabled when at
+	 * least one of the callbacks is provided, so that missing required
+	 * callbacks can be easily identified and reported.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
+		(ctx->callbacks.stream_stop_cb != NULL) ||
+		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_commit_cb != NULL) ||
+		(ctx->callbacks.stream_change_cb != NULL) ||
+		(ctx->callbacks.stream_message_cb != NULL) ||
+		(ctx->callbacks.stream_truncate_cb != NULL);
+
+	/*
+	 * streaming callbacks
+	 *
+	 * stream_message and stream_truncate callbacks are optional, so we do not
+	 * fail with ERROR when missing, but the wrappers simply do nothing. We
+	 * must set the ReorderBuffer callbacks to something, otherwise the calls
+	 * from there will crash (we don't want to move the checks there).
+	 */
+	ctx->reorder->stream_start = stream_start_cb_wrapper;
+	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
+	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
+	ctx->reorder->stream_change = stream_change_cb_wrapper;
+	ctx->reorder->stream_message = stream_message_cb_wrapper;
+	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -866,6 +916,307 @@ message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr first_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_start";
+	state.report_location = first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this message's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = first_lsn;
+
+	/* in streaming mode, stream_start_cb is required */
+	if (ctx->callbacks.stream_start_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_start_cb callback")));
+
+	ctx->callbacks.stream_start_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+					   XLogRecPtr last_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_stop";
+	state.report_location = last_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this message's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = last_lsn;
+
+	/* in streaming mode, stream_stop_cb is required */
+	if (ctx->callbacks.stream_stop_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_stop_cb callback")));
+
+	ctx->callbacks.stream_stop_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort";
+	state.report_location = abort_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = abort_lsn;
+
+	/* in streaming mode, stream_abort_cb is required */
+	if (ctx->callbacks.stream_abort_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_abort_cb callback")));
+
+	ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode, stream_commit_cb is required */
+	if (ctx->callbacks.stream_commit_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_cb callback")));
+
+	ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_change";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	/* in streaming mode, stream_change_cb is required */
+	if (ctx->callbacks.stream_change_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_change_cb callback")));
+
+	ctx->callbacks.stream_change_cb(ctx, txn, relation, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr message_lsn, bool transactional,
+						  const char *prefix, Size message_size, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (ctx->callbacks.stream_message_cb == NULL)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_message";
+	state.report_location = message_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn != NULL ? txn->xid : InvalidTransactionId;
+	ctx->write_location = message_lsn;
+
+	/* do the actual work: call callback */
+	ctx->callbacks.stream_message_cb(ctx, txn, message_lsn, transactional, prefix,
+									 message_size, message);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   int nrelations, Relation relations[],
+						   ReorderBufferChange *change)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* this callback is optional */
+	if (!ctx->callbacks.stream_truncate_cb)
+		return;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_truncate";
+	state.report_location = change->lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+
+	/*
+	 * report this change's lsn so replies from clients can give an up2date
+	 * answer. This won't ever be enough (and shouldn't be!) to confirm
+	 * receipt of this transaction, but it might allow another transaction's
+	 * commit to be confirmed with one message.
+	 */
+	ctx->write_location = change->lsn;
+
+	ctx->callbacks.stream_truncate_cb(ctx, txn, nrelations, relations, change);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Set the required catalog xmin horizon for historic snapshots in the current
  * replication slot.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c2f2475e5d..deef31825d 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -79,6 +79,11 @@ typedef struct LogicalDecodingContext
 	 */
 	void	   *output_writer_private;
 
+	/*
+	 * Does the output plugin support streaming, and is it enabled?
+	 */
+	bool		streaming;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 3dd9236c57..b78c796450 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,67 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Called when starting to stream a block of changes from an in-progress
+ * transaction (may be called repeatedly, if it's streamed in multiple
+ * chunks).
+ */
+typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn);
+
+/*
+ * Called when stopping the streaming of a block of changes from an
+ * in-progress transaction to a remote node (may be called repeatedly, if
+ * it's streamed in multiple chunks).
+ */
+typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn);
+
+/*
+ * Called to discard changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/*
+ * Called to apply changes streamed to a remote node from an in-progress
+ * transaction.
+ */
+typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Callback for streaming individual changes from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/*
+ * Callback for streaming generic logical decoding messages from in-progress
+ * transactions.
+ */
+typedef void (*LogicalDecodeStreamMessageCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix,
+											  Size message_size,
+											  const char *message);
+
+/*
+ * Callback for streaming truncates from in-progress transactions.
+ */
+typedef void (*LogicalDecodeStreamTruncateCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 /*
  * Output plugin callbacks
  */
@@ -112,6 +173,14 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+	/* streaming of changes */
+	LogicalDecodeStreamStartCB stream_start_cb;
+	LogicalDecodeStreamStopCB stream_stop_cb;
+	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamCommitCB stream_commit_cb;
+	LogicalDecodeStreamChangeCB stream_change_cb;
+	LogicalDecodeStreamMessageCB stream_message_cb;
+	LogicalDecodeStreamTruncateCB stream_truncate_cb;
 } OutputPluginCallbacks;
 
 /* Functions in replication/logical/logical.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1055e99e2e..42bc817648 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -348,6 +348,54 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* start streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStartCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr first_lsn);
+
+/* stop streaming transaction callback signature */
+typedef void (*ReorderBufferStreamStopCB) (
+										   ReorderBuffer *rb,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr last_lsn);
+
+/* discard streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortCB) (
+											ReorderBuffer *rb,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
+
+/* commit streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* stream change callback signature */
+typedef void (*ReorderBufferStreamChangeCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 Relation relation,
+											 ReorderBufferChange *change);
+
+/* stream message callback signature */
+typedef void (*ReorderBufferStreamMessageCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr message_lsn,
+											  bool transactional,
+											  const char *prefix, Size sz,
+											  const char *message);
+
+/* stream truncate callback signature */
+typedef void (*ReorderBufferStreamTruncateCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   int nrelations,
+											   Relation relations[],
+											   ReorderBufferChange *change);
+
 struct ReorderBuffer
 {
 	/*
@@ -386,6 +434,17 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction.
+	 */
+	ReorderBufferStreamStartCB stream_start;
+	ReorderBufferStreamStopCB stream_stop;
+	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamCommitCB stream_commit;
+	ReorderBufferStreamChangeCB stream_change;
+	ReorderBufferStreamMessageCB stream_message;
+	ReorderBufferStreamTruncateCB stream_truncate;
+
 	/*
 	 * Pointer that will be passed untouched to the callbacks.
 	 */
-- 
2.23.0

v43/v43-0003-Extend-the-BufFile-interface-for-the-streaming-o.patch:

From a455b0e8491234f3b792d8262c61fe67fd6208af Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v43 3/7] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up
to a particular offset.  Extend the BufFileSeek API to support the
SEEK_END case.  Add an option to provide a mode while opening shared
BufFiles instead of always opening them in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++---
 src/backend/storage/file/fd.c             |  9 +--
 src/backend/storage/file/sharedfileset.c  | 98 +++++++++++++++++++++--
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2da2..6c97f68671 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 2d7a082320..a9ca5d929c 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and be
+ * opened and closed multiple times.  Such files need to be created as a
+ * member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY);
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the file size of the last file to get the last offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files up to the fileno that we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files beyond the fileno can be deleted directly.  If the offset
+		 * is 0 then the fileno file can be deleted as well, unless it is
+		 * the first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420efb2..f376a97ed6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594756..9a3dc102f5 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * This interface can also be used when the temporary files are used by a
+ * single backend but need to be opened and closed multiple times and the
+ * underlying files need to survive across transactions.  For such cases,
+ * the dsm segment 'seg' should be passed as NULL.  Such files are removed
+ * on proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering the
+			 * fileset cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -222,6 +254,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 		SharedFileSetDeleteAll(fileset);
 }
 
+/*
+ * Callback function that will be invoked on the process exit.  This will
+ * process the list of all the registered sharedfilesets and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is following the dsm-based cleanup then we don't
+	 * maintain the filesetlist, so there is nothing to unregister.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
 /*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59c50..788815cdab 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a4303b..b83fb50dac 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..807a9c1edf 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752bab0d..fc34c49522 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d7df..e209f047e8 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf077e5..d5edb600af 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
2.23.0

v43/v43-0002-Implement-streaming-mode-in-ReorderBuffer.patch

From 440611d606409b406b0e0c62924185fc82abdb71 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v43 2/7] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast chunk or an unconfirmed speculative insert, we spill to
disk because we cannot assemble the complete tuple to stream.  As soon as
we get the complete tuple, we stream the transaction, including the
serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction.  On receipt of this
sqlerrcode, the decoding logic aborts the ongoing decoding and returns
gracefully.

Each ReorderBufferChange carries a ReorderBufferTXN pointer, by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
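
The stream API consumed here maps onto a set of optional output-plugin
callbacks.  As a rough sketch of how a plugin opts in (the plugin_*
implementations are placeholders; contrib/test_decoding contains complete
ones, pg_decode_stream_start and friends):

#include "postgres.h"

#include "fmgr.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* the usual transactional callbacks */
	cb->startup_cb = plugin_startup;
	cb->begin_cb = plugin_begin_txn;
	cb->change_cb = plugin_change;
	cb->commit_cb = plugin_commit_txn;
	cb->shutdown_cb = plugin_shutdown;

	/* optional callbacks for streaming of in-progress transactions */
	cb->stream_start_cb = plugin_stream_start;
	cb->stream_stop_cb = plugin_stream_stop;
	cb->stream_abort_cb = plugin_stream_abort;
	cb->stream_commit_cb = plugin_stream_commit;
	cb->stream_change_cb = plugin_stream_change;
	cb->stream_message_cb = plugin_stream_message;
	cb->stream_truncate_cb = plugin_stream_truncate;
}
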
---
 contrib/test_decoding/Makefile                |   2 +-
 contrib/test_decoding/expected/stream.out     |  66 ++
 contrib/test_decoding/expected/truncate.out   |   6 +
 contrib/test_decoding/sql/stream.sql          |  21 +
 contrib/test_decoding/sql/truncate.sql        |   1 +
 contrib/test_decoding/test_decoding.c         |  13 +
 doc/src/sgml/logicaldecoding.sgml             |   9 +-
 doc/src/sgml/test-decoding.sgml               |  22 +
 src/backend/access/heap/heapam.c              |  13 +
 src/backend/access/heap/heapam_visibility.c   |  42 +-
 src/backend/access/index/genam.c              |  53 +
 src/backend/access/table/tableam.c            |   8 +
 src/backend/access/transam/xact.c             |  19 +
 src/backend/replication/logical/decode.c      |  17 +-
 src/backend/replication/logical/logical.c     |  10 +
 .../replication/logical/reorderbuffer.c       | 969 ++++++++++++++++--
 src/include/access/heapam_xlog.h              |   1 +
 src/include/access/tableam.h                  |  55 +
 src/include/access/xact.h                     |   4 +
 src/include/replication/logical.h             |   1 +
 src/include/replication/reorderbuffer.h       |  56 +-
 21 files changed, 1282 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c582a5..ed9a3d6c0e 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000000..26ea8caf5f
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,66 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+COMMIT;
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ opening a streamed block for transaction
+ streaming message: transactional: 1 prefix: test, sz: 50
+ closing a streamed block for transaction
+ aborting streamed (sub)transaction
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ committing streamed transaction
+(27 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae835c..e64d377214 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000000..8889c3a59b
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+COMMIT;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0881..5633854e0d 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index dbef52a3af..d8e2b416c6 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 791a62b57c..1571d71a5b 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
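
For illustration, a plugin would scan such a table roughly as below
(my_catalog_relid is a hypothetical OID; the concurrent-abort check added
by this patch fires inside the systable_* calls):

#include "postgres.h"

#include "access/genam.h"
#include "access/table.h"
#include "utils/rel.h"

static void
scan_user_catalog(Oid my_catalog_relid)
{
	Relation	rel;
	SysScanDesc scan;
	HeapTuple	tup;

	rel = table_open(my_catalog_relid, AccessShareLock);
	scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);

	while (HeapTupleIsValid(tup = systable_getnext(scan)))
	{
		/*
		 * Use the tuple.  If the streamed (sub)transaction aborts
		 * concurrently, systable_getnext() raises an error with
		 * ERRCODE_TRANSACTION_ROLLBACK and the decoding of this
		 * transaction stops gracefully.
		 */
	}

	systable_endscan(scan);
	table_close(rel, AccessShareLock);
}
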
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d67b..fe7c9783fa 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4cd46..33a45800b3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1288,6 +1288,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at the
+	 * tableam level API, but heap_getnext is called from many places, so we
+	 * need to ensure the check here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1945,6 +1955,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aa..c77128087c 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae39a..9d9a70a354 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,9 +430,36 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
+/*
+ * HandleConcurrentAbort - Handle a concurrent abort of CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted.  We can't use
+ * TransactionIdDidAbort directly, because after a crash such a transaction
+ * might not have been marked as aborted.  See detailed comments in xact.c
+ * where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
 /*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29559..a61e279d68 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -234,6 +234,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	Relation	rel = scan->rs_rd;
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
+	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29847..99722eea4b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -82,6 +82,19 @@ bool		XactDeferrable;
 
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
+/*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  Such a
+ * transaction can get aborted while decoding is ongoing, in which case we
+ * skip decoding that particular transaction.  To detect this, we check
+ * whether CheckXidAlive has aborted after fetching each tuple from the
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
 /*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f3a1c31a29..f21f61d5e1 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 05d24b93da..42f284b33f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ce6e62152f..c469536b5f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes, so if we have a partial change, like a
+ * toast table insert or a speculative insert, we mark such a 'txn' so that it
+ * can't be streamed.  We also ensure that if the changes in such a 'txn' are
+ * above the logical_decoding_work_mem threshold, we stream them as soon as we
+ * have a complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get an insert or update on the
+	 * main table (both update and insert will do the insert in the toast
+	 * table).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * Speculative confirm change must be preceded by speculative insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it was serialized before and the changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for streaming such a transaction as soon as we get the
+	 * complete change is that it has already reached the memory threshold
+	 * but could not be streamed earlier because of incomplete changes.
+	 * Delaying such transactions would increase their apply lag.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed when we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes, we detected that the transaction
+	 * has aborted, so there is no point in collecting further changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -763,6 +883,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 #endif
 }
 
+/*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
 /*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1309,6 +1467,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		dlist_delete(&txn->base_snapshot_node);
 	}
 
+	/*
+	 * Clean up the snapshot from the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
 	/*
 	 * Remove TXN from its containing list.
 	 *
@@ -1334,6 +1501,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	ReorderBufferReturnTXN(rb, txn);
 }
 
+/*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they originally happened inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is,
+	 * when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
 /*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1485,57 +1737,178 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction, the (sub)transaction might get
+ * aborted concurrently.  In such a case, if the (sub)transaction has catalog
+ * updates, we might decode a tuple using the wrong catalog version.  So, to
+ * detect a concurrent abort, we set CheckXidAlive to the xid of the
+ * (sub)transaction that the current change belongs to.  During a catalog
+ * scan we check the status of that xid, and if it has aborted we report a
+ * specific error so that we can stop streaming the current transaction and
+ * discard the already streamed changes.  We might have already streamed some
+ * of the changes for the aborted (sub)transaction, but that is fine, because
+ * when we decode the abort we will stream an abort message to truncate the
+ * changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid has aborted; that will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying a change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse the same while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.  This resets the TXN such that it
+ * can be used to stream the remaining data of the transaction being
+ * processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1931,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1947,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream_start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2053,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2090,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2119,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2152,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2164,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2195,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2241,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2249,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with the current changes; send the last message for this set
+		 * of changes depending on the streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2295,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2334,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions have to be processed beforehand by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read, for both streamed
+ * and non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then send the stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1931,6 +2460,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that loaded
+		 * cache entries based on this transaction's view of the catalogs
+		 * (consider DDL executed within this transaction). We don't want
+		 * the decoding of future transactions to use those cache entries,
+		 * so execute the invalidations.
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2545,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2631,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2680,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we additionally track the total size in the
+ * toplevel transaction - subtransactions can't be streamed individually
+ * anyway, and we only ever pick toplevel transactions for streaming
+ * eviction, so only the toplevel counters matter there.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2702,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2715,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming is supported, update the total size in the top level as well. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2387,6 +2970,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because with streaming we don't
+ * maintain the memory accounting for subtransactions, so their size is
+ * always 0). But here we can simply iterate over the limited number of
+ * toplevel transactions.
+ *
+ * Note that we skip transactions that contain incomplete changes. There is
+ * room for optimization here: we could select the largest transaction that
+ * has only complete changes. But that would make the code and design
+ * considerably more complex, and might not be worth the benefit. To stream
+ * transactions containing incomplete changes we would need a way to
+ * partially stream/truncate the transaction changes in memory, a mechanism
+ * to partially truncate the spilled files, and for every partial stream we
+ * would have to remember the last streamed LSN (the WAL segment and offset
+ * to restore from). And because we stream the changes from the top
+ * transaction but restore them per subtransaction, we would even have to
+ * remember the subxact from which we streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
@@ -2419,11 +3047,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3151,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3363,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * Even if streaming is enabled, we cannot stream a transaction while we
+	 * are still skipping records, i.e. while re-reading a transaction that
+	 * was already decoded before a restart.
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all the subtransactions to
+	 * the snapshot's xip array via SnapBuildCommittedTxn, we can't do that
+	 * here; instead we add them to the subxip array via
+	 * ReorderBufferCopySnap. This allows the catalog changes made in the
+	 * subtransactions decoded so far to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database so far, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there's no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3593,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4302,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4592,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cdb12..aa17f7df84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0d28f01ca9..b188427563 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 53480116a4..c18554bae2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef31825d..b0fae9808b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817648..1ae17d5f11 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -248,6 +285,13 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
+	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
 	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
2.23.0

#457Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#455)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Today, I have again looked at the first patch
(v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
find any more problems with it, so I am planning to commit it unless
you or someone else wants to add more to it. Just for the ease of others,
"the next patch extends the logical decoding output plugin API with
stream methods". It adds seven methods to the output plugin API,
adding support for streaming changes for large in-progress
transactions. The methods are stream_start, stream_stop, stream_abort,
stream_commit, stream_change, stream_message, and stream_truncate.
Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction (or subtransaction) is
incomplete and may be aborted later (which is something the regular
API does not really need to deal with).
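
(Illustrative sketch, not part of the patch: roughly how an output plugin
would wire up these callbacks. The my_* handler names are hypothetical --
two of them are sketched after the next paragraph -- but the stream_*_cb
fields match the ones this commit adds to OutputPluginCallbacks.)

#include "postgres.h"

#include "fmgr.h"
#include "replication/output_plugin.h"

PG_MODULE_MAGIC;

/* hypothetical handlers, defined elsewhere in the plugin */
extern void my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn);
extern void my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn);
extern void my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							XLogRecPtr abort_lsn);
extern void my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							 XLogRecPtr commit_lsn);
extern void my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							 Relation relation, ReorderBufferChange *change);
extern void my_stream_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							  XLogRecPtr message_lsn, bool transactional,
							  const char *prefix, Size message_size,
							  const char *message);
extern void my_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
							   int nrelations, Relation relations[],
							   ReorderBufferChange *change);

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* the regular (non-streaming) callbacks are registered as before */

	/* the new streaming callbacks */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;
	cb->stream_message_cb = my_stream_message;
	cb->stream_truncate_cb = my_stream_truncate;
}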

This also extends the 'test_decoding' plugin, implementing these new
stream methods. The stream_start/stream_stop are used to demarcate a
chunk of changes streamed for a particular toplevel transaction.
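
(Again purely illustrative, with hypothetical my_* names and message strings:
demarcation handlers in the spirit of test_decoding might look like this.)

#include "postgres.h"

#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

/* Sketch: open a block of changes streamed for one toplevel transaction. */
void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out,
					 "opening a streamed block for transaction TXN %u",
					 txn->xid);
	OutputPluginWrite(ctx, true);
}

/* Sketch: close the block once this chunk of changes has been sent. */
void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out,
					 "closing a streamed block for transaction TXN %u",
					 txn->xid);
	OutputPluginWrite(ctx, true);
}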

This commit simply adds these new APIs and the upcoming patch to
"allow the streaming mode in ReorderBuffer" will use these APIs.

LGTM

Pushed. Feel free to submit the remaining patches.

--
With Regards,
Amit Kapila.

#458Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#457)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Jul 28, 2020 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

[...]

Pushed. Feel free to submit the remaining patches.

Thanks, please find the rebased patch set.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v44.tar (application/x-tar)

v44/v44-0001-Implement-streaming-mode-in-ReorderBuffer.patch:

From aab119ec147c7d7a06eaf8ddb4c9b830b44a18ee Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v44 1/6] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast or speculative insert, we spill to disk because we cannot
generate and stream the complete tuple. As soon as we get the complete
tuple, we stream the transaction including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. The decoding logic on the
receipt of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.

We have a ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/stream.out       |  66 ++
 contrib/test_decoding/expected/truncate.out     |   6 +
 contrib/test_decoding/sql/stream.sql            |  21 +
 contrib/test_decoding/sql/truncate.sql          |   1 +
 contrib/test_decoding/test_decoding.c           |  13 +
 doc/src/sgml/logicaldecoding.sgml               |   9 +-
 doc/src/sgml/test-decoding.sgml                 |  22 +
 src/backend/access/heap/heapam.c                |  13 +
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/access/index/genam.c                |  53 ++
 src/backend/access/table/tableam.c              |   8 +
 src/backend/access/transam/xact.c               |  19 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/logical.c       |  10 +
 src/backend/replication/logical/reorderbuffer.c | 969 +++++++++++++++++++++---
 src/include/access/heapam_xlog.h                |   1 +
 src/include/access/tableam.h                    |  55 ++
 src/include/access/xact.h                       |   4 +
 src/include/replication/logical.h               |   1 +
 src/include/replication/reorderbuffer.h         |  56 +-
 21 files changed, 1282 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..26ea8ca
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,66 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+COMMIT;
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ opening a streamed block for transaction
+ streaming message: transactional: 1 prefix: test, sz: 50
+ closing a streamed block for transaction
+ aborting streamed (sub)transaction
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ committing streamed transaction
+(27 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..8889c3a
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+COMMIT;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index dbef52a..d8e2b41 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 791a62b..1571d71 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2c9bb0c..717da4c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1298,6 +1298,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at the
+	 * tableam API level, but this function is called from many places, so we
+	 * need to ensure it here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1955,6 +1965,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive is aborted. We can't directly use
+ * TransactionIdDidAbort, as after a crash such a transaction might not
+ * have been marked as aborted.  See detailed comments in xact.c where the
+ * variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4e8553d..6e9bb87 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -249,6 +249,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..99722ee 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching a tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f3a1c31..f21f61d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 05d24b9..42f284b 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ce6e621..c469536 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes, so if we have a partial change like a
+ * toast table insert or a speculative insert then we mark such a 'txn' so
+ * that it can't be streamed.  We also ensure that if the changes in such a
+ * 'txn' exceed the logical_decoding_work_mem threshold then we stream them
+ * as soon as we have a complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get the insert or update on the
+	 * main table (both update and insert perform the insert into the toast
+	 * table).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * A speculative confirm change must be preceded by a speculative insert.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it was serialized before and its changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for streaming such a transaction as soon as we get the
+	 * complete change for it is that it has previously reached the memory
+	 * threshold but could not be streamed because of its incomplete changes.
+	 * Delaying such transactions would only increase their apply lag.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed once we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * If, while streaming the previous changes, we have detected that the
+	 * transaction was aborted, there is no point in collecting further
+	 * changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -764,6 +884,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1310,6 +1468,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Clean up the snapshot from the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1502,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even
+		 * if they originally happened inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is,
+	 * when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1737,178 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then send the stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has a catalog update then we might decode the tuple using
+ * the wrong catalog version.  So to detect a concurrent abort we set
+ * CheckXidAlive to the xid of the current (sub)transaction to which this
+ * change belongs.  During a catalog scan we can then check the status of
+ * that xid, and if it is aborted we report a specific error so that we can
+ * stop streaming the current transaction and discard the already streamed
+ * changes on such an error.  We might have already streamed some of the
+ * changes for the aborted (sub)transaction, but that is fine because when we
+ * decode the abort we will stream an abort message to truncate the changes
+ * in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive then
+	 * there is nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the transaction is not committed yet.  We don't
+	 * check whether the xid aborted; that will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Store the command id and snapshot at the end of the current stream so that
+ * we can reuse them while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN such that it
+ * can be used to stream the remaining data of transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1931,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1947,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream_start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2053,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2090,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2119,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2152,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2164,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2195,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2241,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2249,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, send the last message for this set of
+		 * changes depending upon streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2295,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2334,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read, for both streamed
+ * and non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then send the stream_commit message.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1931,6 +2460,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could load
+		 * the cache as per the current transaction's view (consider DDLs that
+		 * happened in this transaction). We don't want the decoding of future
+		 * transactions to use those cache entries, so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2545,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2631,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2680,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, while the transaction counter
+ * allows us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2702,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2715,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming supported, update the total size in top level as well. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2388,6 +2971,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't maintain the total
+ * size for subtransactions when streaming, so it's always 0). But we can simply
+ * iterate over the limited number of toplevel transactions.
+ *
+ * Note that we skip transactions that contain incomplete changes.  There is
+ * scope for optimization here: we could select the largest transaction that
+ * has only complete changes.  But that would make the code and design quite
+ * complex, and might not be worth the benefit.  If we were to stream
+ * transactions that contain incomplete changes, we would need a way to
+ * partially stream/truncate the transaction changes in memory, and a
+ * mechanism to partially truncate the spilled files.  Additionally, whenever
+ * we partially stream a transaction we would need to remember the last
+ * streamed lsn, so that next time we could restore from that segment and
+ * offset in the WAL.  And as we stream changes from the top transaction but
+ * restore them per subtransaction, we would even need to remember the
+ * subxact from which we streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +3047,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3151,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3363,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * We can't start streaming immediately, even if streaming is enabled,
+	 * because we may have previously decoded this transaction and now just
+	 * be restarting.
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds xids of all the subtransactions to the
+	 * snapshot's xip array via SnapBuildCommittedTxn, we can't do that here;
+	 * instead we add them to the subxip array via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in subtransactions decoded till
+	 * now to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run.  We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there's no need to
+		 * walk through the subxacts again).  In fact, we must not do that, as
+		 * we may be using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because we might have
+		 * gotten some new sub-transactions after the last streaming run. So
+		 * we need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3593,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4302,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4592,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
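To make the partial-change bookkeeping above concrete: for an UPDATE of a
row with a toasted column, the WAL contains the toast-table insert(s) first
and the main-table update last, so the flags on the toplevel txn evolve
roughly like this (illustration only):

    toast INSERT  -> RBTXN_HAS_TOAST_INSERT set     (txn not streamable)
    toast INSERT  -> flag stays set
    main UPDATE   -> RBTXN_HAS_TOAST_INSERT cleared (txn streamable again)

This is why ReorderBufferProcessPartialChange() can kick off streaming of a
previously serialized transaction as soon as the incomplete-change flags
become empty.
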
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
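
To illustrate how this new flag is consumed: the insert decoder can derive
the toast_insert argument of ReorderBufferQueueChange() directly from it.
A minimal sketch (the corresponding decode.c change is in another part of
this series, so treat the helper name below as illustrative):

static bool
insert_is_toast_insert(xl_heap_insert *xlrec)
{
	/* set by the heapam when the insert target is a toast relation */
	return (xlrec->flags & XLH_INSERT_ON_TOAST_RELATION) != 0;
}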
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7ba72c8..387eb34 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
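
The errors above are only a backstop against unexpected call paths.  On the
expected path -- systable scans during decoding -- the idea is that catalog
access checks whether the xid we are decoding has aborted underneath us and
bails out with ERRCODE_TRANSACTION_ROLLBACK, which ReorderBufferProcessTXN
catches (see the PG_CATCH block earlier).  A sketch of that check; the name
and exact placement are illustrative:

static inline void
HandleConcurrentAbort(void)
{
	/*
	 * If CheckXidAlive is set and that xid is neither in progress nor
	 * committed, it must have aborted concurrently, so stop decoding.
	 */
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}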
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5348011..c18554b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
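
ResetLogicalStreamingState() is intended to be called from the transaction
abort path, so that an error thrown while decoding does not leave stale
streaming state behind.  Its body is expected to be roughly (sketch):

void
ResetLogicalStreamingState(void)
{
	CheckXidAlive = InvalidTransactionId;
	bsysscan = false;
}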
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert, without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1
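
To summarize the plugin-facing side of the patch above: an output plugin
opts into streaming by providing the stream callbacks.  A sketch of the
registration -- the my_* handlers are placeholders, while the stream_*_cb
fields are the ones added by this series:

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* existing callbacks */
	cb->begin_cb = my_begin;
	cb->change_cb = my_change;
	cb->commit_cb = my_commit;

	/* streaming callbacks; the full set should be provided together */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;
}

A large transaction streamed in two chunks then arrives at the plugin as
stream_start, stream_change ..., stream_stop, stream_start, stream_change
..., stream_stop, and finally stream_commit (or stream_abort, if the
transaction rolled back concurrently).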

v44/v44-0002-Extend-the-BufFile-interface-for-the-streaming-o.patch

From 9aa5f7843c52a42926f042743b4a230207cd84e7 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v44 2/6] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up to
a particular offset.  Extend the BufFileSeek API to support the SEEK_END
case.  Add an option to provide a mode while opening shared BufFiles,
instead of always opening them in read-only mode.
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 88992c2..6c97f68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 2d7a082..a9ca5d9 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as a
+ * member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The file size of the last file gives us the end offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files up to the fileno that we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files other than the given fileno can be deleted directly.  The
+		 * fileno file itself can also be deleted if the offset is 0, unless
+		 * it is the first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
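
Taken together, these extensions support the following pattern in a
streaming apply worker: reopen a per-transaction shared BufFile read-write,
append new changes at the end, and truncate discarded changes on a
subtransaction abort.  A sketch, with a made-up file name and assuming
fcntl.h is included for O_RDWR:

static void
append_then_discard(SharedFileSet *fileset)
{
	BufFile    *file;

	/* reopen an existing file for read-write, using the new mode argument */
	file = BufFileOpenShared(fileset, "xid-513-changes", O_RDWR);

	/* position at the end of the file, using the new SEEK_END support */
	if (BufFileSeek(file, 0, 0, SEEK_END) != 0)
		elog(ERROR, "could not seek in temporary file");

	/* ... BufFileWrite() the new changes here ... */

	/* on abort, discard everything from segment 0, offset 0 onwards */
	BufFileTruncateShared(file, 0, 0);

	BufFileClose(file);
}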
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but need to be opened and closed multiple times, and the
+ * underlying files need to survive across transactions.  For such cases,
+ * the dsm segment 'seg' should be passed as NULL.  Such files are removed
+ * on proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering the
+			 * We must not have registered any fileset before registering
+			 * the cleanup callback.
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on process exit.  This will
+ * process the list of all the registered sharedfilesets and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell   *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach(l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool		found = false;
+	ListCell   *l;
+
+	/*
+	 * If the caller is using the dsm-based cleanup then we don't maintain
+	 * the filesetlist, so there is nothing to do.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach(l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
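
To make the new single-backend mode concrete, a rough sketch of the
intended lifecycle (names illustrative, error handling omitted). The
key point is that passing a NULL dsm segment defers cleanup to proc
exit instead of dsm detach, so the files survive across transactions:

	SharedFileSet *fileset;
	BufFile    *fd;

	/* must live until unregistered, so use a long-lived context */
	fileset = MemoryContextAlloc(TopMemoryContext, sizeof(SharedFileSet));
	SharedFileSetInit(fileset, NULL);	/* NULL => cleanup on proc exit */

	fd = BufFileCreateShared(fileset, "12345-67890.changes");
	/* ... write, close, and reopen later in another transaction ... */
	BufFileClose(fd);

	fd = BufFileOpenShared(fileset, "12345-67890.changes", O_RDWR);
	BufFileClose(fd);

	/* explicit cleanup: drop the proc-exit registration and the files */
	SharedFileSetUnregister(fileset);
	SharedFileSetDeleteAll(fileset);
	pfree(fileset);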
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1
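
One consumer-facing note on the SEEK_END support added to BufFileSeek()
at the top of this patch: it makes an "append" pattern possible for
shared BufFiles, roughly like this (sketch only; assumes the file was
created earlier with BufFileCreateShared, error paths omitted):

	BufFile    *fd = BufFileOpenShared(fileset, path, O_RDWR);

	if (BufFileSeek(fd, 0, 0, SEEK_END) != 0)
		elog(ERROR, "could not seek to end of BufFile \"%s\"", path);

	BufFileWrite(fd, data, len);
	BufFileClose(fd);

This is how the apply worker in 0003 reopens the changes file for each
new stream of the same transaction.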

v44/v44-0003-Add-support-for-streaming-to-built-in-replicatio.patch

From 19b409909d1305b1da4c88d1a455f531dafe0fe5 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v44 3/6] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information
(e.g. the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so there is nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 946 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 19 files changed, 2060 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal> and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
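
For illustration, the user-facing option would then be used like this
(object names hypothetical):

    CREATE SUBSCRIPTION mysub
           CONNECTION 'host=publisher dbname=postgres'
           PUBLICATION mypub
           WITH (streaming = on);

    ALTER SUBSCRIPTION mysub SET (streaming = off);

with streaming disabled by default, per the documentation above.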
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6c97f68..479e3ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e905723..a6101ac 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
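
With this, the START_REPLICATION command built here comes out roughly
as follows (publication and slot names illustrative; the proto_version
value is whatever the series ends up using):

    START_REPLICATION SLOT "mysub" LOGICAL 0/0
        (proto_version '2', streaming 'on', publication_names '"mypub"')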
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..ff25924 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
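
As a quick sanity check of the framing, a round-trip of the new
stream-start message might look like this (sketch; the dispatcher
consumes the action byte before calling the read function, mirroring
apply_dispatch on the subscriber):

	StringInfoData buf;
	TransactionId xid;
	bool		first_segment;

	initStringInfo(&buf);
	logicalrep_write_stream_start(&buf, 1234, true);

	if (pq_getmsgbyte(&buf) != 'S')		/* action byte */
		elog(ERROR, "unexpected message type");

	xid = logicalrep_read_stream_start(&buf, &first_segment);
	Assert(xid == 1234 && first_segment);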
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 2fcf2e6..98e7fd0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions has
+ * to handle aborts of both the toplevel transaction and subtransactions. This
+ * is achieved by tracking offsets for subtransactions, which is then used
+ * to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides a way to automatically clean up on error, and (c)
+ * it lets the files survive across local transactions so that they can be
+ * opened and closed at stream start and stop.  We decided to use the
+ * SharedFileSet infrastructure because without it the files are deleted as
+ * soon as they are closed, and keeping the stream files open across stream
+ * start/stop would consume a lot of memory (more than 8K per file).
+ * Moreover, without SharedFileSet we would also need to invent a new way to
+ * pass filenames to the BufFile APIs so that the desired file can be opened
+ * across multiple stream-open calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, create the streaming file, and store the fileset handle.  The
+ * subxact file is created iff there is any subxact info under this xid.
+ * This entry is used on subsequent streams for the xid to look up the
+ * corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with the shared
+ * file sets for the changes and subxact files.  On every stream start we need
+ * to open the xid's files, and for that we need the shared fileset handles;
+ * storing them in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the message to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +752,323 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside a remote transaction or inside
+	 * a streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the BufFile,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
+	 * just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1081,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1099,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1138,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1256,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1408,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1781,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1922,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2050,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  It is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2162,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2435,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if streaming option is changed. The launcher will start new
+	 * worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2481,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have created the entry for this transaction by now */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info.  There might be an
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * that memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need this information for the whole duration of the stream so that we
+	 * can add new subtransaction info to it.  On stream stop we flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - don't report an error for a missing file if this flag is true.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * BufFile, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFile under the logical streaming context so that it
+	 * stays open until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the end of the file because we always
+		 * append the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with a length (not including
+ * the length field itself), an action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
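
   To make the on-disk format concrete, the matching reader would look
   roughly like the sketch below. This is illustrative only (not part of
   the patch); it assumes BufFileRead() returns the number of bytes read
   and the length/action/payload layout described above:

	static bool
	stream_read_change(BufFile *fd, char *action, StringInfo s)
	{
		int			len;

		/* read the total size; a short read here means end of file */
		if (BufFileRead(fd, &len, sizeof(len)) != sizeof(len))
			return false;

		/* the action type character */
		BufFileRead(fd, action, sizeof(char));

		/* and the payload (len includes the action character) */
		resetStringInfo(s);
		enlargeStringInfo(s, len - sizeof(char));
		BufFileRead(fd, s->data, len - sizeof(char));
		s->len = len - sizeof(char);
		s->data[s->len] = '\0';

		return true;
	}
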
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2145,6 +3080,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * On the downstream, however, the schema cache is updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order in which the transactions are sent.  Also, the (sub)
+ * transactions might get aborted, so we need to send the schema for each
+ * (sub)transaction so that we don't lose the schema information on abort.
+ * To handle this, we maintain a list of xids (streamed_txns) for which we
+ * have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
+
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
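
   With this in place, a subscriber that wants streaming ends up issuing a
   replication command along the following lines (slot and publication
   names are just examples):

	START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
		(proto_version '2', publication_names '"tap_pub"', streaming 'on')
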
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficiently recent protocol
+		 * version, and only when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't
+	 * care whether it's a top-level transaction or not (we have already
+	 * sent that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may not be applied later (and the regular
+	 * transactions won't see their effects until then), and they may be
+	 * applied in an order that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
+ * Send the start of a streaming block for this transaction. The origin
+ * information is included only in the first stream of the transaction.
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * Send the end of the current streaming block.
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
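
   Putting the four callbacks together, a transaction streamed in two
   chunks produces a sequence like the following (a sketch of the expected
   flow, consistent with the asserts above):

	stream_start(xid)      -- first chunk; may carry the origin message
	  change / truncate ...
	stream_stop()
	stream_start(xid)      -- subsequent chunk; no origin
	  change ...
	stream_stop()
	stream_commit(xid)     -- or stream_abort(xid, subxid); always
	                          outside a start/stop block
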
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema for a relation was already sent in the given
+ * streamed transaction. We expect a relatively small number of streamed
+ * transactions, so a simple list search is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid in the rel sync entry for which we have already sent the schema
+ * of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		/* Skip entries whose schema was not sent in this transaction. */
+		if (!list_member_int(entry->streamed_txns, xid))
+			continue;
+
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema-sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming of in-progress
+								 * transactions */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..655144d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
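
   For reference, each of these pairs maps to a new single-byte message
   type on the wire, written by the corresponding logicalrep_write_stream_*
   routine. The byte values below are quoted from memory of the
   implementation, so treat them as illustrative rather than normative:

	Stream Start    'S'    xid, first_segment flag
	Stream Stop     'E'    (no payload)
	Stream Commit   'c'    xid, commit LSN and timestamps
	Stream Abort    'A'    xid, subxid
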
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of a large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of a large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v44/v44-0004-Enable-streaming-for-all-subscription-TAP-tests.patch

From a36cb462645fbda14b105e3745f804f8be5c3efb Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v44 4/6] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b5..21410fa 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v44/v44-0005-Add-TAP-test-for-streaming-vs.-DDL.patch

From cc4d046e70b9a0d71366f46bccb01970f1cd57f8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v44 5/6] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v44/v44-0006-Add-streaming-option-in-pg_dump.patch

From 2c97ee00fe7f65d1d25fcc036f323b9c23eb2c7d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v44 6/6] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 94459b3..f69d64c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char	   *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1

#459Ajin Cherian
itsajin@gmail.com
In reply to: Dilip Kumar (#458)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 29, 2020 at 3:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thanks, please find the rebased patch set.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

I was running some tests on this patch, mainly to see how it affects
logical replication throughput during bulk inserts. This issue has been
raised in the past, for example in [1].
My test setup is:
1. Two postgres servers running - A and B
2. Create a pgbench setup on A (pgbench -i -s 5 postgres).
3. Replicate the 3 tables (schema only) on B.
4. Three publications on A for the 3 pgbench tables: pgbench_accounts,
pgbench_branches and pgbench_tellers.
5. Three subscriptions on B for the same tables (streaming on or off
depending on the scenario described below).

Run pgbench with: pgbench -c 4 -T 100 postgres
While pgbench is running, do a bulk insert on some other table not in the
publication list (say t1): INSERT INTO t1 (SELECT i FROM
generate_series(1,10000000) i);

Four scenarios:
1. Pgbench with logical replication enabled, without bulk insert
Avg TPS (out of 10 runs): 641 TPS
2. Pgbench without logical replication (no pub/sub), with bulk insert
Avg TPS (out of 10 runs): 665 TPS
3. Pgbench with logical replication enabled, with bulk insert
Avg TPS (out of 10 runs): 278 TPS
4. Pgbench with logical replication with streaming on, with bulk insert
Avg TPS (out of 10 runs): 440 TPS

As you can see, the bulk inserts, although on a completely unaffected
table, do impact the TPS. The good news is that enabling streaming
improves the TPS by about 58% relative to the non-streaming case.
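
For reference, a minimal sketch of the scenario-4 setup described above
(host and object names are illustrative, not the exact commands I ran):

-- On publisher A, one publication per pgbench table (step 4):
CREATE PUBLICATION pub_accounts FOR TABLE pgbench_accounts;

-- On subscriber B, with streaming enabled (scenario 4):
CREATE SUBSCRIPTION sub_accounts
    CONNECTION 'host=A dbname=postgres'
    PUBLICATION pub_accounts
    WITH (streaming = on);

-- On A, concurrently with pgbench: bulk insert into a table that is not
-- part of any publication:
INSERT INTO t1 SELECT i FROM generate_series(1, 10000000) i;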

[1]: /messages/by-id/CAMsr+YE6aE6Re6smrMr-xCabRmCr=yzXEf2Yuv5upEDY5nMX8g@mail.gmail.com

regards,
Ajin Cherian
Fujitsu Australia

#460Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#459)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Jul 30, 2020 at 12:28 PM Ajin Cherian <itsajin@gmail.com> wrote:

I was running some tests on this patch, mainly to see how it affects logical replication throughput during bulk inserts. This issue has been raised in the past, for example in [1].
My test setup is:
1. Two postgres servers running - A and B
2. Create a pgbench setup on A (pgbench -i -s 5 postgres).
3. Replicate the 3 tables (schema only) on B.
4. Three publications on A for the 3 pgbench tables: pgbench_accounts, pgbench_branches and pgbench_tellers.
5. Three subscriptions on B for the same tables (streaming on or off depending on the scenario described below).

Run pgbench with: pgbench -c 4 -T 100 postgres
While pgbench is running, do a bulk insert on some other table not in the publication list (say t1): INSERT INTO t1 (SELECT i FROM generate_series(1,10000000) i);

Four scenarios:
1. Pgbench with logical replication enabled, without bulk insert
Avg TPS (out of 10 runs): 641 TPS
2. Pgbench without logical replication (no pub/sub), with bulk insert
Avg TPS (out of 10 runs): 665 TPS
3. Pgbench with logical replication enabled, with bulk insert
Avg TPS (out of 10 runs): 278 TPS
4. Pgbench with logical replication with streaming on, with bulk insert
Avg TPS (out of 10 runs): 440 TPS

As you can see, the bulk inserts, although on a completely unaffected table, do impact the TPS. The good news is that enabling streaming improves the TPS by about 58% relative to the non-streaming case.

Thanks for doing these tests. It is a good win, and the reason is probably
that after the patch we won't serialize such big transactions (the cost of
serialization was shown in Konstantin's email [1]); they will simply be
skipped. Basically, the decoder will try to stream such transactions and
will skip them, as their changes are not required to be sent.
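
A rough way to confirm this would be to watch the streaming counters
(added by the stats patch posted elsewhere in this thread) advance for
the bulk-insert transaction even though none of its rows reach the
subscriber, since the changes are filtered out by the publication. A
sketch, assuming that stats patch is applied:

-- On the publisher, while the bulk insert is running:
SELECT name, stream_txns, stream_count, stream_bytes
FROM pg_stat_replication_slots;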

[1]: /messages/by-id/5f5143cc-9f73-3909-3ef7-d3895cc6cc90@postgrespro.ru

--
With Regards,
Amit Kapila.

#461Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#460)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Attaching an updated patch with streaming statistics, based on v2 of
Sawada-san's replication slot stats framework and v44 of this patch
series. It is a single patch containing both the stats framework from
Sawada-san (1) and my streaming updates, so it can be applied easily on
top of v44.
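
Based on the patch, the intended usage looks roughly like this (a
sketch, not output from an actual run):

-- New view exposing per-slot spill/stream counters:
SELECT name, spill_txns, spill_count, spill_bytes,
       stream_txns, stream_count, stream_bytes
FROM pg_stat_replication_slots;

-- New reset function: reset one slot's counters, or pass NULL to reset
-- the counters of all replication slots:
SELECT pg_stat_reset_replication_slot('test_slot');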

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

streaming_stats_update.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 7dcddf4..d77fbeb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -315,6 +315,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
      </row>
 
      <row>
+      <entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
+      <entry>One row per replication slot, showing statistics about
+       replication slot usage.
+       See <link linkend="monitoring-pg-stat-replication-slots-view">
+       <structname>pg_stat_replication_slots</structname></link> for details.
+      </entry>
+     </row>
+
+     <row>
       <entry><structname>pg_stat_wal_receiver</structname><indexterm><primary>pg_stat_wal_receiver</primary></indexterm></entry>
       <entry>Only one row, showing statistics about the WAL receiver from
        that receiver's connected server.
@@ -2508,7 +2517,119 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 
  </sect2>
 
- <sect2 id="monitoring-pg-stat-wal-receiver-view">
+ <sect2 id="monitoring-pg-stat-replication-slots-view">
+  <title><structname>pg_stat_replication_slots</structname></title>
+
+  <indexterm>
+   <primary>pg_stat_replication_slots</primary>
+  </indexterm>
+
+   <para>
+    The <structname>pg_stat_replication_slots</structname> view will contain
+    one row per replication slot, showing statistics about replication
+    slot usage.
+   </para>
+
+   <table id="pg-stat-replication-slots-view" xreflabel="pg_stat_replication_slots">
+    <title><structname>pg_stat_replication_slots</structname> View</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="catalog_table_entry"><para role="column_definition">
+         Column Type
+        </para>
+        <para>
+         Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="catalog_table_entry"><para role="column_definition">
+         <structfield>name</structfield> <type>text</type>
+        </para>
+        <para>
+         A unique, cluster-wide identifier for the replication slot
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="catalog_table_entry"><para role="column_definition">
+         <structfield>spill_txns</structfield> <type>bigint</type>
+        </para>
+        <para>
+         Number of transactions spilled to disk after the memory used by
+         logical decoding exceeds <literal>logical_decoding_work_mem</literal>. The
+         counter gets incremented both for toplevel transactions and
+         subtransactions.
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="catalog_table_entry"><para role="column_definition">
+         <structfield>spill_count</structfield> <type>bigint</type>
+        </para>
+        <para>
+         Number of times transactions were spilled to disk. Transactions
+         may get spilled repeatedly, and this counter gets incremented on every
+         such invocation.
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="catalog_table_entry"><para role="column_definition">
+         <structfield>spill_bytes</structfield> <type>bigint</type>
+        </para>
+        <para>
+         Amount of decoded transaction data spilled to disk.
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="catalog_table_entry"><para role="column_definition">
+         <structfield>stream_txns</structfield> <type>bigint</type>
+        </para>
+        <para>
+         Number of in-progress transactions streamed to subscriber after
+         memory used by logical decoding exceeds <literal>logical_decoding_work_mem</literal>.
+         Streaming only works with toplevel transactions (subtransactions can't
+         be streamed independently), so the counter does not get incremented for
+         subtransactions.
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="catalog_table_entry"><para role="column_definition">
+         <structfield>stream_count</structfield> <type>bigint</type>
+        </para>
+        <para>
+         Number of times in-progress transactions were streamed to subscriber.
+         Transactions may get streamed repeatedly, and this counter gets incremented
+         on every such invocation.
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="catalog_table_entry"><para role="column_definition">
+         <structfield>stream_bytes</structfield> <type>bigint</type>
+        </para>
+        <para>
+         Amount of decoded in-progress transaction data streamed to subscriber.
+       </para></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    Tracking of spilled and streamed transactions works only for logical
+    replication.  In physical replication, the tracking mechanism will
+    display 0 for these statistics.
+   </para>
+  </sect2>
+
+  <sect2 id="monitoring-pg-stat-wal-receiver-view">
   <title><structname>pg_stat_wal_receiver</structname></title>
 
   <indexterm>
@@ -4707,6 +4828,26 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         can be granted EXECUTE to run the function.
        </para></entry>
       </row>
+
+      <row>
+        <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+          <primary>pg_stat_reset_replication_slot</primary>
+        </indexterm>
+        <function>pg_stat_reset_replication_slot</function> ( <type>text</type> )
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+         Resets statistics to zero for a single replication slot, or for all
+         replication slots in the cluster.  If the argument is NULL, all counters
+         shown in the <structname>pg_stat_replication_slots</structname> view for
+         all replication slots are reset.
+       </para>
+       <para>
+         This function is restricted to superusers by default, but other users
+         can be granted EXECUTE to run the function.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8625cbe..ceba837 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -790,6 +790,17 @@ CREATE VIEW pg_stat_replication AS
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
 
+CREATE VIEW pg_stat_replication_slots AS
+    SELECT
+            s.name,
+            s.spill_txns,
+            s.spill_count,
+            s.spill_bytes,
+            s.stream_txns,
+            s.stream_count,
+            s.stream_bytes
+    FROM pg_stat_get_replication_slots() AS s;
+
 CREATE VIEW pg_stat_slru AS
     SELECT
             s.name,
@@ -1441,6 +1452,7 @@ REVOKE EXECUTE ON FUNCTION pg_stat_reset_shared(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_stat_reset_slru(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_stat_reset_single_table_counters(oid) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_stat_reset_single_function_counters(oid) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_stat_reset_replication_slot(text) FROM public;
 
 REVOKE EXECUTE ON FUNCTION lo_import(text) FROM public;
 REVOKE EXECUTE ON FUNCTION lo_import(text, oid) FROM public;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 479e3ca..7aba571 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -51,6 +51,7 @@
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
+#include "replication/slot.h"
 #include "replication/walsender.h"
 #include "storage/backendid.h"
 #include "storage/dsm.h"
@@ -282,6 +283,8 @@ static int	localNumBackends = 0;
 static PgStat_ArchiverStats archiverStats;
 static PgStat_GlobalStats globalStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
+static PgStat_ReplSlotStats	*replSlotStats;
+static int	nReplSlotStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -340,6 +343,8 @@ static const char *pgstat_get_wait_io(WaitEventIO w);
 static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
 static void pgstat_send(void *msg, int len);
 
+static int pgstat_replslot_index(const char *name, bool create_it);
+
 static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
 static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
 static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
@@ -348,6 +353,7 @@ static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
 static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
 static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
 static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
+static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg, int len);
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
@@ -360,6 +366,7 @@ static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int le
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -1430,6 +1437,36 @@ pgstat_reset_slru_counter(const char *name)
 }
 
 /* ----------
+ * pgstat_reset_replslot_counter() -
+ *
+ *	Tell the statistics collector to reset a single replication slot
+ *	counter, or all replication slot counters (when name is NULL).
+ *
+ *	Permission checking for this function is managed through the normal
+ *	GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_replslot_counter(const char *name)
+{
+	PgStat_MsgResetreplslotcounter msg;
+
+	if (pgStatSock == PGINVALID_SOCKET)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
+	if (name)
+	{
+		memcpy(&msg.m_slotname, name, NAMEDATALEN);
+		msg.clearall = false;
+	}
+	else
+		msg.clearall = true;
+
+	pgstat_send(&msg, sizeof(msg));
+}
+
+/* ----------
  * pgstat_report_autovac() -
  *
  *	Called from autovacuum.c to report startup of an autovacuum process.
@@ -1629,6 +1666,49 @@ pgstat_report_tempfile(size_t filesize)
 	pgstat_send(&msg, sizeof(msg));
 }
 
+/* ----------
+ * pgstat_report_replslot() -
+ *
+ *	Tell the collector about replication slot statistics.
+ * ----------
+ */
+void
+pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
+					   int spillbytes, int streamtxns, int streamcount, int streambytes)
+{
+	PgStat_MsgReplSlot msg;
+
+	/*
+	 * Prepare and send the message
+	 */
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
+	memcpy(&msg.m_slotname, slotname, NAMEDATALEN);
+	msg.m_drop = false;
+	msg.m_spill_txns = spilltxns;
+	msg.m_spill_count = spillcount;
+	msg.m_spill_bytes = spillbytes;
+	msg.m_stream_txns = streamtxns;
+	msg.m_stream_count = streamcount;
+	msg.m_stream_bytes = streambytes;
+	pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
+}
+
+/* ----------
+ * pgstat_report_replslot_drop() -
+ *
+ *	Tell the collector about dropping the replication slot.
+ * ----------
+ */
+void
+pgstat_report_replslot_drop(const char *slotname)
+{
+	PgStat_MsgReplSlot msg;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
+	memcpy(&msg.m_slotname, slotname, NAMEDATALEN);
+	msg.m_drop = true;
+	pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
+}
 
 /* ----------
  * pgstat_ping() -
@@ -2691,6 +2771,23 @@ pgstat_fetch_slru(void)
 	return slruStats;
 }
 
+/*
+ * ---------
+ * pgstat_fetch_replslot() -
+ *
+ *	Support function for the SQL-callable pgstat* functions. Returns
+ *	a pointer to the replication slot statistics struct and sets the
+ *	number of entries in *nslots_p.
+ * ---------
+ */
+PgStat_ReplSlotStats *
+pgstat_fetch_replslot(int *nslots_p)
+{
+	backend_read_statsfile();
+
+	*nslots_p = nReplSlotStats;
+	return replSlotStats;
+}
 
 /* ------------------------------------------------------------
  * Functions for management of the shared-memory PgBackendStatus array
@@ -4630,6 +4727,11 @@ PgstatCollectorMain(int argc, char *argv[])
 												 len);
 					break;
 
+				case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
+					pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
+													 len);
+					break;
+
 				case PGSTAT_MTYPE_AUTOVAC_START:
 					pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
 					break;
@@ -4680,6 +4782,10 @@ PgstatCollectorMain(int argc, char *argv[])
 												 len);
 					break;
 
+				case PGSTAT_MTYPE_REPLSLOT:
+					pgstat_recv_replslot(&msg.msg_replslot, len);
+					break;
+
 				default:
 					break;
 			}
@@ -4883,6 +4989,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
 	int			rc;
+	int			i;
 
 	elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4930,6 +5037,16 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	(void) rc;					/* we'll check for error with ferror */
 
 	/*
+	 * Write replication slot stats struct
+	 */
+	for (i = 0; i < nReplSlotStats; i++)
+	{
+		fputc('R', fpout);
+		rc = fwrite(&replSlotStats[i], sizeof(PgStat_ReplSlotStats), 1, fpout);
+		(void) rc;				/* we'll check for error with ferror */
+	}
+
+	/*
 	 * Walk through the database table.
 	 */
 	hash_seq_init(&hstat, pgStatDBHash);
@@ -5181,6 +5298,10 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
 						 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
+	/* Allocate the space for replication slot statistics */
+	replSlotStats = palloc0(max_replication_slots * sizeof(PgStat_ReplSlotStats));
+	nReplSlotStats = 0;
+
 	/*
 	 * Clear out global and archiver statistics so they start from zero in
 	 * case we can't load an existing statsfile.
@@ -5203,6 +5324,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
 
 	/*
+	 * Set the same reset timestamp for all replication slots too.
+	 */
+	for (i = 0; i < max_replication_slots; i++)
+		replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+
+	/*
 	 * Try to open the stats file. If it doesn't exist, the backends simply
 	 * return zero for anything and the collector simply starts from scratch
 	 * with empty counters.
@@ -5365,6 +5492,23 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 
 				break;
 
+				/*
+				 * 'R'	A PgStat_ReplSlotStats struct describing a replication slot
+				 * follows.
+				 */
+			case 'R':
+				if (fread(&replSlotStats[nReplSlotStats], 1, sizeof(PgStat_ReplSlotStats), fpin)
+					!= sizeof(PgStat_ReplSlotStats))
+				{
+					ereport(pgStatRunningInCollector ? LOG : WARNING,
+							(errmsg("corrupted statistics file \"%s\"",
+									statfile)));
+					memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
+					goto done;
+				}
+				nReplSlotStats++;
+				break;
+
 			case 'E':
 				goto done;
 
@@ -5574,6 +5718,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_GlobalStats myGlobalStats;
 	PgStat_ArchiverStats myArchiverStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
+	PgStat_ReplSlotStats myReplSlotStats;
 	FILE	   *fpin;
 	int32		format_id;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5676,6 +5821,21 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 
 				break;
 
+				/*
+				 * 'R'	A PgStat_ReplSlotStats struct describing a replication slot
+				 * follows.
+				 */
+			case 'R':
+				if (fread(&myReplSlotStats, 1, sizeof(PgStat_ReplSlotStats), fpin)
+					!= sizeof(PgStat_ReplSlotStats))
+				{
+					ereport(pgStatRunningInCollector ? LOG : WARNING,
+							(errmsg("corrupted statistics file \"%s\"",
+									statfile)));
+					goto done;
+				}
+				break;
+
 			case 'E':
 				goto done;
 
@@ -6263,6 +6423,48 @@ pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
 }
 
 /* ----------
+ * pgstat_recv_resetreplslotcounter() -
+ *
+ *	Reset some replication slot statistics of the cluster.
+ * ----------
+ */
+static void
+pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
+								 int len)
+{
+	int			i;
+	int			idx = -1;
+	TimestampTz ts;
+
+	if (!msg->clearall)
+	{
+		/* Get the index of replication slot statistics to reset */
+		idx = pgstat_replslot_index(msg->m_slotname, false);
+
+		if (idx < 0)
+			return;	/* not found */
+	}
+
+	ts = GetCurrentTimestamp();
+	for (i = 0; i < nReplSlotStats; i++)
+	{
+		/* reset entry with the given index, or all entries (index is -1) */
+		if (msg->clearall || idx == i)
+		{
+			/* reset only counters. Don't clear slot name */
+			replSlotStats[i].spill_txns = 0;
+			replSlotStats[i].spill_count = 0;
+			replSlotStats[i].spill_bytes = 0;
+			replSlotStats[i].stream_txns = 0;
+			replSlotStats[i].stream_count = 0;
+			replSlotStats[i].stream_bytes = 0;
+			replSlotStats[i].stat_reset_timestamp = ts;
+		}
+	}
+}
+
+
+/* ----------
  * pgstat_recv_autovac() -
  *
  *	Process an autovacuum signaling message.
@@ -6509,6 +6711,80 @@ pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
 	dbentry->last_checksum_failure = msg->m_failure_time;
 }
 
+/*
+ * pgstat_replslot_index
+ *
+ * Return the index of entry of a replication slot with the given name, or
+ * -1 if the slot is not found.  If create_it is true, this function creates
+ * a statistics entry for the replication slot if one does not exist.
+ */
+static int
+pgstat_replslot_index(const char *name, bool create_it)
+{
+	int		i;
+
+	Assert(nReplSlotStats <= max_replication_slots);
+	for (i = 0; i < nReplSlotStats; i++)
+	{
+		if (strcmp(replSlotStats[i].slotname, name) == 0)
+			return i; /* found */
+	}
+
+	/*
+	 * The slot was not found.  Don't register a new statistics entry if the
+	 * list is already full or the caller didn't request creation.
+	 */
+	if (i == max_replication_slots || !create_it)
+		return -1;
+
+	/* Register new slot */
+	memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
+	memcpy(&replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
+	return nReplSlotStats++;
+}
+
+/* ----------
+ * pgstat_recv_replslot() -
+ *
+ *	Process a REPLSLOT message.
+ * ----------
+ */
+static void
+pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
+{
+	int idx;
+
+	/*
+	 * Get the index of replication slot statistics.  On dropping, we
+	 * don't create the new statistics.
+	 */
+	idx = pgstat_replslot_index(msg->m_slotname, !msg->m_drop);
+
+	/* the statistics entry was not found or the array is already full */
+	if (idx < 0)
+		return;
+
+	Assert(idx >= 0 && idx < max_replication_slots);
+	if (msg->m_drop)
+	{
+		/* Remove the replication slot statistics with the given name */
+		memcpy(&replSlotStats[idx], &replSlotStats[nReplSlotStats - 1],
+			   sizeof(PgStat_ReplSlotStats));
+		nReplSlotStats--;
+		Assert(nReplSlotStats >= 0);
+	}
+	else
+	{
+		/* Update the replication slot statistics */
+		replSlotStats[idx].spill_txns += msg->m_spill_txns;
+		replSlotStats[idx].spill_count += msg->m_spill_count;
+		replSlotStats[idx].spill_bytes += msg->m_spill_bytes;
+		replSlotStats[idx].stream_txns += msg->m_stream_txns;
+		replSlotStats[idx].stream_count += msg->m_stream_count;
+		replSlotStats[idx].stream_bytes += msg->m_stream_bytes;
+	}
+}
+
 /* ----------
  * pgstat_recv_tempfile() -
  *
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 42f284b..9cfc48c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -32,6 +32,7 @@
 #include "access/xlog_internal.h"
 #include "fmgr.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
 #include "replication/origin.h"
@@ -83,6 +84,7 @@ static void stream_truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *t
 									   int nrelations, Relation relations[], ReorderBufferChange *change);
 
 static void LoadOutputPlugin(OutputPluginCallbacks *callbacks, char *plugin);
+static void UpdateSpillStats(LogicalDecodingContext *ctx);
 
 /*
  * Make sure the current settings & environment are capable of doing logical
@@ -740,6 +742,11 @@ begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
+
+	/*
+	 * Update statistics about transactions that spilled to disk.
+	 */
+	UpdateSpillStats(ctx);
 }
 
 static void
@@ -1452,3 +1459,28 @@ ResetLogicalStreamingState(void)
 	CheckXidAlive = InvalidTransactionId;
 	bsysscan = false;
 }
+
+static void
+UpdateSpillStats(LogicalDecodingContext *ctx)
+{
+   ReorderBuffer *rb = ctx->reorder;
+
+   elog(DEBUG2, "UpdateSpillStats: updating stats %p %lld %lld %lld %lld %lld %lld",
+        rb,
+        (long long) rb->spillTxns,
+        (long long) rb->spillCount,
+        (long long) rb->spillBytes,
+        (long long) rb->streamTxns,
+        (long long) rb->streamCount,
+        (long long) rb->streamBytes);
+
+   pgstat_report_replslot(NameStr(ctx->slot->data.name),
+                          rb->spillTxns, rb->spillCount, rb->spillBytes,
+                          rb->streamTxns, rb->streamCount, rb->streamBytes);
+   rb->spillTxns = 0;
+   rb->spillCount = 0;
+   rb->spillBytes = 0;
+   rb->streamTxns = 0;
+   rb->streamCount = 0;
+   rb->streamBytes = 0;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c469536..ac4422b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -344,6 +344,13 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	buffer->spillCount = 0;
+	buffer->spillTxns = 0;
+	buffer->spillBytes = 0;
+	buffer->streamCount = 0;
+	buffer->streamTxns = 0;
+	buffer->streamBytes = 0;
+
 	buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
 
 	dlist_init(&buffer->toplevel_by_lsn);
@@ -3098,6 +3105,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	int			fd = -1;
 	XLogSegNo	curOpenSegNo = 0;
 	Size		spilled = 0;
+	Size		size = txn->size;
 
 	elog(DEBUG2, "spill %u changes in XID %u to disk",
 		 (uint32) txn->nentries_mem, txn->xid);
@@ -3156,6 +3164,13 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		spilled++;
 	}
 
+	/* update the statistics */
+	rb->spillCount += 1;
+	rb->spillBytes += size;
+
+	/* Don't consider already serialized transactions. */
+	rb->spillTxns += rbtxn_is_serialized(txn) ? 0 : 1;
+
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
@@ -3484,10 +3499,18 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->snapshot_now = NULL;
 	}
 
+
+	rb->streamCount += 1;
+	rb->streamBytes += txn->total_size;
+
+	/* Don't consider already streamed transactions. */
+	rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
+
 	/* Process and send the changes to output plugin. */
 	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
 							command_id, true);
 
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 57bbb62..ba8a013 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -322,6 +322,9 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 
 	/* Let everybody know we've modified this slot */
 	ConditionVariableBroadcast(&slot->active_cv);
+
+	/* Create statistics entry for the new slot */
+	pgstat_report_replslot(NameStr(slot->data.name), 0, 0, 0, 0, 0, 0);
 }
 
 /*
@@ -683,6 +686,17 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
 				(errmsg("could not remove directory \"%s\"", tmppath)));
 
 	/*
+	 * Report the drop of the replication slot to the stats collector.  Since
+	 * there is no guarantee of message arrival order on a UDP connection,
+	 * it's possible that a message for creating a new slot arrives before a
+	 * message for removing the old slot.  We send the drop message while
+	 * holding ReplicationSlotAllocationLock to reduce that possibility.
+	 * If the messages arrived in reverse, we would lose one statistics update
+	 * message.
+	 */
+	pgstat_report_replslot_drop(NameStr(slot->data.name));
+
+	/*
 	 * We release this at the very end, so that nobody starts trying to create
 	 * a slot while we're still cleaning up the detritus of the old one.
 	 */
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 95738a4..59fba37 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2033,6 +2033,20 @@ pg_stat_reset_slru(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+/* Reset replication slots counters (a specific one or all of them). */
+Datum
+pg_stat_reset_replication_slot(PG_FUNCTION_ARGS)
+{
+	char	   *target = NULL;
+
+	if (!PG_ARGISNULL(0))
+		target = text_to_cstring(PG_GETARG_TEXT_PP(0));
+
+	pgstat_reset_replslot_counter(target);
+
+	PG_RETURN_VOID();
+}
+
 Datum
 pg_stat_get_archiver(PG_FUNCTION_ARGS)
 {
@@ -2098,3 +2112,66 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_REPLICATION_SLOT_COLS 7
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	PgStat_ReplSlotStats *stats;
+	int			nstats;
+	int			i;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	stats = pgstat_fetch_replslot(&nstats);
+	for (i = 0; i < nstats; i++)
+	{
+		Datum	values[PG_STAT_GET_REPLICATION_SLOT_COLS];
+		bool	nulls[PG_STAT_GET_REPLICATION_SLOT_COLS];
+		PgStat_ReplSlotStats stat = stats[i];
+
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = PointerGetDatum(cstring_to_text(stat.slotname));
+		values[1] = Int64GetDatum(stat.spill_txns);
+		values[2] = Int64GetDatum(stat.spill_count);
+		values[3] = Int64GetDatum(stat.spill_bytes);
+		values[4] = Int64GetDatum(stat.stream_txns);
+		values[5] = Int64GetDatum(stat.stream_count);
+		values[6] = Int64GetDatum(stat.stream_bytes);
+
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 082a11f..2a316b1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5251,6 +5251,14 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
   proargnames => '{pid,status,receive_start_lsn,receive_start_tli,written_lsn,flushed_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,sender_host,sender_port,conninfo}',
   prosrc => 'pg_stat_get_wal_receiver' },
+{ oid => '8595', descr => 'statistics: information about replication slots',
+  proname => 'pg_stat_get_replication_slots', prorows => '10', proisstrict => 'f',
+  proretset => 't', provolatile => 's', proparallel => 'r',
+  prorettype => 'record', proargtypes => '',
+  proallargtypes => '{text,int8,int8,int8,int8,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{name,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}',
+  prosrc => 'pg_stat_get_replication_slots' },
 { oid => '6118', descr => 'statistics: information about subscription',
   proname => 'pg_stat_get_subscription', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => 'oid',
@@ -5592,6 +5600,10 @@
   descr => 'statistics: reset collected statistics for a single SLRU',
   proname => 'pg_stat_reset_slru', proisstrict => 'f', provolatile => 'v',
   prorettype => 'void', proargtypes => 'text', prosrc => 'pg_stat_reset_slru' },
+{ oid => '8596',
+  descr => 'statistics: reset collected statistics for a single replication slot',
+  proname => 'pg_stat_reset_replication_slot', proisstrict => 'f', provolatile => 'v',
+  prorettype => 'void', proargtypes => 'text', prosrc => 'pg_stat_reset_replication_slot' },
 
 { oid => '3163', descr => 'current trigger depth',
   proname => 'pg_trigger_depth', provolatile => 's', proparallel => 'r',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac4..ce67740 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -56,6 +56,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_RESETSHAREDCOUNTER,
 	PGSTAT_MTYPE_RESETSINGLECOUNTER,
 	PGSTAT_MTYPE_RESETSLRUCOUNTER,
+	PGSTAT_MTYPE_RESETREPLSLOTCOUNTER,
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
@@ -67,7 +68,8 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
 	PGSTAT_MTYPE_TEMPFILE,
 	PGSTAT_MTYPE_DEADLOCK,
-	PGSTAT_MTYPE_CHECKSUMFAILURE
+	PGSTAT_MTYPE_CHECKSUMFAILURE,
+	PGSTAT_MTYPE_REPLSLOT,
 } StatMsgType;
 
 /* ----------
@@ -357,6 +359,18 @@ typedef struct PgStat_MsgResetslrucounter
 } PgStat_MsgResetslrucounter;
 
 /* ----------
+ * PgStat_MsgResetreplslotcounter Sent by the backend to tell the collector
+ *								to reset replication slot counter(s)
+ * ----------
+ */
+typedef struct PgStat_MsgResetreplslotcounter
+{
+	PgStat_MsgHdr m_hdr;
+	char		m_slotname[NAMEDATALEN];
+	bool		clearall;
+} PgStat_MsgResetreplslotcounter;
+
+/* ----------
  * PgStat_MsgAutovacStart		Sent by the autovacuum daemon to signal
  *								that a database is going to be processed
  * ----------
@@ -454,6 +468,25 @@ typedef struct PgStat_MsgSLRU
 } PgStat_MsgSLRU;
 
 /* ----------
+ * PgStat_MsgReplSlot	Sent by a backend or a wal sender to update replication
+ *						slot statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgReplSlot
+{
+	PgStat_MsgHdr	m_hdr;
+	char			m_slotname[NAMEDATALEN];
+	bool			m_drop;
+	PgStat_Counter	m_spill_txns;
+	PgStat_Counter	m_spill_count;
+	PgStat_Counter	m_spill_bytes;
+	PgStat_Counter	m_stream_txns;
+	PgStat_Counter	m_stream_count;
+	PgStat_Counter	m_stream_bytes;
+} PgStat_MsgReplSlot;
+
+
+/* ----------
  * PgStat_MsgRecoveryConflict	Sent by the backend upon recovery conflict
  * ----------
  */
@@ -591,6 +624,7 @@ typedef union PgStat_Msg
 	PgStat_MsgResetsharedcounter msg_resetsharedcounter;
 	PgStat_MsgResetsinglecounter msg_resetsinglecounter;
 	PgStat_MsgResetslrucounter msg_resetslrucounter;
+	PgStat_MsgResetreplslotcounter msg_resetreplslotcounter;
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
@@ -603,6 +637,7 @@ typedef union PgStat_Msg
 	PgStat_MsgDeadlock msg_deadlock;
 	PgStat_MsgTempFile msg_tempfile;
 	PgStat_MsgChecksumFailure msg_checksumfailure;
+	PgStat_MsgReplSlot msg_replslot;
 } PgStat_Msg;
 
 
@@ -760,6 +795,20 @@ typedef struct PgStat_SLRUStats
 	TimestampTz stat_reset_timestamp;
 } PgStat_SLRUStats;
 
+/*
+ * Replication slot statistics kept in the stats collector
+ */
+typedef struct PgStat_ReplSlotStats
+{
+	char			slotname[NAMEDATALEN];
+	PgStat_Counter	spill_txns;
+	PgStat_Counter	spill_count;
+	PgStat_Counter	spill_bytes;
+	PgStat_Counter  stream_txns;
+	PgStat_Counter  stream_count;
+	PgStat_Counter  stream_bytes;
+	TimestampTz		stat_reset_timestamp;
+} PgStat_ReplSlotStats;
 
 /* ----------
  * Backend states
@@ -1303,6 +1352,7 @@ extern void pgstat_reset_counters(void);
 extern void pgstat_reset_shared_counters(const char *);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
+extern void pgstat_reset_replslot_counter(const char *name);
 
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -1315,6 +1365,9 @@ extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
 extern void pgstat_report_checksum_failure(void);
+extern void pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
+								   int spillbytes, int streamtxns, int streamcount, int streambytes);
+extern void pgstat_report_replslot_drop(const char *slotname);
 
 extern void pgstat_initialize(void);
 extern void pgstat_bestart(void);
@@ -1479,6 +1532,7 @@ extern int	pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1ae17d5..edc51b1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -525,6 +525,20 @@ struct ReorderBuffer
 
 	/* memory accounting */
 	Size		size;
+
+	/*
+	 * Statistics about transactions spilled to disk or streamed.
+	 *
+	 * A single transaction may be spilled repeatedly, which is why we keep
+	 * two different counters. For spilling, the transaction counter includes
+	 * both toplevel transactions and subtransactions.
+	 */
+	int64		spillCount;		/* spill-to-disk invocation counter */
+	int64		spillTxns;		/* number of transactions spilled to disk */
+	int64		spillBytes;		/* amount of data spilled to disk */
+	int64		streamCount;	/* streaming invocation counter */
+	int64		streamTxns;		/* number of transactions streamed to subscriber */
+	int64		streamBytes;	/* amount of data streamed to subscriber */
 };
 
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 601734a..197a86c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2008,6 +2008,14 @@ pg_stat_replication| SELECT s.pid,
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
      JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
+pg_stat_replication_slots| SELECT s.name,
+    s.spill_txns,
+    s.spill_count,
+    s.spill_bytes,
+    s.stream_txns,
+    s.stream_count,
+    s.stream_bytes
+   FROM pg_stat_get_replication_slots() s(name, spill_txns, spill_count, spill_bytes, stream_txns, stream_count, stream_bytes);
 pg_stat_slru| SELECT s.name,
     s.blks_zeroed,
     s.blks_hit,
#462Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#458)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Jul 29, 2020 at 10:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thanks, please find the rebased patch set.

Few comments on v44-0001-Implement-streaming-mode-in-ReorderBuffer:
============================================================
1.
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM
generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
generate_series(1, 20) g(i);
+COMMIT;

Is the above comment true? It seems to me that the INSERT is getting
streamed in the main transaction.

2.
+<programlisting>
+postgres[33712]=#* SELECT * FROM
pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes',
'1');
+    lsn    | xid |                       data
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+

Is the above example correct? We should include the XID in the stream
messages only when the include_xids option is specified.

3.
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes so if we have a partial change like toast
+ * table insert or speculative then we mark such a 'txn' so that it can't be
+ * streamed.

/speculative then/speculative insert then

4. I think we can explain the problems (like we can see the wrong
tuple or see two versions of the same tuple or whatever else wrong can
happen, if possible with some example) related to concurrent aborts
somewhere in comments.
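
For illustration, a hypothetical two-session sketch of the window where this
can bite (slot and table names reused from the stream test; timing contrived):

-- Session 1: a large in-progress transaction with a catalog change
BEGIN;
ALTER TABLE stream_test ADD COLUMN extra text;  -- catalog update
INSERT INTO stream_test SELECT repeat('a', 2000) || g.i
  FROM generate_series(1, 1000) g(i);  -- large enough to exceed logical_decoding_work_mem

-- Session 2: streams the in-progress transaction; decoding these INSERTs
-- must look up the catalog row added by the uncommitted ALTER TABLE
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'stream-changes', '1');

-- Session 1: aborting now, while session 2 is still decoding, is the
-- concurrent-abort case that CheckXidAlive is meant to detect
ROLLBACK;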

--
With Regards,
Amit Kapila.

#463Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#462)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 29, 2020 at 10:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thanks, please find the rebased patch set.

Few comments on v44-0001-Implement-streaming-mode-in-ReorderBuffer:
============================================================
1.
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM
generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
generate_series(1, 20) g(i);
+COMMIT;

Is the above comment true? Because it seems to me that Insert is
getting streamed in the main transaction.

Changed the comments.

2.
+<programlisting>
+postgres[33712]=#* SELECT * FROM
pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes',
'1');
+    lsn    | xid |                       data
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+

Is the above example correct? Because we should include XID in the
stream message only when include_xids option is specified.

include_xids defaults to true unless it is explicitly set to false.
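
For example, the stream test added by the patch disables it explicitly:

SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'stream-changes', '1');

With 'include-xids' set to '0' the output reads "streaming change for
transaction" with no xid appended; the documentation example keeps the
default of true, hence the "TXN 503" suffix.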

3.
/*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes so if we have a partial change like toast
+ * table insert or speculative then we mark such a 'txn' so that it can't be
+ * streamed.

/speculative then/speculative insert then

Done

4. I think we can explain the problems (like we can see the wrong
tuple or see two versions of the same tuple or whatever else wrong can
happen, if possible with some example) related to concurrent aborts
somewhere in comments.

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v45.tar (application/x-tar)

v45/v45-0001-Implement-streaming-mode-in-ReorderBuffer.patch:

From 3e7189c5617d43dc0fc36903abfb9cec5537642b Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v45 1/6] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the new stream API methods. However, if we have an
incomplete toast or speculative insert, we spill to disk because we cannot
generate and stream the complete tuple.  As soon as we get the complete
tuple, we stream the transaction including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. The decoding logic on the
receipt of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.

We have ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/stream.out       |  66 ++
 contrib/test_decoding/expected/truncate.out     |   6 +
 contrib/test_decoding/sql/stream.sql            |  21 +
 contrib/test_decoding/sql/truncate.sql          |   1 +
 contrib/test_decoding/test_decoding.c           |  13 +
 doc/src/sgml/logicaldecoding.sgml               |   9 +-
 doc/src/sgml/test-decoding.sgml                 |  22 +
 src/backend/access/heap/heapam.c                |  13 +
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/access/index/genam.c                |  53 ++
 src/backend/access/table/tableam.c              |   8 +
 src/backend/access/transam/xact.c               |  19 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/logical.c       |  10 +
 src/backend/replication/logical/reorderbuffer.c | 980 +++++++++++++++++++++---
 src/include/access/heapam_xlog.h                |   1 +
 src/include/access/tableam.h                    |  55 ++
 src/include/access/xact.h                       |   4 +
 src/include/replication/logical.h               |   1 +
 src/include/replication/reorderbuffer.h         |  56 +-
 21 files changed, 1293 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..2cdd79a
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,66 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+COMMIT;
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ opening a streamed block for transaction
+ streaming message: transactional: 1 prefix: test, sz: 50
+ closing a streamed block for transaction
+ aborting streamed (sub)transaction
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ committing streamed transaction
+(27 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..65e8289
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+COMMIT;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index dbef52a..d8e2b41 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 791a62b..1571d71 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5eef225..0016900 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1299,6 +1299,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at tableam
+	 * level API but this is called from many places so we need to ensure it
+	 * here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1956,6 +1966,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted.  We can't directly use
+ * TransactionIdDidAbort because, after a crash, such a transaction might
+ * not have been marked as aborted.  See detailed comments in xact.c where
+ * the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort()
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 3afb63b..c638319 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -249,6 +249,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..99722ee 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs because we are checking for the
+ * concurrent aborts only in systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f3a1c31..f21f61d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 05d24b9..42f284b 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ce6e621..9d7af4b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes so if we have a partial change like toast
+ * table insert or speculative insert then we mark such a 'txn' so that it
+ * can't be streamed.  We also ensure that if the changes in such a 'txn' are
+ * above logical_decoding_work_mem threshold then we stream them as soon as we
+ * have a complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get the insert or update on the
+	 * main table (both update and insert perform the insert into the toast
+	 * table first).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * Speculative confirm change must be preceded by speculative insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it is serialized before and the changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for doing the streaming of such a transaction as soon as
+	 * we get the complete change for it is that previously it would have
+	 * reached the memory threshold and wouldn't get streamed because of
+	 * incomplete changes.  Delaying such transactions would increase apply
+	 * lag for them.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed once we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes we detected that the transaction
+	 * has aborted, so there is no point in collecting further changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -764,6 +884,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1310,6 +1468,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1502,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally happened inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is, when
+	 * all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1737,189 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid for concurrent abort check
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has a catalog update, we might decode a tuple using the
+ * wrong catalog version.  For example, suppose there is one catalog tuple
+ * with (xmin: 500, xmax: 0).  Now, transaction 501 updates the catalog tuple,
+ * so we will have two tuples: (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0).
+ * Now, if 501 is aborted and some other transaction, say 502, updates the
+ * same catalog tuple, the first tuple will be changed to
+ * (xmin: 500, xmax: 502).  So, the problem is that when we try to decode the
+ * tuple inserted/updated in 501 after the catalog update, we will see the
+ * first catalog tuple instead of the second one, because its deletion by xid
+ * 502 is not visible to our snapshot, so the stale tuple still appears live.
+ *
+ * To detect a concurrent abort, we set CheckXidAlive to the xid of the
+ * current (sub)transaction to which this change belongs.  During a catalog
+ * scan we check the status of that xid, and if it has aborted we report a
+ * specific error so that we can stop streaming the current transaction and
+ * discard the already streamed changes on such an error.  We might have
+ * already streamed some of the changes for the aborted (sub)transaction, but
+ * that is fine because when we decode the abort we will stream an abort
+ * message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid has aborted; that will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse them while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN such that it
+ * can be used to stream the remaining data of transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1942,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1958,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the start stream callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the xid for concurrent abort check. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2064,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2101,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2130,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2163,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2175,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2206,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2252,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2260,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, send the last message for this set of
+		 * changes depending upon streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2306,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2345,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions have to be processed beforehand by
+ * ReorderBufferCommitChild(), even if they were previously assigned to the
+ * toplevel transaction with ReorderBufferAssignChild().
+ *
+ * This interface is called once a toplevel commit is read, for both streamed
+ * and non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
 
-		PG_RE_THROW();
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * XXX Called after everything (origin ID and LSN, ...) is stored in the
+	 * transaction, so we don't pass that directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1931,6 +2471,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could load
+		 * the cache as per the current transaction's view (consider DDLs that
+		 * happened in this transaction). We don't want the decoding of future
+		 * transactions to use those cache entries, so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2556,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2642,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2691,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2713,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2726,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming supported, update the total size in top level as well. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2196,6 +2789,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
 
 	dlist_push_tail(&txn->tuplecids, &change->node);
 	txn->ntuplecids++;
@@ -2388,6 +2982,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming, so it's always 0 for them).
+ * But here we can simply iterate over the limited number of toplevel
+ * transactions.
+ *
+ * Note that we skip transactions that contain incomplete changes.  There is
+ * scope for optimization here: we could select the largest transaction that
+ * has complete changes.  But that would make the code and design quite
+ * complex, and it might not be worth the benefit.  If we plan to stream
+ * transactions that contain incomplete changes, then we need a way to
+ * partially stream/truncate the transaction changes in memory, and a
+ * mechanism to partially truncate the spilled files.  Additionally, whenever
+ * we partially stream a transaction, we need to maintain the last streamed
+ * LSN and next time restore from that segment and offset in WAL.  And as we
+ * stream the changes from the top transaction and restore them
+ * subtransaction-wise, we would even need to remember the subxact from which
+ * we streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +3058,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3162,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3374,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * We can't start streaming immediately even if streaming is enabled,
+	 * because we may have decoded this transaction previously and may now
+	 * just be restarting.
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the XIDs of all subtransactions to the
+	 * snapshot's xip array via SnapBuildCommittedTxn, we can't do that here;
+	 * instead we add them to the subxip array via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in subtransactions decoded so far
+	 * to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through a subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We cannot use txn->snapshot_now directly, because we might have
+		 * got some new sub-transactions after the last streaming run, and
+		 * they need to be added to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
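+	/*
+	 * ReorderBufferProcessTXN with streaming=true ends by truncating the
+	 * transaction, so no changes should remain in memory at this point.
+	 */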
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3604,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4313,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4603,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7ba72c8..387eb34 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5348011..c18554b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert, without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert, without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected a concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

v45/v45-0002-Extend-the-BufFile-interface-for-the-streaming-o.patch

From 4207c6d4c9802911ab5be39b6c97627657e7f4d9 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v45 2/6] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up
to a particular offset.  Extend the BufFileSeek API to support the
SEEK_END case.  Add an option to provide a mode while opening the shared
BufFiles, instead of always opening them in read-only mode.
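
To make the intended usage concrete, here is a rough sketch of how a single
backend could drive the extended interface (illustration only: the fileset
name, offsets, and the spool_example function are made up, not patch code):

	#include "postgres.h"

	#include <fcntl.h>

	#include "storage/buffile.h"
	#include "storage/sharedfileset.h"

	static SharedFileSet fileset;	/* must outlive the files it names */

	static void
	spool_example(void)
	{
		BufFile    *f;

		SharedFileSetInit(&fileset, NULL);	/* NULL seg: cleanup at proc exit */
		f = BufFileCreateShared(&fileset, "16384-513-changes");
		/* ... BufFileWrite() the serialized changes ... */
		BufFileClose(f);

		/* later, possibly in another transaction: reopen read-write, append */
		f = BufFileOpenShared(&fileset, "16384-513-changes", O_RDWR);
		BufFileSeek(f, 0, 0, SEEK_END);
		/* ... write more changes ... */

		/* on subxact abort, discard everything past (fileno 0, offset 1024) */
		BufFileTruncateShared(f, 0, 1024);
		BufFileClose(f);
	}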
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 15f92b6..3804412 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 2d7a082..a9ca5d9 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile supports temporary files that can be used by a single backend when
+ * the corresponding files need to survive across transactions and need to be
+ * opened and closed multiple times.  Such files need to be created as a
+ * member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The size of the last file gives us the end offset within
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over the files from the last one down to the given fileno. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * We can directly delete the files beyond the fileno.  If the offset
+		 * is 0 then we can delete the fileno file as well, unless it is the
+		 * first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
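+/* Shared filesets registered for cleanup on proc exit (non-DSM usage). */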
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but need to be opened and closed multiple times, and the
+ * underlying files need to survive across transactions.  For such cases, the
+ * dsm segment 'seg' should be passed as NULL.  We remove such files on proc
+ * exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering the
+			 * fileset cleanup.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on the process exit.  This will
+ * process the list of all the registered sharedfilesets and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is using the dsm-based cleanup, we don't maintain the
+	 * filesetlist, so there is nothing to unregister.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

v45/v45-0003-Add-support-for-streaming-to-built-in-replicatio.patch

From ac24187438e68ec702f5952dbc7fd7127e739724 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v45 3/6] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so there is nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 946 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 19 files changed, 2060 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal> and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3804412..bb0f95a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e905723..a6101ac 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..ff25924 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
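
To make the wire format of the four new messages concrete, here is a minimal
standalone sketch of a receiver-side parser. It mirrors the write functions
above, but uses hypothetical read_u32/read_u8 helpers over a raw buffer
instead of the pq_getmsg* routines, and it does not parse the 64-bit commit
fields:

    #include <stdint.h>
    #include <stdio.h>

    /* hypothetical big-endian readers over a raw message buffer */
    static uint32_t read_u32(const unsigned char **p)
    {
        uint32_t v = ((uint32_t) (*p)[0] << 24) | ((uint32_t) (*p)[1] << 16) |
                     ((uint32_t) (*p)[2] << 8) | (uint32_t) (*p)[3];
        *p += 4;
        return v;
    }

    static uint8_t read_u8(const unsigned char **p)
    {
        return *(*p)++;
    }

    /* parse one streaming message; the layout mirrors the writers above */
    static void parse_stream_message(const unsigned char *p)
    {
        uint8_t action = read_u8(&p);

        if (action == 'S')          /* STREAM START: xid, first_segment */
        {
            uint32_t xid = read_u32(&p);
            uint8_t  first = read_u8(&p);
            printf("start xid=%u first=%u\n", xid, (unsigned) first);
        }
        else if (action == 'E')     /* STREAM END: no payload */
            printf("stop\n");
        else if (action == 'c')     /* STREAM COMMIT: xid, flags, then
                                     * commit_lsn, end_lsn, commit_time
                                     * (8 bytes each, not parsed here) */
        {
            uint32_t xid = read_u32(&p);
            uint8_t  flags = read_u8(&p);
            printf("commit xid=%u flags=%u\n", xid, (unsigned) flags);
        }
        else if (action == 'A')     /* STREAM ABORT: xid, subxid */
        {
            uint32_t xid = read_u32(&p);
            uint32_t subxid = read_u32(&p);
            printf("abort xid=%u subxid=%u\n", xid, subxid);
        }
    }

    int main(void)
    {
        const unsigned char msg[] = {'S', 0, 0, 2, 26, 1};  /* xid 538, first */
        parse_stream_message(msg);
        return 0;
    }
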
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 2fcf2e6..98e7fd0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires handling aborts of both the toplevel transaction and individual
+ * subtransactions. This is achieved by tracking the offset of the first
+ * change of each subtransaction, which is later used to truncate the file
+ * with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * remote transactions with the same XID don't interfere with each other.
+ *
+ * We use BufFiles instead of normal temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides automatic cleanup on error, and (c) it allows the
+ * files to survive across local transactions, so they can be opened and
+ * closed at stream start and stop.  We use the SharedFileSet infrastructure
+ * because without it the files would be deleted as soon as they are closed,
+ * while keeping the stream files open across start/stop would consume a lot
+ * of memory (more than 8kB per file).  Moreover, without SharedFileSet we
+ * would need to invent a new way to pass filenames to the BufFile APIs so
+ * that the desired file can be reopened across multiple stream-open calls
+ * for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, and along with it create the streaming file and store the
+ * fileset handle.  The subxact file is created iff there is any subxact info
+ * under this xid.  The entry is used on subsequent streams for the xid to
+ * look up the corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with the shared
+ * file sets for the streamed changes and subxact files.  On every stream
+ * start we need to open the xid's files, and for that we need the shared
+ * file set handles, so storing them in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
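
The redirection is essentially a byte-level rewrite of the message: the
4-byte subxact XID that follows the action byte is consumed for subxact
tracking, and only a length prefix, the action byte, and the remaining
payload reach the spool file. A minimal sketch of that transformation
(spool_record and its flat buffers are hypothetical, for illustration only):

    #include <stdint.h>
    #include <string.h>

    /*
     * body is "xid (4 bytes) + payload", as received in streaming mode;
     * out receives "len (native int) + action (1 byte) + payload", which
     * matches the on-disk record format used by stream_write_change().
     * The caller must provide an out buffer of at least bodylen + 1 bytes.
     */
    static size_t
    spool_record(char action, const unsigned char *body, size_t bodylen,
                 unsigned char *out)
    {
        int     len = (int) (bodylen - 4) + 1;  /* payload + action byte */
        size_t  n = 0;

        memcpy(out + n, &len, sizeof(len));     /* length prefix */
        n += sizeof(len);
        out[n++] = (unsigned char) action;      /* action code */
        memcpy(out + n, body + 4, bodylen - 4); /* payload, XID stripped */
        n += bodylen - 4;
        return n;
    }
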
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +752,323 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside a remote transaction or inside a
+	 * streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the BufFile,
+	 * used for serializing the streamed data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, read the existing subxact info */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty subtransaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
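
To see the truncation logic in isolation, consider a self-contained sketch
where a plain long offset stands in for the BufFile (fileno, offset) pair;
the SubXact type and abort_subxact helper below are hypothetical:

    #include <stdint.h>
    #include <stdio.h>

    typedef struct
    {
        uint32_t    xid;
        long        offset;     /* offset of the subxact's first change */
    } SubXact;

    /* Find the aborted subxact; drop it and everything recorded after it. */
    static long
    abort_subxact(SubXact *subxacts, unsigned *nsubxacts, uint32_t subxid)
    {
        unsigned    i;

        for (i = *nsubxacts; i > 0; i--)
        {
            if (subxacts[i - 1].xid == subxid)
            {
                long    truncate_at = subxacts[i - 1].offset;

                *nsubxacts = i - 1;     /* discard this and later subxacts */
                return truncate_at;
            }
        }
        return -1;      /* empty subxact: nothing was spooled for it */
    }

    int main(void)
    {
        SubXact     sx[] = {{701, 100}, {702, 250}, {705, 400}};
        unsigned    n = 3;
        long        at = abort_subxact(sx, &n, 702);

        /* aborting 702 truncates the spool at offset 250, leaving only 701 */
        printf("truncate at %ld, %u subxacts left\n", at, n);
        return 0;
    }
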
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1081,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1099,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1138,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1256,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1408,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1781,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1922,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2050,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  The context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2162,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2435,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option has changed. The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2481,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for the toplevel transaction by now */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * The shared fileset needs to survive across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * that memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have created the entry for this xid at stream start */
+	Assert(found);
+
+	/*
+	 * If subxact_fileset is not set, it means we don't have any subxact info
+	 * for this transaction.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need this information for the whole duration of the stream so that we
+	 * can add new subtransaction info to it.  On stream stop we flush the
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
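
The subxact file layout used by these two functions is just a count followed
by the packed array. The sketch below restates it with stdio in place of the
BufFile API (the FILE-based helper is hypothetical; like the real code, it
writes the in-memory structs verbatim, so the file is not portable across
architectures):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>

    typedef struct
    {
        uint32_t    xid;        /* XID of the subxact */
        int         fileno;     /* file number in the buffile */
        off_t       offset;     /* offset in the file */
    } SubXactInfoSketch;

    /* Rewrite the whole subxact file: count first, then the packed array. */
    static int
    write_subxact_file(FILE *fp, const SubXactInfoSketch *subxacts,
                       uint32_t nsubxacts)
    {
        if (fwrite(&nsubxacts, sizeof(nsubxacts), 1, fp) != 1)
            return -1;
        if (nsubxacts > 0 &&
            fwrite(subxacts, sizeof(*subxacts), nsubxacts, fp) != nsubxacts)
            return -1;
        return 0;
    }
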
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're processing the same subxact as in the previous
+	 * call, so skip it cheaply (its first change was already recorded).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as the offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Clean up files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - don't report an error for a missing file if the flag is true.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER | HASH_FIND,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context so that
+	 * they stay open until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * The shared fileset needs to survive across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with a length (not including
+ * the length field itself), an action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX Maybe we should include a CRC32C of the contents here too, but doing
+ * so will not be straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
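
The read side of this record format is what the replay loop in
apply_handle_stream_commit() implements. As a standalone illustration,
assuming the same native-endian int length prefix (read_change and its stdio
usage are hypothetical):

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Read one spooled change record: an int length prefix, then "len"
     * bytes (action byte + payload). Returns a malloc'd record, or NULL
     * at end of file; a short read mid-record is treated as corruption.
     */
    static unsigned char *
    read_change(FILE *fp, int *len)
    {
        unsigned char *buf;

        if (fread(len, sizeof(*len), 1, fp) != 1)
            return NULL;                /* clean end of file */

        buf = malloc(*len);
        if (buf == NULL || fread(buf, 1, *len, fp) != (size_t) *len)
        {
            free(buf);
            return NULL;                /* corrupt/truncated record */
        }
        return buf;                     /* buf[0] is the action code */
    }
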
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2145,6 +3080,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may be different
+ * from the order the transactions are sent in.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this,
+ * we maintain a list of XIDs (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may not be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+/*
+ * Notify downstream that we're starting to stream a chunk of an in-progress
+ * transaction.
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * Notify downstream that we've stopped streaming the current chunk of a
+ * transaction.
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
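
Putting the four callbacks together, the message sequence emitted for one
large transaction (toplevel xid 500, streamed in two chunks) would look
roughly like this, where R/I/U/D are the usual relation/insert/update/delete
messages, each now carrying the XID of the (sub)transaction it belongs to:

    S(xid=500, first_segment=1)  R  I  I  ...  E
    S(xid=500, first_segment=0)  U  D  ...     E
    c(xid=500)                      <- commit, or
    A(xid=500, subxid=500)          <- abort of the toplevel transaction

The origin message, if any, is sent only with the first stream segment.
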
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema was already sent for this streamed transaction.
+ * We expect a relatively small number of streamed transactions, so a simple
+ * list search is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record that we have already sent the schema of the relation in the given
+ * streamed transaction, by adding its xid to the rel sync entry.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache, so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..655144d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction containing DDL, subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rollback to savepoint was reflected on subscriber');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction with the binary option enabled
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v45/v45-0004-Enable-streaming-for-all-subscription-TAP-tests.patch

From 40dc36eda7c95d901d0a532c482fc90c26d5219e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v45 4/6] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b5..21410fa 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

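A side note on verifying the effect of 0004: once a subscription is created
with streaming = on, the setting is visible in the catalog. A minimal sketch,
assuming the substream column this patch series adds to pg_subscription (the
same column the pg_dump patch below reads):

-- run on the subscriber; 't' means streaming of in-progress
-- transactions was requested for this subscription
SELECT subname, substream FROM pg_subscription;

--  subname | substream
-- ---------+-----------
--  tap_sub | t
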
v45/v45-0005-Add-TAP-test-for-streaming-vs.-DDL.patch

From 679779484fcc8ff93c26a27131dd79af6ad4b997 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v45 5/6] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v45/v45-0006-Add-streaming-option-in-pg_dump.patch

From eaa67a5ffc290792e93547395bad72ca90969e90 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v45 6/6] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 94459b3..f69d64c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char       *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1

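To illustrate the effect of 0006, a subscription created with streaming
enabled should round-trip through pg_dump roughly as below. This is a sketch
only: the connection string is illustrative, and pg_dump emits connect =
false for subscriptions so that restoring the dump does not try to create a
replication slot.

-- expected shape of the pg_dump output for a streaming subscription
CREATE SUBSCRIPTION sub1 CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub1
    WITH (connect = false, slot_name = 'sub1', streaming = on);
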
#464Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#463)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Aug 4, 2020 at 12:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 4. I think we can explain the problems (like we can see the wrong
> > tuple or see two versions of the same tuple or whatever else wrong can
> > happen, if possible with some example) related to concurrent aborts
> > somewhere in comments.
>
> Done

I have slightly modified the comment added for the above point and
apart from that added/modified a few comments at other places. I have
also slightly edited the commit message.

@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
 	change->lsn = lsn;
 	change->txn = txn;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+	change->txn = txn;

This change is not required as the same information is assigned a few
lines before. So, I have removed this change as well. Let me know
what you think of the above changes?

Can we add a test for incomplete changes (probably with toast
insertion, but we can do it for the spec_insert case as well) in
ReorderBuffer, such that it needs to first serialize the changes and
then stream them? I have manually verified such scenarios, but it is
good to have a test for the same.
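For illustration, a rough sketch of such a test, modeled on the stream.sql
test added by this patch set, might look like the following. The 64kB
logical_decoding_work_mem used by the TAP tests is assumed, and SET STORAGE
EXTERNAL forces the values out of line so the change stays incomplete until
the full tuple is reassembled; sizes and names are illustrative only.

SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');

CREATE TABLE stream_test(data text);
-- force out-of-line toast storage so a large value is not simply
-- compressed inline
ALTER TABLE stream_test ALTER COLUMN data SET STORAGE EXTERNAL;

-- consume DDL
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');

BEGIN;
-- the toasted rows exceed the memory limit mid-change, so the reorder
-- buffer has to serialize first and can stream only once the complete
-- tuples are available
INSERT INTO stream_test SELECT repeat('a', 6000) || g.i FROM generate_series(1, 25) g(i);
COMMIT;

SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'stream-changes', '1');
SELECT pg_drop_replication_slot('regression_slot');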

--
With Regards,
Amit Kapila.

Attachments:

v46-0001-Implement-streaming-mode-in-ReorderBuffer.patch (application/octet-stream)
From 31b8d09ba98ca9c3b91f3add39b05cb189d93ac2 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v46] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke stream API methods added by commit 45fdc9738b.
However, sometimes if we have an incomplete toast or speculative insert
we spill to disk because we cannot generate the complete tuple and
stream it.  And, as soon as we get the complete tuple we stream the
transaction including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages at each command end. These
features are added by commits 0bead9af48 and c55040ccd0 respectively.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. The decoding logic on the
receipt of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.

We have ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila, Nikhil Sontakke
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/stream.out       |  66 ++
 contrib/test_decoding/expected/truncate.out     |   6 +
 contrib/test_decoding/sql/stream.sql            |  21 +
 contrib/test_decoding/sql/truncate.sql          |   1 +
 contrib/test_decoding/test_decoding.c           |  13 +
 doc/src/sgml/logicaldecoding.sgml               |   9 +-
 doc/src/sgml/test-decoding.sgml                 |  22 +
 src/backend/access/heap/heapam.c                |  13 +
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/access/index/genam.c                |  53 ++
 src/backend/access/table/tableam.c              |   8 +
 src/backend/access/transam/xact.c               |  19 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/logical.c       |  10 +
 src/backend/replication/logical/reorderbuffer.c | 981 +++++++++++++++++++++---
 src/include/access/heapam_xlog.h                |   1 +
 src/include/access/tableam.h                    |  55 ++
 src/include/access/xact.h                       |   4 +
 src/include/replication/logical.h               |   1 +
 src/include/replication/reorderbuffer.h         |  56 +-
 21 files changed, 1294 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..2cdd79a
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,66 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+COMMIT;
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ opening a streamed block for transaction
+ streaming message: transactional: 1 prefix: test, sz: 50
+ closing a streamed block for transaction
+ aborting streamed (sub)transaction
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ committing streamed transaction
+(27 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..65e8289
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,21 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+COMMIT;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index dbef52a..d8e2b41 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
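+	/*
+	 * Enable streaming only if the context supports it and the
+	 * stream-changes option was set by the user.
+	 */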
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 791a62b..1571d71 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
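+     For example, a minimal sketch of such a lookup via the
+     <literal>systable_*</literal> APIs (the relation name used as the search
+     key here is purely illustrative) might look like:
+<programlisting>
+Relation    rel = table_open(RelationRelationId, AccessShareLock);
+ScanKeyData key;
+SysScanDesc scan;
+HeapTuple   tup;
+
+ScanKeyInit(&amp;key, Anum_pg_class_relname, BTEqualStrategyNumber,
+            F_NAMEEQ, CStringGetDatum("some_table"));
+scan = systable_beginscan(rel, ClassNameNspIndexId, true, NULL, 1, &amp;key);
+while (HeapTupleIsValid(tup = systable_getnext(scan)))
+{
+    /* inspect the pg_class tuple here */
+}
+systable_endscan(scan);
+table_close(rel, AccessShareLock);
+</programlisting>
+     The <literal>systable_*</literal> functions additionally check for a
+     concurrent abort of the transaction being decoded, and error out if
+     needed.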
     </para>
    </sect2>
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5eef225..0016900 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1299,6 +1299,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at the
+	 * tableam level API, but this function is called from many places, so we
+	 * need to ensure the check here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1956,6 +1966,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
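+			/*
+			 * Mark inserts into TOAST relations in the WAL record, so that
+			 * logical decoding can recognize them as partial changes.
+			 */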
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set, set a flag to indicate that a system table
+	 * scan is in progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive is aborted. We can't directly use
+ * TransactionIdDidAbort, because after a crash such a transaction might not
+ * have been marked as aborted.  See detailed comments in xact.c where the
+ * variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 3afb63b..c638319 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -249,6 +249,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..99722ee 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To detect that,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for concurrent
+ * aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f3a1c31..f21f61d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
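+	/*
+	 * Tell the reorder buffer whether this insert targets a TOAST relation,
+	 * i.e. whether it is only part of a larger logical change.
+	 */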
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 05d24b9..42f284b 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ce6e621..b77000a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
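+/*
+ * Macros to classify the type of a change action, used for tracking
+ * partial changes while streaming in-progress transactions.
+ */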
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.
+ * We can stream only complete changes, so if we have a partial change such
+ * as a toast table insert or a speculative insert, we mark such a 'txn' so
+ * that it can't be streamed.  We also ensure that if the changes in such a
+ * 'txn' exceed the logical_decoding_work_mem threshold, we stream them as
+ * soon as we have a complete change.
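+ *
+ * For example, an INSERT into a table with TOASTed columns is decoded as a
+ * series of toast-table inserts followed by the insert on the main table;
+ * only once the main-table insert arrives is the change complete.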
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get the insert or update on the
+	 * main table (both update and insert will insert into the toast table).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * Speculative confirm change must be preceded by speculative insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it is serialized before and the changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for doing the streaming of such a transaction as soon as
+	 * we get the complete change for it is that previously it would have
+	 * reached the memory threshold and wouldn't get streamed because of
+	 * incomplete changes.  Delaying such transactions would increase apply
+	 * lag for them.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed once we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * If we have detected (while streaming previous changes) that the
+	 * transaction was aborted, there is no point in collecting further
+	 * changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -764,6 +884,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1310,6 +1468,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1502,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is,
+	 * when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1737,191 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid to detect concurrent aborts.
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has made catalog updates, we might decode a tuple using the
+ * wrong catalog version.  For example, suppose there is one catalog tuple with
+ * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
+ * and after that we will have two tuples (xmin: 500, xmax: 501) and
+ * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
+ * say 502 updates the same catalog tuple then the first tuple will be changed
+ * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
+ * the tuple inserted/updated in 501 after the catalog update, we will see the
+ * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
+ * consider that the tuple is deleted by xid 502 which is not visible to our
+ * snapshot.  And when we will try to decode with that catalog tuple, it can
+ * lead to a wrong result or a crash.  So, it is necessary to detect
+ * concurrent aborts to allow streaming of in-progress transactions.
+ *
+ * For detecting a concurrent abort we set CheckXidAlive to the xid of the
+ * current (sub)transaction to which this change belongs.  And, during
+ * catalog scan we can check the status of the xid and if it is aborted we will
+ * report a specific error so that we can stop streaming current transaction
+ * and discard the already streamed changes on such an error.  We might have
+ * already streamed some of the changes for the aborted (sub)transaction, but
+ * that is fine because when we decode the abort we will stream the abort
+ * message to truncate the changes on the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid is aborted; that will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse the same while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction. This resets the TXN such that it
+ * can be used to stream the remaining data of transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1944,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1960,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream_start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the current xid to detect concurrent aborts. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2066,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2103,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2132,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2165,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2177,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2208,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2254,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2262,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, send the last message for this set of
+		 * changes depending upon streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2308,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2347,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
 
-		PG_RE_THROW();
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * Called after everything (origin ID, LSN, ...) is stored in the
+	 * transaction to avoid passing that information directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1931,6 +2473,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could load
+		 * the cache as per the current transaction's view (consider DDLs that
+		 * happened in this transaction). We don't want the decoding of future
+		 * transactions to use those cache entries so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2558,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2644,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2693,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2715,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2728,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming supported, update the total size in top level as well. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2388,6 +2983,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because we don't update the memory
+ * accounting for subtransactions when streaming, so it's always 0), but here
+ * we simply iterate over the limited number of toplevel transactions.
+ *
+ * Note that we skip transactions that contain incomplete changes.  There is
+ * scope for optimization here, in that we could select the largest transaction
+ * that has only complete changes, but that would make the code and design
+ * quite complex and might not be worth the benefit.  If we plan to stream
+ * transactions that contain incomplete changes, we need to find a way to
+ * partially stream/truncate the transaction changes in memory, and build a
+ * mechanism to partially truncate the spilled files.  Additionally, whenever
+ * we partially stream a transaction we need to remember the last streamed
+ * lsn, and next time restore from that segment and offset in the WAL.  As we
+ * stream the changes from the top transaction and restore them
+ * subtransaction-wise, we would even need to remember the subxact from which
+ * we streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +3059,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3163,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3375,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * We can't start streaming immediately even if streaming is enabled,
+	 * because we might have previously decoded this transaction and now be
+	 * just restarting (in which case these records must be skipped).
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit which adds xids of all the subtransactions in
+	 * snapshot's xip array via SnapBuildCommittedTxn, we can't do that here
+	 * but we do add them to subxip array instead via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in subtransactions decoded till
+	 * now to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must already have been streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there's no need to
+		 * walk through the subxacts again). In fact, we must not do that, as
+		 * we may be using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We can't use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3605,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4314,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4604,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7ba72c8..387eb34 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
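Each wrapper above repeats the same guard. Distilled, the pattern is the
following (a hypothetical helper written only for illustration; the patch
inlines the check in each wrapper rather than introducing such a helper):

    static inline void
    CheckNoDirectTableAccess(const char *callername)
    {
        /*
         * CheckXidAlive is only set while decoding an in-progress
         * transaction, and bsysscan is only set inside systable_* scans,
         * which are the only scans that know how to handle a concurrent
         * abort of that transaction.
         */
        if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
            elog(ERROR, "unexpected %s call during logical decoding",
                 callername);
    }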
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5348011..c18554b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes include a toast insert without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes include a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to the downstream?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

#465Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#464)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Aug 4, 2020 at 12:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

4. I think we can explain the problems (like we can see the wrong
tuple or see two versions of the same tuple or whatever else wrong can
happen, if possible with some example) related to concurrent aborts
somewhere in comments.

Done

I have slightly modified the comment added for the above point and
apart from that added/modified a few comments at other places. I have
also slightly edited the commit message.

@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb,
TransactionId xid,
change->lsn = lsn;
change->txn = txn;
change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+ change->txn = txn;

This change is not required as the same information is assigned a few
lines before, so I have removed this change as well. Let me know
what you think of the above changes.

Changes look fine to me.

Can we add a test for incomplete changes (probably with toast
insertion but we can do it for spec_insert case as well) in
ReorderBuffer such that it needs to first serialize the changes and
then stream it? I have manually verified such scenarios but it is
good to have the test for the same.

I have added a new test for the same in the stream.sql file.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v46.tar  application/x-tar; name=v46.tar  Download

v46/v46-0001-Implement-streaming-mode-in-ReorderBuffer.patch:

From 6629cd7cf30f0b4b050b2077a1714bb913c93ca5 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v46 1/6] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke the stream API methods added by commit 45fdc9738b.
However, if we have an incomplete toast or speculative insert, we spill to
disk because we cannot generate the complete tuple to stream.  As soon as
we get the complete tuple, we stream the transaction including the
serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxacts with toplevel xacts) in WAL right away, and thanks
to logging the invalidation messages at each command end. These features
were added by commits 0bead9af48 and c55040ccd0 respectively.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from the system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. On receipt of such an
sqlerrcode, the decoding logic aborts the ongoing decoding and returns
gracefully.

Each ReorderBufferChange carries a ReorderBufferTXN pointer, by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila, Nikhil Sontakke
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/stream.out       |  90 +++
 contrib/test_decoding/expected/truncate.out     |   6 +
 contrib/test_decoding/sql/stream.sql            |  27 +
 contrib/test_decoding/sql/truncate.sql          |   1 +
 contrib/test_decoding/test_decoding.c           |  13 +
 doc/src/sgml/logicaldecoding.sgml               |   9 +-
 doc/src/sgml/test-decoding.sgml                 |  22 +
 src/backend/access/heap/heapam.c                |  13 +
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/access/index/genam.c                |  53 ++
 src/backend/access/table/tableam.c              |   8 +
 src/backend/access/transam/xact.c               |  19 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/logical.c       |  10 +
 src/backend/replication/logical/reorderbuffer.c | 981 +++++++++++++++++++++---
 src/include/access/heapam_xlog.h                |   1 +
 src/include/access/tableam.h                    |  55 ++
 src/include/access/xact.h                       |   4 +
 src/include/replication/logical.h               |   1 +
 src/include/replication/reorderbuffer.h         |  56 +-
 21 files changed, 1324 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..7f818df
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,90 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+COMMIT;
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ opening a streamed block for transaction
+ streaming message: transactional: 1 prefix: test, sz: 50
+ closing a streamed block for transaction
+ aborting streamed (sub)transaction
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ committing streamed transaction
+(27 rows)
+
+-- streaming toast changes
+ALTER TABLE stream_test ALTER COLUMN data set storage external;
+INSERT INTO stream_test SELECT repeat('a', 6000) || g.i FROM generate_series(1, 10) g(i);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+                   data                   
+------------------------------------------
+ BEGIN
+ COMMIT
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ committing streamed transaction
+(15 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..e71386e
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,27 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+COMMIT;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+-- streaming toast changes
+ALTER TABLE stream_test ALTER COLUMN data set storage external;
+INSERT INTO stream_test SELECT repeat('a', 6000) || g.i FROM generate_series(1, 10) g(i);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index dbef52a..d8e2b41 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool		enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 791a62b..1571d71 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
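To make the documented rule concrete, compliant catalog access from an
output plugin might look like the sketch below (my_catalog_oid is a
hypothetical OID used only for illustration; the point is that
systable_getnext() internally checks for a concurrent abort of the
transaction being decoded):

    Relation    rel = table_open(my_catalog_oid, AccessShareLock);
    SysScanDesc scan = systable_beginscan(rel, InvalidOid, false,
                                          NULL, 0, NULL);
    HeapTuple   tup;

    while ((tup = systable_getnext(scan)) != NULL)
    {
        /* inspect the tuple; an ERRCODE_TRANSACTION_ROLLBACK error may
         * be raised here if the decoded transaction aborted concurrently */
    }

    systable_endscan(scan);
    table_close(rel, AccessShareLock);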
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5eef225..0016900 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1299,6 +1299,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at the
+	 * tableam level API, but this function is called from many places, so we
+	 * need to ensure the check here as well.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1956,6 +1966,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
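In short, the two hunks above give an unresolved combocid the following
interpretation (a summary for illustration, not patch code):

    /*
     * unresolved cmin => the creating command hasn't been decoded yet,
     *                    so treat the tuple as not yet visible;
     * unresolved cmax => the deleting command hasn't been decoded yet,
     *                    so treat the tuple as still visible.
     */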
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..9d9a70a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle a concurrent abort of CheckXidAlive.
+ *
+ * Error out if CheckXidAlive has aborted. We can't directly use
+ * TransactionIdDidAbort, as after a crash such a transaction might not have
+ * been marked as aborted.  See detailed comments in xact.c where the
+ * variable is declared.
+ */
+static inline void
+HandleConcurrentAbort(void)
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
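On the decoding side, the error raised by HandleConcurrentAbort() is caught
and turned into a graceful stop. A sketch of that handling, based on the
commit message (the real code lives in the PG_CATCH block of
ReorderBufferProcessTXN later in this patch; details may differ):

    PG_CATCH();
    {
        ErrorData  *errdata = CopyErrorData();

        if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
        {
            /* the transaction being streamed was aborted concurrently */
            txn->concurrent_abort = true;   /* ignore its future changes */
            FlushErrorState();
            FreeErrorData(errdata);
        }
        else
            PG_RE_THROW();
    }
    PG_END_CATCH();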
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 3afb63b..c638319 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -249,6 +249,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..99722ee 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure that,
+ * we check whether CheckXidAlive has aborted after fetching each tuple from
+ * the system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for concurrent
+ * aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
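Putting these pieces together, the rough lifecycle of the two variables
while decoding one in-progress transaction is (an illustrative
pseudo-sequence drawn from the functions in this patch, not literal code):

    SetupCheckXidLive(xid);         /* before decoding the xact's changes */

    scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
                                    /* sets bsysscan = true */
    tup = systable_getnext(scan);   /* calls HandleConcurrentAbort() */
    systable_endscan(scan);         /* clears bsysscan */

    /*
     * On (sub)transaction abort, AbortTransaction() and
     * AbortSubTransaction() call ResetLogicalStreamingState(), which
     * clears both CheckXidAlive and bsysscan.
     */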
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f3a1c31..f21f61d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 05d24b9..42f284b 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ce6e621..b77000a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -178,6 +179,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +252,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +261,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +394,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +446,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +656,101 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes, so if we have a partial change like a
+ * toast table insert or speculative insert, we mark such a 'txn' so that it
+ * can't be streamed.  We also ensure that if the changes in such a 'txn' are
+ * above the logical_decoding_work_mem threshold, we stream them as soon as we
+ * have a complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert, to indicate a
+	 * partial change, and clear it when we get the insert or update on the
+	 * main table (both update and insert will do the insert in the toast
+	 * table).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * A speculative confirm change must be preceded by a speculative
+		 * insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it was serialized before and the changes
+	 * are now complete in the top-level transaction.
+	 *
+	 * The reason for streaming such a transaction as soon as we get the
+	 * complete change for it is that it would previously have reached the
+	 * memory threshold but couldn't be streamed because of the incomplete
+	 * changes.  Delaying such transactions would increase their apply lag.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed once we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * If, while streaming the previous changes, we detected that the
+	 * transaction was aborted, there is no point in collecting further
+	 * changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we have
+		 * not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -764,6 +884,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1310,6 +1468,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1502,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them. Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated with the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is marked as
+	 * streamed always, even if it does not contain any changes (that is, when
+	 * all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1737,191 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid to detect concurrent aborts.
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has a catalog update, then we might decode the tuple using
+ * the wrong catalog version.  For example, suppose there is one catalog tuple with
+ * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
+ * and after that we will have two tuples (xmin: 500, xmax: 501) and
+ * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
+ * say 502 updates the same catalog tuple then the first tuple will be changed
+ * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
+ * the tuple inserted/updated in 501 after the catalog update, we will see the
+ * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
+ * consider that the tuple is deleted by xid 502 which is not visible to our
+ * snapshot.  And when we try to decode with that catalog tuple, it can
+ * lead to a wrong result or a crash.  So, it is necessary to detect
+ * concurrent aborts to allow streaming of in-progress transactions.
+ *
+ * For detecting the concurrent abort we set CheckXidAlive to the xid of the
+ * current (sub)transaction to which this change belongs.  And, during a
+ * catalog scan, we can check the status of that xid and, if it is aborted,
+ * report a specific error so that we can stop streaming the current transaction
+ * and discard the already streamed changes on such an error.  We might have
+ * already streamed some of the changes for the aborted (sub)transaction, but
+ * that is fine because when we decode the abort we will stream abort message
+ * to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as CheckXidAlive, there is
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * Set up CheckXidAlive if the xid is not committed yet.  We don't check
+	 * whether the xid is aborted; that will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
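
For context, a minimal sketch of the consuming side of CheckXidAlive (the
function name below is made up for illustration; in the patch the equivalent
check lives in the catalog-scan code paths, and the error code is the one
ReorderBufferProcessTXN's PG_CATCH block treats as a concurrent abort):

    /* Assumes access/transam.h, access/xact.h and storage/procarray.h. */
    static void
    CheckForConcurrentAbort(void)
    {
        if (TransactionIdIsValid(CheckXidAlive) &&
            !TransactionIdIsInProgress(CheckXidAlive) &&
            !TransactionIdDidCommit(CheckXidAlive))
            ereport(ERROR,
                    (errcode(ERRCODE_TRANSACTION_ROLLBACK),
                     errmsg("transaction aborted during system catalog scan")));
    }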
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying a change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse them when sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streamed transaction. This resets the TXN so that it can
+ * be used to stream the remaining data of the transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed. */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run. */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1944,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1960,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream_start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the current xid to detect concurrent aborts. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2066,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2103,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2132,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2165,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2177,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2208,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2254,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2262,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, send the last message for this set of
+		 * changes depending upon streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2308,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2347,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions previously have to be processed by
+ * ReorderBufferCommitChild(), even if previously assigned to the toplevel
+ * transaction with ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read, for both streamed
+ * and non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
 
-		PG_RE_THROW();
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * Called after everything (origin ID, LSN, ...) is stored in the
+	 * transaction to avoid passing that information directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1931,6 +2473,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could load
+		 * the cache as per the current transaction's view (consider DDLs that
+		 * happened in this transaction). We don't want the decoding of future
+		 * transactions to use those cache entries, so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2558,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2644,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2693,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we additionally track the total size in the
+ * toplevel transaction. We don't track it per subtransaction, as we can't
+ * stream subtransactions individually anyway, and we only pick toplevel
+ * transactions for eviction by streaming. So only toplevel totals matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2715,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2728,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming supported, update the total size in top level as well. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2388,6 +2983,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (since total_size is maintained only
+ * for toplevel transactions, it is always 0 for subtransactions). But here
+ * we can simply iterate over the limited number of toplevel transactions.
+ *
+ * Note that we skip transactions containing incomplete changes. There is
+ * some scope for optimization here: we could select the largest transaction
+ * that has only complete changes, but that would make the code and design
+ * quite complex and might not be worth the benefit.  If we planned to stream
+ * transactions that contain incomplete changes, we would need a way to
+ * partially stream/truncate the transaction changes in memory, and a
+ * mechanism to partially truncate the spilled files.  Additionally, whenever
+ * we partially streamed a transaction, we would need to remember the last
+ * streamed lsn and, next time, restore from that segment and offset in WAL.
+ * And as we stream changes from the top transaction but restore them per
+ * subtransaction, we would even need to remember the subxact from which we
+ * streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +3059,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
 		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * memory by streaming, if possible.  Otherwise, spill to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3163,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3375,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild *builder = ctx->snapshot_builder;
+
+	/*
+	 * Even if streaming is enabled, we can't start streaming right away when
+	 * we have previously decoded this transaction and are now just restarting
+	 * (such records need to be skipped, hence the SnapBuildXactNeedsSkip check).
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make any assumptions about base snapshot here, similar to what
+	 * ReorderBufferCommit() does. That relies on base_snapshot getting
+	 * transferred from subxact in ReorderBufferCommitChild(), but that was
+	 * not yet called as the transaction is in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all the subtransactions to
+	 * the snapshot's xip array via SnapBuildCommittedTxn, we can't do that
+	 * here; we add them to the subxip array instead, via ReorderBufferCopySnap.
+	 * This allows the catalog changes made in subtransactions decoded so far
+	 * to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it didn't make any changes to
+		 * the database till now, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so there is no need to
+		 * walk through subxacts again). In fact, we must not do that, as we
+		 * may be using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We can't use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3605,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4314,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4604,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7ba72c8..387eb34 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5348011..c18554b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/* This transaction's changes have a toast insert without the main table insert. */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes have a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
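
(As an illustration of the intended use, a sketch only; per the comment in
ReorderBufferProcessTXN, the patch itself sets this flag from
ReorderBufferTruncateTXN after a streaming run.  The function name below is
made up.)

    static void
    MarkTXNStreamed(ReorderBufferTXN *txn)
    {
        /* remember that this toplevel xact was streamed at least once */
        if (!rbtxn_is_streamed(txn))
            txn->txn_flags |= RBTXN_IS_STREAMED;
    }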
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected a concurrent abort, ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1

v46/v46-0002-Extend-the-BufFile-interface-for-the-streaming-o.patch

From 90c9890819a897f57995163115f66073c1bd3df1 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v46 2/6] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the files need to survive across transactions and need to be
opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up to
a particular offset.  Extend the BufFileSeek API to support the SEEK_END
case.  Add an option to provide a mode while opening shared BufFiles,
instead of always opening them in read-only mode.
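
As a rough usage sketch of the extended interface (the file name and fileset
setup are made up, and error handling is omitted):

    #include "postgres.h"
    #include <fcntl.h>
    #include "storage/buffile.h"
    #include "storage/sharedfileset.h"

    /*
     * Sketch: reopen a named temporary file read-write, append at its end,
     * then truncate it back to a remembered position.  Assumes the fileset
     * was initialized with SharedFileSetInit(fileset, NULL) so the files
     * survive across transactions.
     */
    static void
    append_then_truncate(SharedFileSet *fileset, off_t saved_offset)
    {
        BufFile    *file;
        char        data[] = "change record";

        /* O_RDWR needs the new 'mode' argument added by this patch */
        file = BufFileOpenShared(fileset, "xid-501-changes", O_RDWR);

        /* SEEK_END support is also new in this patch */
        BufFileSeek(file, 0, 0, SEEK_END);
        BufFileWrite(file, data, sizeof(data));

        /* discard everything past (fileno 0, saved_offset) */
        BufFileTruncateShared(file, 0, saved_offset);

        BufFileClose(file);
    }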
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 15f92b6..3804412 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 2d7a082..a9ca5d9 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the files need to survive across transactions and need to be opened
+ * and closed multiple times.  Such files need to be created as a member of a
+ * SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY);
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * Get the file size of the last file to get the last offset of
+			 * that file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over the files (in reverse) down to the fileno we truncate at. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files after the fileno can be deleted directly.  The fileno file
+		 * itself can also be deleted if the offset is 0, unless it is the
+		 * first file.
+		 */
+		if ((i != fileno || offset == 0) && fileno != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but need to be opened and closed multiple times, and the
+ * underlying files need to survive across transactions.  For such cases,
+ * the dsm segment 'seg' should be passed as NULL.  We remove such files on
+ * proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering
+			 * the fileset cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on process exit.  This walks
+ * the list of all the registered shared filesets and deletes the
+ * underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is using the dsm-based cleanup, we don't maintain the
+	 * filesetlist, so there is nothing to unregister.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

v46/v46-0003-Add-support-for-streaming-to-built-in-replicatio.patch

From 81a5018b36f562726d7d74a754866cef7ba242ed Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v46 3/6] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transaction by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere to
send the data anyway.
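
For reference, a sketch of how the output plugin advertises the new stream
callbacks (the stream_*_cb field names follow the output plugin API as
extended earlier in this series; the body is trimmed to the
streaming-relevant assignments):

    void
    _PG_output_plugin_init(OutputPluginCallbacks *cb)
    {
        cb->startup_cb = pgoutput_startup;
        cb->begin_cb = pgoutput_begin_txn;
        cb->change_cb = pgoutput_change;
        cb->commit_cb = pgoutput_commit_txn;
        cb->shutdown_cb = pgoutput_shutdown;

        /* streaming of in-progress transactions */
        cb->stream_start_cb = pgoutput_stream_start;
        cb->stream_stop_cb = pgoutput_stream_stop;
        cb->stream_abort_cb = pgoutput_stream_abort;
        cb->stream_commit_cb = pgoutput_stream_commit;
        cb->stream_change_cb = pgoutput_stream_change;
    }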
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 946 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 19 files changed, 2060 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
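
(Usage note, with hypothetical object names: the option would be enabled
with CREATE SUBSCRIPTION mysub CONNECTION '...' PUBLICATION mypub
WITH (streaming = on), and can later be toggled via ALTER SUBSCRIPTION
mysub SET (streaming = off), per the alter_subscription.sgml change above.)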
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3804412..bb0f95a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
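
The four new I/O wait events are presumably reported around the BufFile
reads and writes this series adds for the streamed-changes and subxact
files; a minimal sketch of the usage pattern (hypothetical call site, not
code from this patch):

    pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
    BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
    pgstat_report_wait_end();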
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e905723..a6101ac 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
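
(Usage note: with this change the option list sent by the walreceiver's
START_REPLICATION command becomes something like (proto_version '2',
streaming 'on', publication_names '"mypub"'). The streaming option is
appended only when the subscription requests it and PQserverVersion()
reports 140000 or newer, so older publishers never see an option they
would not recognize.)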
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..ff25924 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
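
As a quick illustration of the new read/write symmetry, a hypothetical
test fragment (not part of the patch; it assumes the functions above and
the usual pqformat helpers):

    StringInfoData buf;
    TransactionId xid;
    bool        first;

    initStringInfo(&buf);
    logicalrep_write_stream_start(&buf, 1234, true);

    /* the receiver consumes the action byte first, as apply_dispatch does */
    if (pq_getmsgbyte(&buf) != 'S')
        elog(ERROR, "unexpected action");

    xid = logicalrep_read_stream_start(&buf, &first);
    Assert(xid == 1234 && first);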
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 2fcf2e6..98e7fd0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires dealing with aborts of both the toplevel transaction and its
+ * subtransactions. This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides automatic cleanup on error, and (c) it allows the
+ * files to survive across local transactions, so they can be opened and
+ * closed at each stream start and stop.  We build on the SharedFileSet
+ * infrastructure, because without it the files would be deleted as soon as
+ * they are closed, and keeping the stream files open across start/stop
+ * would consume a lot of memory (more than 8kB per file).  Moreover,
+ * without SharedFileSet we would have to invent a new way to pass filenames
+ * to the BufFile APIs, so that the same file could be reopened across
+ * multiple stream-open calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, and along with it create the streaming file and store the
+ * fileset handle.  The subxact file is created iff there is any subxact info
+ * under this xid.  This entry is used on subsequent streams for the xid to
+ * look up the corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with the shared
+ * file sets for the stream and subxact files.  On every stream start we need
+ * to open the xid's files, and for that we need the shared fileset handles,
+ * so storing them in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of apply_handle_stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the change to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +752,323 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside a remote transaction or inside
+	 * a streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be
+	 * committed at stream stop.  We need the transaction for handling the
+	 * BufFile, used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, read the existing subxact info */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty subtransaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1081,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1099,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1138,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1256,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1408,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1781,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1922,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2050,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode
+	 * is enabled.  The context is reset at each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2162,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2435,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option has changed. The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2481,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there is no subtransaction then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it is not already created, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so we allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info. There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need this information for the whole duration of the stream so that we
+	 * can add new subtransaction info to it.  At stream stop we flush the
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen
+	 * in the last call, so ignore it (its first change was recorded then,
+	 * and this change necessarily comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - don't report an error for a missing file if this flag is true.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER | HASH_FIND,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the buffiles under the logical streaming context so that
+	 * they remain available until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so we allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length field itself), action code (identifying the message type) and
+ * message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2145,6 +3080,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
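
To summarize the spool-file format that stream_write_change() produces and
apply_handle_stream_commit() consumes (a descriptive sketch matching the
code above, not new code):

    /* one spooled change, repeated until EOF */
    int32  len;              /* action byte + payload; excludes this field */
    char   action;           /* 'I', 'U', 'D', 'R', 'Y' or 'T' */
    char   payload[];        /* len - 1 bytes: the message body with the
                              * leading subxact XID already stripped */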
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order the transactions are sent in.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each
+ * (sub)transaction so that we don't lose the schema information on abort.
+ * For handling this, we maintain the list of xids (streamed_txns) for which
+ * we have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version,
+		 * and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable the streaming during the slot initialization mode. */
+		ctx->streaming = false;
+	}
 }
 
 /*
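
A condensed view of the streaming negotiation implemented above (a sketch,
not new code; how ctx->streaming gets its initial value is in the 0002
patch, presumably based on whether the stream_*_cb callbacks are all set):

    if (is_init)
        ctx->streaming = false;   /* never stream while creating the slot */
    else if (!enable_streaming)
        ctx->streaming = false;   /* subscriber did not ask for it */
    else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
        ereport(ERROR, ...);      /* protocol too old for streaming */
    else if (!ctx->streaming)
        ereport(ERROR, ...);      /* requested, but plugin can't stream */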
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't
+	 * care if it's a top-level transaction or not (we have already sent
+	 * that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied only later (and the regular
+	 * transactions won't see their effects until then), and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
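+/*
+ * Sends the start of a streamed block of changes for the given toplevel
+ * transaction. The first_segment flag tells the subscriber whether this
+ * is the first streamed block for the transaction.
+ */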
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're now streaming a chunk of a transaction */
+	in_streaming = true;
+}
+
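+/*
+ * Sends the end of the current streamed block of changes.
+ */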
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions, so a
+ * simple linear search of the list suffices.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid to the rel sync entry's list of streamed transactions in
+ * which the schema of the relation has already been sent.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..655144d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
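
For context, the streaming flag added to WalRcvStreamOptions presumably
surfaces as an output-plugin option on the walsender's START_REPLICATION
command, next to proto_version and publication_names; a sketch (the exact
option spelling is an assumption based on the pgoutput changes above, not
copied from this hunk):

    START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (
        proto_version '2',
        publication_names '"tap_pub"',
        streaming 'on'
    )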
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransactions are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction containing DDL, subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

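As a quick illustration of the user-facing switch added above: with the
preceding patch applied, a subscription opts into streaming roughly like
this (the streaming option and the substream catalog column are taken from
the changes in this series; the connection string is just a placeholder):

    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION tap_pub
        WITH (streaming = on);

    -- the setting is recorded in the new pg_subscription.substream column
    SELECT subname, substream FROM pg_subscription;
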
v46-0004-Enable-streaming-for-all-subscription-TAP-tests.patch

From a35f1c3dd2ae773c8bedbf5554afa965a8a910a1 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v46 4/6] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b5..21410fa 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

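A side note on the new TAP scripts above: each carries its own copy of the
wait_for_caught_up helper, while the pre-existing scripts touched by this
patch already use the equivalent helper built into PostgresNode, so the
copies could presumably be replaced with a single call:

    # built-in PostgresNode helper, as used by the existing tests
    $node_publisher->wait_for_catchup($appname);
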
v46-0005-Add-TAP-test-for-streaming-vs.-DDL.patch

From fc20f7ff1c0b9d1e9c6701fadfb4e61d65c5cb1f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v46 5/6] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v46-0006-Add-streaming-option-in-pg_dump.patch

From aa330ce99a9d34e0b79026ce9fbce7a4d797ae9a Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v46 6/6] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 94459b3..f69d64c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char       *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1

#466Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#465)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Aug 5, 2020 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Can we add a test for incomplete changes (probably with toast
> > insertion but we can do it for spec_insert case as well) in
> > ReorderBuffer such that it needs to first serialize the changes and
> > then stream it? I have manually verified such scenarios but it is
> > good to have the test for the same.
>
> I have added a new test for the same in the stream.sql file.

Thanks, I have slightly changed the test so that we can consume the DDL
changes separately. I have made a number of other adjustments, such as
changing a few more comments (to make them consistent with nearby
comments), removing an unnecessary header file inclusion, and running
pgindent. The next patch (v47-0001-Implement-streaming-mode-in-ReorderBuffer)
in this series looks good to me. I am planning to push it after one more
read-through unless you or anyone else has comments on it. The patch I am
talking about has the following functionality:

Implement streaming mode in ReorderBuffer. Instead of serializing the
transaction to disk after reaching the logical_decoding_work_mem limit
in memory, we consume the changes we have in memory and invoke the
stream API methods added by commit 45fdc9738b. However, if we have an
incomplete toast or speculative insert, we spill to disk because we
can't stream until we have the complete tuple; as soon as we get the
complete tuple we stream the transaction, including the serialized
changes. Now that we can stream in-progress transactions, concurrent
aborts may cause failures when the output plugin consults catalogs
(both system and user-defined). We handle such failures by returning
the ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from the system table scan
APIs to the backend or WALSender decoding a specific uncommitted
transaction. On receipt of such a sqlerrcode, the decoding logic
aborts the decoding of the current transaction and continues with the
decoding of other transactions. We also provide a new option via the
SQL APIs to fetch the changes being streamed.

This patch's functionality can be independently verified via the SQL APIs.
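
For instance, the stream.sql test added by the patch exercises the new
'stream-changes' option roughly as follows (the slot and table names are
just the ones used in that test, and this assumes logical_decoding_work_mem
is set low enough for the transaction to exceed it):

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
CREATE TABLE stream_test(data text);

-- a transaction large enough to cross the memory limit
INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);

-- fetch the changes with streaming enabled; the output reports
-- "opening/closing a streamed block" and "streaming change" lines
-- instead of the usual per-row changes
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'include-xids', '0', 'stream-changes', '1');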

--
With Regards,
Amit Kapila.

Attachments:

v47.tar (application/x-tar)

v47-0001-Implement-streaming-mode-in-ReorderBuffer.patch

From 64dd3a33532c328a4033aa03123365c8ea54d56c Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 15 Jul 2020 18:38:32 +0530
Subject: [PATCH v47 1/6] Implement streaming mode in ReorderBuffer.

Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke stream API methods added by commit 45fdc9738b.
However, if we have an incomplete toast or speculative insert, we spill
to disk because we cannot generate the complete tuple and stream it.
And, as soon as we get the complete tuple, we stream the transaction
including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages at each command end. These
features are added by commits 0bead9af48 and c55040ccd0 respectively.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning the ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from the system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. On receipt of such a
sqlerrcode, the decoding logic aborts the decoding of the current
transaction and continues with the decoding of other transactions.

We have ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila, Nikhil Sontakke
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/stream.out       |  94 +++
 contrib/test_decoding/expected/truncate.out     |   6 +
 contrib/test_decoding/sql/stream.sql            |  30 +
 contrib/test_decoding/sql/truncate.sql          |   1 +
 contrib/test_decoding/test_decoding.c           |  13 +
 doc/src/sgml/logicaldecoding.sgml               |   9 +-
 doc/src/sgml/test-decoding.sgml                 |  22 +
 src/backend/access/heap/heapam.c                |  13 +
 src/backend/access/heap/heapam_visibility.c     |  42 +-
 src/backend/access/index/genam.c                |  53 ++
 src/backend/access/table/tableam.c              |   8 +
 src/backend/access/transam/xact.c               |  19 +
 src/backend/replication/logical/decode.c        |  17 +-
 src/backend/replication/logical/logical.c       |  10 +
 src/backend/replication/logical/reorderbuffer.c | 981 +++++++++++++++++++++---
 src/include/access/heapam_xlog.h                |   1 +
 src/include/access/tableam.h                    |  55 ++
 src/include/access/xact.h                       |   4 +
 src/include/replication/logical.h               |   1 +
 src/include/replication/reorderbuffer.h         |  56 +-
 21 files changed, 1331 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/stream.out
 create mode 100644 contrib/test_decoding/sql/stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..ed9a3d6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate
+	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
new file mode 100644
index 0000000..9a5d7e7
--- /dev/null
+++ b/contrib/test_decoding/expected/stream.out
@@ -0,0 +1,94 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+COMMIT;
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ opening a streamed block for transaction
+ streaming message: transactional: 1 prefix: test, sz: 50
+ closing a streamed block for transaction
+ aborting streamed (sub)transaction
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ committing streamed transaction
+(27 rows)
+
+-- streaming test for toast changes
+ALTER TABLE stream_test ALTER COLUMN data set storage external;
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+INSERT INTO stream_test SELECT repeat('a', 6000) || g.i FROM generate_series(1, 10) g(i);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+                   data                   
+------------------------------------------
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ committing streamed transaction
+(13 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/truncate.out b/contrib/test_decoding/expected/truncate.out
index 1cf2ae8..e64d377 100644
--- a/contrib/test_decoding/expected/truncate.out
+++ b/contrib/test_decoding/expected/truncate.out
@@ -25,3 +25,9 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (9 rows)
 
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
new file mode 100644
index 0000000..8abc30d
--- /dev/null
+++ b/contrib/test_decoding/sql/stream.sql
@@ -0,0 +1,30 @@
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+COMMIT;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+-- streaming test for toast changes
+ALTER TABLE stream_test ALTER COLUMN data set storage external;
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+INSERT INTO stream_test SELECT repeat('a', 6000) || g.i FROM generate_series(1, 10) g(i);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'stream-changes', '1');
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/truncate.sql b/contrib/test_decoding/sql/truncate.sql
index 5aecdf0..5633854 100644
--- a/contrib/test_decoding/sql/truncate.sql
+++ b/contrib/test_decoding/sql/truncate.sql
@@ -11,3 +11,4 @@ TRUNCATE tab1, tab1 RESTART IDENTITY CASCADE;
 TRUNCATE tab1, tab2;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index dbef52a..3474515 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -122,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	ListCell   *option;
 	TestDecodingData *data;
+	bool		enable_streaming = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -212,6 +213,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "stream-changes") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_streaming))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -221,6 +232,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 							elem->arg ? strVal(elem->arg) : "(null)")));
 		}
 	}
+
+	ctx->streaming &= enable_streaming;
 }
 
 /* cleanup this plugin's resources */
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 791a62b..1571d71 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -433,9 +433,12 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal>
+     scan APIs only. Access via the <literal>heap_*</literal> scan APIs will
+     error out. Additionally, any actions leading to transaction ID assignment
+     are prohibited. That, among others, includes writing to tables, performing
+     DDL changes, and calling <literal>pg_current_xact_id()</literal>.
     </para>
    </sect2>
 
diff --git a/doc/src/sgml/test-decoding.sgml b/doc/src/sgml/test-decoding.sgml
index 8356a3d..fe7c978 100644
--- a/doc/src/sgml/test-decoding.sgml
+++ b/doc/src/sgml/test-decoding.sgml
@@ -39,4 +39,26 @@ postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'i
 </programlisting>
  </para>
 
+<para>
+  We can also get the changes of an in-progress transaction, and the typical
+  output might be:
+
+<programlisting>
+postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1');
+    lsn    | xid |                       data                       
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+
 </sect1>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5eef225..0016900 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1299,6 +1299,16 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg_internal("only heap AM is supported")));
 
+	/*
+	 * We don't expect direct calls to heap_getnext with valid CheckXidAlive
+	 * for catalog or regular tables.  See detailed comments in xact.c where
+	 * these variables are declared.  Normally we have such a check at tableam
+	 * level API but this is called from many places so we need to ensure it
+	 * here.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected heap_getnext call during logical decoding");
+
 	/* Note: no locking manipulations needed */
 
 	if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
@@ -1956,6 +1966,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		{
 			xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
 			bufflags |= REGBUF_KEEP_DATA;
+
+			if (IsToastRelation(relation))
+				xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;
 		}
 
 		XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba1089..c771280 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1571,8 +1571,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmin is
+		 * definitely in the future, and we're not supposed to see the tuple
+		 * yet.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
 		if (!resolved)
-			elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+			return false;
 
 		Assert(cmin != InvalidCommandId);
 
@@ -1642,10 +1659,25 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 												 htup, buffer,
 												 &cmin, &cmax);
 
-		if (!resolved)
-			elog(ERROR, "could not resolve combocid to cmax");
-
-		Assert(cmax != InvalidCommandId);
+		/*
+		 * If we haven't resolved the combocid to cmin/cmax, that means we
+		 * have not decoded the combocid yet. That means the cmax is
+		 * definitely in the future, and we're still supposed to see the
+		 * tuple.
+		 *
+		 * XXX This only applies to decoding of in-progress transactions. In
+		 * regular logical decoding we only execute this code at commit time,
+		 * at which point we should have seen all relevant combocids. So
+		 * ideally, we should error out in this case but in practice, this
+		 * won't happen. If we are too worried about this then we can add an
+		 * elog inside ResolveCminCmaxDuringDecoding.
+		 *
+		 * XXX For the streaming case, we can track the largest combocid
+		 * assigned, and error out based on this (when unable to resolve
+		 * combocid below that observed maximum value).
+		 */
+		if (!resolved || cmax == InvalidCommandId)
+			return true;
 
 		if (cmax >= snapshot->curcid)
 			return true;		/* deleted after scan started */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index dfba5ae..06e07d2 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -28,6 +28,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -429,10 +430,37 @@ systable_beginscan(Relation heapRelation,
 		sysscan->iscan = NULL;
 	}
 
+	/*
+	 * If CheckXidAlive is set then set a flag to indicate that system table
+	 * scan is in-progress.  See detailed comments in xact.c where these
+	 * variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = true;
+
 	return sysscan;
 }
 
 /*
+ * HandleConcurrentAbort - Handle concurrent abort of the CheckXidAlive.
+ *
+ * Error out if CheckXidAlive is aborted. We can't directly use
+ * TransactionIdDidAbort because after a crash such a transaction might not
+ * have been marked as aborted.  See detailed comments in xact.c where the
+ * variable is declared.
+ */
+static inline void
+HandleConcurrentAbort()
+{
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+}
+
+/*
  * systable_getnext --- get next tuple in a heap-or-index scan
  *
  * Returns NULL if no more tuples available.
@@ -481,6 +509,12 @@ systable_getnext(SysScanDesc sysscan)
 		}
 	}
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
@@ -517,6 +551,12 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 											sysscan->slot,
 											freshsnap);
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return result;
 }
 
@@ -545,6 +585,13 @@ systable_endscan(SysScanDesc sysscan)
 	if (sysscan->snapshot)
 		UnregisterSnapshot(sysscan->snapshot);
 
+	/*
+	 * Reset the bsysscan flag at the end of the systable scan.  See
+	 * detailed comments in xact.c where these variables are declared.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive))
+		bsysscan = false;
+
 	pfree(sysscan);
 }
 
@@ -643,6 +690,12 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * Handle the concurrent abort while fetching the catalog tuple during
+	 * logical streaming of a transaction.
+	 */
+	HandleConcurrentAbort();
+
 	return htup;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 3afb63b..c638319 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -249,6 +249,14 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
 	const TableAmRoutine *tableam = rel->rd_tableam;
 
 	/*
+	 * We don't expect direct calls to table_tuple_get_latest_tid with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_get_latest_tid call during logical decoding");
+
+	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d4f7c29..727d616 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -83,6 +83,19 @@ bool		XactDeferrable;
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
 /*
+ * CheckXidAlive is an xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing, in
+ * which case we skip decoding that particular transaction.  To ensure this,
+ * we check whether CheckXidAlive has aborted after fetching a tuple from the
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs, because we check for
+ * concurrent aborts only in the systable_* APIs.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool		bsysscan = false;
+
+/*
  * When running as a parallel worker, we place only a single
  * TransactionStateData on the parallel worker's state stack, and the XID
  * reflected there will be that of the *innermost* currently-active
@@ -2680,6 +2693,9 @@ AbortTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* If in parallel mode, clean up workers and exit parallel mode. */
 	if (IsInParallelMode())
 	{
@@ -4982,6 +4998,9 @@ AbortSubTransaction(void)
 	/* Forget about any active REINDEX. */
 	ResetReindexState(s->nestingLevel);
 
+	/* Reset logical streaming state. */
+	ResetLogicalStreamingState();
+
 	/* Exit from parallel mode, if necessary. */
 	if (IsInParallelMode())
 	{
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f3a1c31..f21f61d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -724,7 +724,9 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change,
+							 xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);
 }
 
 /*
@@ -791,7 +793,8 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -848,7 +851,8 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 /*
@@ -884,7 +888,7 @@ DecodeTruncate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	memcpy(change->data.truncate.relids, xlrec->relids,
 		   xlrec->nrelids * sizeof(Oid));
 	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-							 buf->origptr, change);
+							 buf->origptr, change, false);
 }
 
 /*
@@ -984,7 +988,7 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			change->data.tp.clear_toast_afterwards = false;
 
 		ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
-								 buf->origptr, change);
+								 buf->origptr, change, false);
 
 		/* move to the next xl_multi_insert_tuple entry */
 		data += datalen;
@@ -1022,7 +1026,8 @@ DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	change->data.tp.clear_toast_afterwards = true;
 
-	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
+	ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
+							 change, false);
 }
 
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 05d24b9..42f284b 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1442,3 +1442,13 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
 }
+
+/*
+ * Clear logical streaming state during (sub)transaction abort.
+ */
+void
+ResetLogicalStreamingState(void)
+{
+	CheckXidAlive = InvalidTransactionId;
+	bsysscan = false;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ce6e621..5b7afe6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -178,6 +178,21 @@ typedef struct ReorderBufferDiskChange
 	/* data follows */
 } ReorderBufferDiskChange;
 
+#define IsSpecInsert(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
+)
+#define IsSpecConfirm(action) \
+( \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) \
+)
+#define IsInsertOrUpdate(action) \
+( \
+	(((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+	((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+	((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
+)
+
 /*
  * Maximum number of changes kept in memory, per transaction. After that,
  * changes are spooled to disk.
@@ -236,6 +251,7 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -244,6 +260,16 @@ static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 									  ReorderBufferTXN *txn, CommandId cid);
 
+/*
+ * ---------------------------------------
+ * Streaming support functions
+ * ---------------------------------------
+ */
+static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
+static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
+static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
+
 /* ---------------------------------------
  * toast reassembly support
  * ---------------------------------------
@@ -367,6 +393,9 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->tuplecids);
 	dlist_init(&txn->subtxns);
 
+	/* InvalidCommandId is not zero, so set it explicitly */
+	txn->command_id = InvalidCommandId;
+
 	return txn;
 }
 
@@ -416,13 +445,15 @@ ReorderBufferGetChange(ReorderBuffer *rb)
 }
 
 /*
- * Free an ReorderBufferChange.
+ * Free a ReorderBufferChange and update memory accounting, if requested.
  */
 void
-ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
+ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
+						  bool upd_mem)
 {
 	/* update memory accounting info */
-	ReorderBufferChangeMemoryUpdate(rb, change, false);
+	if (upd_mem)
+		ReorderBufferChangeMemoryUpdate(rb, change, false);
 
 	/* free contained data */
 	switch (change->action)
@@ -624,16 +655,102 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 }
 
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes so if we have a partial change like toast
+ * table insert or speculative insert then we mark such a 'txn' so that it
+ * can't be streamed.  We also ensure that if the changes in such a 'txn' are
+ * above the logical_decoding_work_mem threshold then we stream them as soon
+ * as we have a complete change.
+ */
+static void
+ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+								  ReorderBufferChange *change,
+								  bool toast_insert)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The partial changes need to be processed only while streaming
+	 * in-progress transactions.
+	 */
+	if (!ReorderBufferCanStream(rb))
+		return;
+
+	/* Get the top transaction. */
+	if (txn->toptxn != NULL)
+		toptxn = txn->toptxn;
+	else
+		toptxn = txn;
+
+	/*
+	 * Set the toast insert bit whenever we get a toast insert to indicate a
+	 * partial change and clear it when we get the insert or update on the main
+	 * table (both update and insert will do the insert in the toast table).
+	 */
+	if (toast_insert)
+		toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+	else if (rbtxn_has_toast_insert(toptxn) &&
+			 IsInsertOrUpdate(change->action))
+		toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+
+	/*
+	 * Set the spec insert bit whenever we get the speculative insert to
+	 * indicate the partial change and clear the same on speculative confirm.
+	 */
+	if (IsSpecInsert(change->action))
+		toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
+	else if (IsSpecConfirm(change->action))
+	{
+		/*
+		 * Speculative confirm change must be preceded by speculative
+		 * insertion.
+		 */
+		Assert(rbtxn_has_spec_insert(toptxn));
+		toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
+	}
+
+	/*
+	 * Stream the transaction if it is serialized before and the changes are
+	 * now complete in the top-level transaction.
+	 *
+	 * The reason for doing the streaming of such a transaction as soon as we
+	 * get the complete change for it is that previously it would have reached
+	 * the memory threshold and wouldn't get streamed because of incomplete
+	 * changes.  Delaying such transactions would increase apply lag for them.
+	 */
+	if (ReorderBufferCanStartStreaming(rb) &&
+		!(rbtxn_has_incomplete_tuple(toptxn)) &&
+		rbtxn_is_serialized(txn))
+		ReorderBufferStreamTXN(rb, toptxn);
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit, or
+ * streamed when we reach the logical_decoding_work_mem threshold.
  */
 void
 ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
-						 ReorderBufferChange *change)
+						 ReorderBufferChange *change, bool toast_insert)
 {
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
+	/*
+	 * While streaming the previous changes we have detected that the
+	 * transaction is aborted.  So there is no point in collecting further
+	 * changes for it.
+	 */
+	if (txn->concurrent_abort)
+	{
+		/*
+		 * We don't need to update memory accounting for this change as we
+		 * have not added it to the queue yet.
+		 */
+		ReorderBufferReturnChange(rb, change, false);
+		return;
+	}
+
 	change->lsn = lsn;
 	change->txn = txn;
 
@@ -645,6 +762,9 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	/* update memory accounting information */
 	ReorderBufferChangeMemoryUpdate(rb, change, true);
 
+	/* process partial change */
+	ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
+
 	/* check the memory limits and evict something if needed */
 	ReorderBufferCheckMemoryLimit(rb);
 }
@@ -674,7 +794,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 		change->data.msg.message = palloc(message_size);
 		memcpy(change->data.msg.message, message, message_size);
 
-		ReorderBufferQueueChange(rb, xid, lsn, change);
+		ReorderBufferQueueChange(rb, xid, lsn, change, false);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -764,6 +884,38 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 }
 
 /*
+ * AssertChangeLsnOrder
+ *
+ * Check ordering of changes in the (sub)transaction.
+ */
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+	dlist_iter	iter;
+	XLogRecPtr	prev_lsn = txn->first_lsn;
+
+	dlist_foreach(iter, &txn->changes)
+	{
+		ReorderBufferChange *cur_change;
+
+		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		Assert(txn->first_lsn != InvalidXLogRecPtr);
+		Assert(cur_change->lsn != InvalidXLogRecPtr);
+		Assert(txn->first_lsn <= cur_change->lsn);
+
+		if (txn->end_lsn != InvalidXLogRecPtr)
+			Assert(cur_change->lsn <= txn->end_lsn);
+
+		Assert(prev_lsn <= cur_change->lsn);
+
+		prev_lsn = cur_change->lsn;
+	}
+#endif
+}
+
+/*
  * ReorderBufferGetOldestTXN
  *		Return oldest transaction in reorderbuffer
  */
@@ -1018,6 +1170,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	*iter_state = NULL;
 
+	/* Check ordering of changes in the toplevel transaction. */
+	AssertChangeLsnOrder(txn);
+
 	/*
 	 * Calculate the size of our heap: one element for every transaction that
 	 * contains changes.  (Besides the transactions already in the reorder
@@ -1032,6 +1187,9 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
 
+		/* Check ordering of changes in this subtransaction. */
+		AssertChangeLsnOrder(cur_txn);
+
 		if (cur_txn->nentries > 0)
 			nr_txns++;
 	}
@@ -1148,7 +1306,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	{
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1234,7 +1392,7 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 
 		change = dlist_container(ReorderBufferChange, node,
 								 dlist_pop_head_node(&state->old_change));
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 		Assert(dlist_is_empty(&state->old_change));
 	}
 
@@ -1280,7 +1438,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1297,7 +1455,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 	}
 
 	/*
@@ -1310,6 +1468,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
 	 * Remove TXN from its containing list.
 	 *
 	 * Note: if txn is known as subxact, we are deleting the TXN from its
@@ -1335,6 +1502,91 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
+ * Discard changes from a transaction (and subtransactions), after streaming
+ * them.  Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
+ */
+static void
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	dlist_mutable_iter iter;
+
+	/* cleanup subtransactions & their changes */
+	dlist_foreach_modify(iter, &txn->subtxns)
+	{
+		ReorderBufferTXN *subtxn;
+
+		subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		/*
+		 * Subtransactions are always associated to the toplevel TXN, even if
+		 * they originally were happening inside another subtxn, so we won't
+		 * ever recurse more than one level deep here.
+		 */
+		Assert(rbtxn_is_known_subxact(subtxn));
+		Assert(subtxn->nsubtxns == 0);
+
+		ReorderBufferTruncateTXN(rb, subtxn);
+	}
+
+	/* cleanup changes in the toplevel txn */
+	dlist_foreach_modify(iter, &txn->changes)
+	{
+		ReorderBufferChange *change;
+
+		change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+		/* Check we're not mixing changes from different transactions. */
+		Assert(change->txn == txn);
+
+		/* remove the change from its containing list */
+		dlist_delete(&change->node);
+
+		ReorderBufferReturnChange(rb, change, true);
+	}
+
+	/*
+	 * Mark the transaction as streamed.
+	 *
+	 * The toplevel transaction, identified by (toptxn==NULL), is always
+	 * marked as streamed, even if it does not contain any changes (that is,
+	 * when all the changes are in subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+
+	/*
+	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
+	 * memory. We could also keep the hash table and update it with new ctid
+	 * values, but this seems simpler and good enough for now.
+	 */
+	if (txn->tuplecid_hash != NULL)
+	{
+		hash_destroy(txn->tuplecid_hash);
+		txn->tuplecid_hash = NULL;
+	}
+
+	/* If this txn is serialized then clean the disk space. */
+	if (rbtxn_is_serialized(txn))
+	{
+		ReorderBufferRestoreCleanup(rb, txn);
+		txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
+	}
+
+	/* also reset the number of entries in the transaction */
+	txn->nentries_mem = 0;
+	txn->nentries = 0;
+}
+
+/*
  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
  */
@@ -1485,57 +1737,191 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * Perform the replay of a transaction and its non-aborted subtransactions.
- *
- * Subtransactions previously have to be processed by
- * ReorderBufferCommitChild(), even if previously assigned to the toplevel
- * transaction with ReorderBufferAssignChild.
- *
- * We currently can only decode a transaction's contents when its commit
- * record is read because that's the only place where we know about cache
- * invalidations. Thus, once a toplevel commit is read, we iterate over the top
- * and subtransactions (using a k-way merge) and replay the changes in lsn
- * order.
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way.  That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_commit callback.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	ReorderBufferTXN *txn;
-	volatile Snapshot snapshot_now;
-	volatile CommandId command_id = FirstCommandId;
-	bool		using_subtxn;
-	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	/* we should only call this for previously streamed transactions */
+	Assert(rbtxn_is_streamed(txn));
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
+	ReorderBufferStreamTXN(rb, txn);
 
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
+	rb->stream_commit(rb, txn, txn->final_lsn);
 
-	txn->final_lsn = commit_lsn;
-	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
-	txn->origin_id = origin_id;
-	txn->origin_lsn = origin_lsn;
+	ReorderBufferCleanupTXN(rb, txn);
+}
 
+/*
+ * Set xid to detect concurrent aborts.
+ *
+ * While streaming an in-progress transaction there is a possibility that the
+ * (sub)transaction might get aborted concurrently.  In such a case, if the
+ * (sub)transaction has a catalog update, we might decode the tuple using the
+ * wrong catalog version.  For example, suppose there is one catalog tuple with
+ * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
+ * and after that we will have two tuples (xmin: 500, xmax: 501) and
+ * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
+ * say 502 updates the same catalog tuple then the first tuple will be changed
+ * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
+ * the tuple inserted/updated in 501 after the catalog update, we will see the
+ * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
+ * consider that the tuple is deleted by xid 502 which is not visible to our
+ * snapshot.  And when we try to decode with that catalog tuple, it can
+ * lead to a wrong result or a crash.  So, it is necessary to detect
+ * concurrent aborts to allow streaming of in-progress transactions.
+ *
+ * For detecting the concurrent abort we set CheckXidAlive to the current
+ * (sub)transaction's xid for which this change belongs to.  And, during
+ * catalog scan we can check the status of the xid and if it is aborted we will
+ * report a specific error so that we can stop streaming current transaction
+ * and discard the already streamed changes on such an error.  We might have
+ * already streamed some of the changes for the aborted (sub)transaction, but
+ * that is fine because when we decode the abort we will stream abort message
+ * to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
 	/*
-	 * If this transaction has no snapshot, it didn't make any changes to the
-	 * database, so there's nothing to decode.  Note that
-	 * ReorderBufferCommitChild will have transferred any snapshots from
-	 * subtransactions if there were any.
+	 * If the input transaction id is already set as a CheckXidAlive then
+	 * nothing to do.
 	 */
-	if (txn->base_snapshot == NULL)
-	{
-		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+	if (TransactionIdEquals(CheckXidAlive, xid))
 		return;
+
+	/*
+	 * setup CheckXidAlive if it's not committed yet.  We don't check if the
+	 * xid is aborted.  That will happen during catalog access.
+	 */
+	if (!TransactionIdDidCommit(xid))
+		CheckXidAlive = xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying change.
+ */
+static inline void
+ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						 Relation relation, ReorderBufferChange *change,
+						 bool streaming)
+{
+	if (streaming)
+		rb->stream_change(rb, txn, relation, change);
+	else
+		rb->apply_change(rb, txn, relation, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the truncate.
+ */
+static inline void
+ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						   int nrelations, Relation *relations,
+						   ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_truncate(rb, txn, nrelations, relations, change);
+	else
+		rb->apply_truncate(rb, txn, nrelations, relations, change);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN for applying the message.
+ */
+static inline void
+ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						  ReorderBufferChange *change, bool streaming)
+{
+	if (streaming)
+		rb->stream_message(rb, txn, change->lsn, true,
+						   change->data.msg.prefix,
+						   change->data.msg.message_size,
+						   change->data.msg.message);
+	else
+		rb->message(rb, txn, change->lsn, true,
+					change->data.msg.prefix,
+					change->data.msg.message_size,
+					change->data.msg.message);
+}
+
+/*
+ * Function to store the command id and snapshot at the end of the current
+ * stream so that we can reuse the same while sending the next stream.
+ */
+static inline void
+ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 Snapshot snapshot_now, CommandId command_id)
+{
+	txn->command_id = command_id;
+
+	/* Avoid copying if it's already copied. */
+	if (snapshot_now->copied)
+		txn->snapshot_now = snapshot_now;
+	else
+		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+												  txn, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferProcessTXN to handle the concurrent
+ * abort of the streaming transaction.  This resets the TXN such that it
+ * can be used to stream the remaining data of the transaction being processed.
+ */
+static void
+ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+					  Snapshot snapshot_now,
+					  CommandId command_id,
+					  XLogRecPtr last_lsn,
+					  ReorderBufferChange *specinsert)
+{
+	/* Discard the changes that we just streamed */
+	ReorderBufferTruncateTXN(rb, txn);
+
+	/* Free all resources allocated for toast reconstruction */
+	ReorderBufferToastReset(rb, txn);
+
+	/* Return the spec insert change if it is not NULL */
+	if (specinsert != NULL)
+	{
+		ReorderBufferReturnChange(rb, specinsert, true);
+		specinsert = NULL;
 	}
 
-	snapshot_now = txn->base_snapshot;
+	/* Stop the stream. */
+	rb->stream_stop(rb, txn, last_lsn);
+
+	/* Remember the command ID and snapshot for the streaming run */
+	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+}
+
+/*
+ * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ *
+ * Send data of a transaction (and its subtransactions) to the
+ * output plugin. We iterate over the top and subtransactions (using a k-way
+ * merge) and replay the changes in lsn order.
+ *
+ * If streaming is true then data will be sent using stream API.
+ */
+static void
+ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn,
+						volatile Snapshot snapshot_now,
+						volatile CommandId command_id,
+						bool streaming)
+{
+	bool		using_subtxn;
+	MemoryContext ccxt = CurrentMemoryContext;
+	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
+	ReorderBufferChange *volatile specinsert = NULL;
+	volatile bool stream_started = false;
+	ReorderBufferTXN *volatile curtxn = NULL;
 
 	/* build data to be able to lookup the CommandIds of catalog tuples */
 	ReorderBufferBuildTupleCidHash(rb, txn);
@@ -1558,14 +1944,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
-		ReorderBufferChange *specinsert = NULL;
 
 		if (using_subtxn)
-			BeginInternalSubTransaction("replay");
+			BeginInternalSubTransaction(streaming ? "stream" : "replay");
 		else
 			StartTransactionCommand();
 
-		rb->begin(rb, txn);
+		/* We only need to send begin/commit for non-streamed transactions. */
+		if (!streaming)
+			rb->begin(rb, txn);
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -1573,6 +1960,36 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/*
+			 * We can't call the stream_start callback before processing the
+			 * first change.
+			 */
+			if (prev_lsn == InvalidXLogRecPtr)
+			{
+				if (streaming)
+				{
+					txn->origin_id = change->origin_id;
+					rb->stream_start(rb, txn, change->lsn);
+					stream_started = true;
+				}
+			}
+
+			/*
+			 * Enforce correct ordering of changes, merged from multiple
+			 * subtransactions. The changes may have the same LSN due to
+			 * MULTI_INSERT xlog records.
+			 */
+			Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
+
+			prev_lsn = change->lsn;
+
+			/* Set the current xid to detect concurrent aborts. */
+			if (streaming)
+			{
+				curtxn = change->txn;
+				SetupCheckXidLive(curtxn->xid);
+			}
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1649,7 +2066,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+						ReorderBufferApplyChange(rb, txn, relation, change,
+												 streaming);
 
 						/*
 						 * Only clear reassembled toast chunks if we're sure
@@ -1685,11 +2103,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					 */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
-					if (relation != NULL)
+					if (RelationIsValid(relation))
 					{
 						RelationClose(relation);
 						relation = NULL;
@@ -1714,7 +2132,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					/* clear out a pending (and thus failed) speculation */
 					if (specinsert != NULL)
 					{
-						ReorderBufferReturnChange(rb, specinsert);
+						ReorderBufferReturnChange(rb, specinsert, true);
 						specinsert = NULL;
 					}
 
@@ -1747,7 +2165,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							relations[nrelations++] = relation;
 						}
 
-						rb->apply_truncate(rb, txn, nrelations, relations, change);
+						/* Apply the truncate. */
+						ReorderBufferApplyTruncate(rb, txn, nrelations,
+												   relations, change,
+												   streaming);
 
 						for (i = 0; i < nrelations; i++)
 							RelationClose(relations[i]);
@@ -1756,10 +2177,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					}
 
 				case REORDER_BUFFER_CHANGE_MESSAGE:
-					rb->message(rb, txn, change->lsn, true,
-								change->data.msg.prefix,
-								change->data.msg.message_size,
-								change->data.msg.message);
+					ReorderBufferApplyMessage(rb, txn, change, streaming);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1790,7 +2208,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						snapshot_now = change->data.snapshot;
 					}
 
-
 					/* and continue with the new one */
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					break;
@@ -1837,7 +2254,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 */
 		if (specinsert)
 		{
-			ReorderBufferReturnChange(rb, specinsert);
+			ReorderBufferReturnChange(rb, specinsert, true);
 			specinsert = NULL;
 		}
 
@@ -1845,14 +2262,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Done with current changes, send the last message for this set of
+		 * changes depending upon streaming mode.
+		 */
+		if (streaming)
+		{
+			if (stream_started)
+			{
+				rb->stream_stop(rb, txn, prev_lsn);
+				stream_started = false;
+			}
+		}
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
 			elog(ERROR, "output plugin used XID %u",
 				 GetCurrentTransactionId());
 
+		/*
+		 * Remember the command ID and snapshot for the next set of changes in
+		 * streaming mode.
+		 */
+		if (streaming)
+			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+		else if (snapshot_now->copied)
+			ReorderBufferFreeSnap(rb, snapshot_now);
+
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
 
@@ -1870,14 +2308,27 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * If we are streaming the in-progress transaction then discard the
+		 * changes that we just streamed, and mark the transactions as
+		 * streamed (if they contained changes). Otherwise, remove all the
+		 * changes and deallocate the ReorderBufferTXN.
+		 */
+		if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn);
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
+		else
+			ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
 	{
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1896,15 +2347,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (using_subtxn)
 			RollbackAndReleaseCurrentSubTransaction();
 
-		if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
+		/*
+		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
+		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			/*
+			 * This error can only occur when we are sending the data in
+			 * streaming mode and the streaming is not finished yet.
+			 */
+			Assert(streaming);
+			Assert(stream_started);
+
+			/* Cleanup the temporary error state. */
+			FlushErrorState();
+			FreeErrorData(errdata);
+			errdata = NULL;
+			curtxn->concurrent_abort = true;
+
+			/* Reset the TXN so that it is allowed to stream remaining data. */
+			ReorderBufferResetTXN(rb, txn, snapshot_now,
+								  command_id, prev_lsn,
+								  specinsert);
+		}
+		else
+		{
+			ReorderBufferCleanupTXN(rb, txn);
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+	}
+	PG_END_TRY();
+}
 
-		/* remove potential on-disk data, and deallocate */
-		ReorderBufferCleanupTXN(rb, txn);
+/*
+ * Perform the replay of a transaction and its non-aborted subtransactions.
+ *
+ * Subtransactions have to be processed by ReorderBufferCommitChild()
+ * beforehand, even if previously assigned to the toplevel transaction with
+ * ReorderBufferAssignChild.
+ *
+ * This interface is called once a toplevel commit is read for both streamed
+ * as well as non-streamed transactions.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+	Snapshot	snapshot_now;
+	CommandId	command_id = FirstCommandId;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
 
-		PG_RE_THROW();
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	/*
+	 * If the transaction was (partially) streamed, we need to commit it in a
+	 * 'streamed' way. That is, we first stream the remaining part of the
+	 * transaction, and then invoke the stream_commit callback.
+	 *
+	 * Called after everything (origin ID, LSN, ...) is stored in the
+	 * transaction to avoid passing that information directly.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		ReorderBufferStreamCommit(rb, txn);
+		return;
 	}
-	PG_END_TRY();
+
+	/*
+	 * If this transaction has no snapshot, it didn't make any changes to the
+	 * database, so there's nothing to decode.  Note that
+	 * ReorderBufferCommitChild will have transferred any snapshots from
+	 * subtransactions if there were any.
+	 */
+	if (txn->base_snapshot == NULL)
+	{
+		Assert(txn->ninvalidations == 0);
+		ReorderBufferCleanupTXN(rb, txn);
+		return;
+	}
+
+	snapshot_now = txn->base_snapshot;
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
+							command_id, false);
 }
 
 /*
@@ -1931,6 +2473,22 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions, notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_abort(rb, txn, lsn);
+
+		/*
+		 * We might have decoded changes for this transaction that could load
+		 * the cache as per the current transaction's view (consider DDL
+		 * executed in this transaction). We don't want the decoding of future
+		 * transactions to use those cache entries, so execute invalidations.
+		 */
+		if (txn->ninvalidations > 0)
+			ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+											   txn->invalidations);
+	}
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2000,6 +2558,10 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
+	/* For streamed transactions, notify the remote node about the abort. */
+	if (rbtxn_is_streamed(txn))
+		rb->stream_abort(rb, txn, lsn);
+
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
@@ -2082,7 +2644,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
 	change->data.snapshot = snap;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
@@ -2131,12 +2693,21 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 	change->data.command_id = cid;
 	change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
 
-	ReorderBufferQueueChange(rb, xid, lsn, change);
+	ReorderBufferQueueChange(rb, xid, lsn, change, false);
 }
 
 /*
- * Update the memory accounting info. We track memory used by the whole
- * reorder buffer and the transaction containing the change.
+ * Update memory counters to account for the new or removed change.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we additionally track the total size in the
+ * toplevel transaction - we can't stream subtransactions individually
+ * anyway, and we only pick toplevel transactions for streaming-based
+ * eviction, so only the toplevel total matters there.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -2144,6 +2715,8 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition)
 {
 	Size		sz;
+	ReorderBufferTXN *txn;
+	ReorderBufferTXN *toptxn = NULL;
 
 	Assert(change->txn);
 
@@ -2155,19 +2728,41 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
+	txn = change->txn;
+
+	/* If streaming is supported, also track the total size in the toplevel. */
+	if (ReorderBufferCanStream(rb))
+	{
+		if (txn->toptxn != NULL)
+			toptxn = txn->toptxn;
+		else
+			toptxn = txn;
+	}
+
 	sz = ReorderBufferChangeSize(change);
 
 	if (addition)
 	{
-		change->txn->size += sz;
+		txn->size += sz;
 		rb->size += sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size += sz;
 	}
 	else
 	{
-		Assert((rb->size >= sz) && (change->txn->size >= sz));
-		change->txn->size -= sz;
+		Assert((rb->size >= sz) && (txn->size >= sz));
+		txn->size -= sz;
 		rb->size -= sz;
+
+		/* Update the total size in the top transaction. */
+		if (toptxn)
+			toptxn->total_size -= sz;
 	}
+
+	Assert(txn->size <= rb->size);
+	Assert((txn->size >= 0) && (rb->size >= 0));
 }
 
 /*
@@ -2388,6 +2983,51 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
+ * Find the largest toplevel transaction to evict (by streaming).
+ *
+ * This can be seen as an optimized version of ReorderBufferLargestTXN, which
+ * should give us the same transaction (because with streaming we only account
+ * the total size in the toplevel transaction, so it's always 0 for
+ * subtransactions). But here we can simply iterate over the limited number of
+ * toplevel transactions.
+ *
+ * Note that we skip transactions that contain incomplete changes.  There is
+ * scope for optimization here, in that we could select the largest transaction
+ * that has only complete changes.  But that would make the code and design
+ * quite complex, and might not be worth the benefit.  If we plan to stream
+ * transactions that contain incomplete changes, we need a way to partially
+ * stream/truncate the transaction changes in-memory, and a mechanism to
+ * partially truncate the spilled files.  Additionally, whenever we partially
+ * stream a transaction, we need to remember the last streamed LSN, so that
+ * next time we can restore from that segment and offset in the WAL.  As we
+ * stream the changes from the top transaction and restore them
+ * subtransaction-wise, we would even need to remember the subxact from which
+ * we streamed the last change.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTopTXN(ReorderBuffer *rb)
+{
+	dlist_iter	iter;
+	Size		largest_size = 0;
+	ReorderBufferTXN *largest = NULL;
+
+	/* Find the largest top-level transaction. */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	{
+		ReorderBufferTXN *txn;
+
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+
+		if ((largest == NULL || txn->total_size > largest_size) &&
+			(txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn)))
+		{
+			largest = txn;
+			largest_size = txn->total_size;
+		}
+	}
+
+	return largest;
+}
+
+/*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
  * disk until we reach under the memory limit.
@@ -2419,11 +3059,33 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	{
 		/*
-		 * Pick the largest transaction (or subtransaction) and evict it from
-		 * memory by serializing it to disk.
+		 * Pick the largest toplevel transaction and evict it from memory by
+		 * streaming, if possible.  Otherwise, pick the largest
+		 * (sub)transaction and spill it to disk.
 		 */
-		txn = ReorderBufferLargestTXN(rb);
+		if (ReorderBufferCanStartStreaming(rb) &&
+			(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
+		{
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn && !txn->toptxn);
+			Assert(txn->total_size > 0);
+			Assert(rb->size >= txn->total_size);
 
-		ReorderBufferSerializeTXN(rb, txn);
+			ReorderBufferStreamTXN(rb, txn);
+		}
+		else
+		{
+			/*
+			 * Pick the largest transaction (or subtransaction) and evict it
+			 * from memory by serializing it to disk.
+			 */
+			txn = ReorderBufferLargestTXN(rb);
+
+			/* we know there has to be one, because the size is not zero */
+			Assert(txn);
+			Assert(txn->size > 0);
+			Assert(rb->size >= txn->size);
+
+			ReorderBufferSerializeTXN(rb, txn);
+		}
 
 		/*
 		 * After eviction, the transaction should have no entries in memory,
@@ -2501,7 +3163,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change);
+		ReorderBufferReturnChange(rb, change, true);
 
 		spilled++;
 	}
@@ -2713,6 +3375,136 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	Assert(ondisk->change.action == change->action);
 }
 
+/* Returns true if the output plugin supports streaming, false otherwise. */
+static inline bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+
+	return ctx->streaming;
+}
+
+/* Returns true if streaming can be started now, false otherwise. */
+static inline bool
+ReorderBufferCanStartStreaming(ReorderBuffer *rb)
+{
+	LogicalDecodingContext *ctx = rb->private_data;
+	SnapBuild  *builder = ctx->snapshot_builder;
+
+	/*
+	 * We can't start streaming immediately, even if streaming is enabled,
+	 * when we are merely restarting and re-reading a transaction that we
+	 * already decoded before.
+	 */
+	if (ReorderBufferCanStream(rb) &&
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+	{
+		/* We must have a consistent snapshot by this time */
+		Assert(SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Send data of a large transaction (and its subtransactions) to the
+ * output plugin, but using the stream API.
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/* We can never reach here for a subtransaction. */
+	Assert(txn->toptxn == NULL);
+
+	/*
+	 * We can't make the same assumptions about the base snapshot here that
+	 * ReorderBufferCommit() makes. That function relies on base_snapshot
+	 * getting transferred from subxacts in ReorderBufferCommitChild(), but
+	 * that has not been called yet, as the transaction is still in-progress.
+	 *
+	 * So just walk the subxacts and use the same logic here. But we only need
+	 * to do that once, when the transaction is streamed for the first time.
+	 * After that we need to reuse the snapshot from the previous run.
+	 *
+	 * Unlike DecodeCommit, which adds the xids of all the subtransactions to
+	 * the snapshot's xip array via SnapBuildCommittedTxn, we can't do that
+	 * here; instead we add them to the subxip array via
+	 * ReorderBufferCopySnap. This allows the catalog changes made in
+	 * subtransactions decoded so far to be visible.
+	 */
+	if (txn->snapshot_now == NULL)
+	{
+		dlist_iter	subxact_i;
+
+		/* make sure this transaction is streamed for the first time */
+		Assert(!rbtxn_is_streamed(txn));
+
+		/* at the beginning we should have invalid command ID */
+		Assert(txn->command_id == InvalidCommandId);
+
+		dlist_foreach(subxact_i, &txn->subtxns)
+		{
+			ReorderBufferTXN *subtxn;
+
+			subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+			ReorderBufferTransferSnapToParent(txn, subtxn);
+		}
+
+		/*
+		 * If this transaction has no snapshot, it hasn't made any changes to
+		 * the database so far, so there's nothing to decode.
+		 */
+		if (txn->base_snapshot == NULL)
+		{
+			Assert(txn->ninvalidations == 0);
+			return;
+		}
+
+		command_id = FirstCommandId;
+		snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+											 txn, command_id);
+	}
+	else
+	{
+		/* the transaction must have been already streamed */
+		Assert(rbtxn_is_streamed(txn));
+
+		/*
+		 * We already have a snapshot from the previous streaming run. We
+		 * assume new subxacts can't move the LSN backwards, and so can't beat
+		 * the LSN condition in the previous branch (so no need to walk
+		 * through subxacts again). In fact, we must not do that, as we may be
+		 * using the snapshot half-way through the subxact.
+		 */
+		command_id = txn->command_id;
+
+		/*
+		 * We can't use txn->snapshot_now directly because after the last
+		 * streaming run, we might have got some new sub-transactions. So we
+		 * need to add them to the snapshot.
+		 */
+		snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+											 txn, command_id);
+
+		/* Free the previously copied snapshot. */
+		Assert(txn->snapshot_now->copied);
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		txn->snapshot_now = NULL;
+	}
+
+	/* Process and send the changes to output plugin. */
+	ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
+							command_id, true);
+
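+	/* All changes must have been streamed out and discarded by now. */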
+	Assert(dlist_is_empty(&txn->changes));
+	Assert(txn->nentries == 0);
+	Assert(txn->nentries_mem == 0);
+}
+
 /*
  * Size of a change in memory.
  */
@@ -2813,7 +3605,7 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup);
+		ReorderBufferReturnChange(rb, cleanup, true);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
@@ -3522,7 +4314,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			dlist_container(ReorderBufferChange, node, it.cur);
 
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change);
+			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
 
@@ -3812,6 +4604,17 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
 	BlockNumber blockno;
 	bool		updated_mapping = false;
 
+	/*
+	 * Return unresolved if tuplecid_data is not valid.  That's because when
+	 * streaming in-progress transactions we may run into tuples with the CID
+	 * before actually decoding them.  Think e.g. about INSERT followed by
+	 * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
+	 * INSERT.  So in such cases, we assume the CID is from the future
+	 * command.
+	 */
+	if (tuplecid_data == NULL)
+		return false;
+
 	/* be careful about padding */
 	memset(&key, 0, sizeof(key));
 
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 95d18cd..aa17f7d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLH_INSERT_LAST_IN_MULTI				(1<<1)
 #define XLH_INSERT_IS_SPECULATIVE				(1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE			(1<<3)
+#define XLH_INSERT_ON_TOAST_RELATION			(1<<4)
 
 /*
  * xl_heap_update flag values, 8 bits are available.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7ba72c8..387eb34 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -19,6 +19,7 @@
 
 #include "access/relscan.h"
 #include "access/sdir.h"
+#include "access/xact.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -903,6 +904,15 @@ static inline bool
 table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
 {
 	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);
+
+	/*
+	 * We don't expect direct calls to table_scan_getnextslot with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");
+
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
@@ -1017,6 +1027,13 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						TupleTableSlot *slot,
 						bool *call_again, bool *all_dead)
 {
+	/*
+	 * We don't expect direct calls to table_index_fetch_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
@@ -1056,6 +1073,14 @@ table_tuple_fetch_row_version(Relation rel,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_tuple_fetch_row_version with
+	 * valid CheckXidAlive for catalog or regular tables.  See detailed
+	 * comments in xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
+
 	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
 }
 
@@ -1713,6 +1738,14 @@ static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
 														   tbmres);
 }
@@ -1730,6 +1763,14 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 							 struct TBMIterateResult *tbmres,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
+
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
 														   tbmres,
 														   slot);
@@ -1748,6 +1789,13 @@ static inline bool
 table_scan_sample_next_block(TableScanDesc scan,
 							 struct SampleScanState *scanstate)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_block with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
 }
 
@@ -1764,6 +1812,13 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 							 struct SampleScanState *scanstate,
 							 TupleTableSlot *slot)
 {
+	/*
+	 * We don't expect direct calls to table_scan_sample_next_tuple with valid
+	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
+	 * xact.c where these variables are declared.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
+		elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");
 	return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
 														   slot);
 }
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5348011..c18554b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,10 @@ typedef enum
 /* Synchronous commit level */
 extern int	synchronous_commit;
 
+/* used during logical streaming of a transaction */
+extern TransactionId CheckXidAlive;
+extern bool bsysscan;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index deef318..b0fae98 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -121,5 +121,6 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
+extern void ResetLogicalStreamingState(void);
 
 #endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 42bc817..1ae17d5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -162,6 +162,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_IS_STREAMED         0x0008
+#define RBTXN_HAS_TOAST_INSERT    0x0010
+#define RBTXN_HAS_SPEC_INSERT     0x0020
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -181,6 +184,40 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0 \
 )
 
+/*
+ * This transaction's changes contain a toast insert without the main-table
+ * insert.
+ */
+#define rbtxn_has_toast_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_TOAST_INSERT) != 0 \
+)
+/*
+ * This transaction's changes contain a speculative insert without the
+ * speculative confirm.
+ */
+#define rbtxn_has_spec_insert(txn) \
+( \
+	((txn)->txn_flags & RBTXN_HAS_SPEC_INSERT) != 0 \
+)
+
+/* Check whether this transaction has an incomplete change. */
+#define rbtxn_has_incomplete_tuple(txn) \
+( \
+	rbtxn_has_toast_insert(txn) || rbtxn_has_spec_insert(txn) \
+)
+
+/*
+ * Has this transaction been streamed to the downstream node?
+ *
+ * (It's not possible to deduce this from nentries and nentries_mem for
+ * various reasons. For example, all changes may be in subtransactions in
+ * which case we'd have nentries==0 for the toplevel one, which would say
+ * nothing about the streaming. So we maintain this flag, but only for the
+ * toplevel transaction.)
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -249,6 +286,13 @@ typedef struct ReorderBufferTXN
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
 	/*
+	 * Snapshot/CID from the previous streaming run. Only valid for already
+	 * streamed transactions (NULL/InvalidCommandId otherwise).
+	 */
+	Snapshot	snapshot_now;
+	CommandId	command_id;
+
+	/*
 	 * How many ReorderBufferChange's do we have in this txn.
 	 *
 	 * Changes in subtransactions are *not* included but tracked separately.
@@ -313,6 +357,12 @@ typedef struct ReorderBufferTXN
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
 	Size		size;
+
+	/* Size of top-transaction including sub-transactions. */
+	Size		total_size;
+
+	/* If we have detected concurrent abort then ignore future changes. */
+	bool		concurrent_abort;
 } ReorderBufferTXN;
 
 /* so we can define the callbacks used inside struct ReorderBuffer itself */
@@ -484,12 +534,14 @@ void		ReorderBufferFree(ReorderBuffer *);
 ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *, Size tuple_len);
 void		ReorderBufferReturnTupleBuf(ReorderBuffer *, ReorderBufferTupleBuf *tuple);
 ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
-void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+void		ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *, bool);
 
 Oid		   *ReorderBufferGetRelids(ReorderBuffer *, int nrelids);
 void		ReorderBufferReturnRelids(ReorderBuffer *, Oid *relids);
 
-void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+void		ReorderBufferQueueChange(ReorderBuffer *, TransactionId,
+									 XLogRecPtr lsn, ReorderBufferChange *,
+									 bool toast_insert);
 void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
-- 
1.8.3.1
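
To make 0001 easier to review, the concurrent-abort handling added to
ReorderBufferProcessTXN() boils down to the pattern below (a condensed
sketch, not the actual code; setup and most of the cleanup are elided):

    PG_TRY();
    {
        /*
         * Apply the changes. With streaming, SetupCheckXidLive(xid) makes
         * catalog access raise ERRCODE_TRANSACTION_ROLLBACK if the
         * (sub)transaction being decoded aborts concurrently.
         */
    }
    PG_CATCH();
    {
        /* Switch out of the error context before copying the error. */
        MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
        ErrorData  *errdata = CopyErrorData();

        if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
        {
            /* Concurrent abort: swallow the error and wind down. */
            FlushErrorState();
            FreeErrorData(errdata);
            curtxn->concurrent_abort = true;

            /* Allow the remaining decoded data to be streamed later. */
            ReorderBufferResetTXN(rb, txn, snapshot_now,
                                  command_id, prev_lsn, specinsert);
        }
        else
        {
            /* Any other error is fatal to decoding this transaction. */
            ReorderBufferCleanupTXN(rb, txn);
            MemoryContextSwitchTo(ecxt);
            PG_RE_THROW();
        }
    }
    PG_END_TRY();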

v47-0002-Extend-the-BufFile-interface-for-the-streaming-o.patch

From 655098f97b70e660190bea1667559c6568d59d28 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v47 2/6] Extend the BufFile interface for the streaming of
 in-progress transactions.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.

Implement a BufFileTruncate interface to allow files to be truncated up to
a particular offset.  Extend the BufFileSeek API to support the SEEK_END
case.  Add an option to provide a mode while opening shared BufFiles,
instead of always opening them in read-only mode.
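
For illustration, single-backend use of the extended interface could look
roughly like this (a sketch only, not part of the patch; the file name is
made up and error handling is omitted):

    SharedFileSet fileset;
    BufFile    *file;
    char        buf[8192];

    /* seg == NULL: single-backend fileset, cleaned up on proc exit */
    SharedFileSetInit(&fileset, NULL);

    file = BufFileCreateShared(&fileset, "xid-1234-changes");
    BufFileWrite(file, buf, sizeof(buf));
    BufFileClose(file);

    /* later, in the same backend: reopen in read-write mode and append */
    file = BufFileOpenShared(&fileset, "xid-1234-changes", O_RDWR);
    if (BufFileSeek(file, 0, 0, SEEK_END) != 0)
        elog(ERROR, "could not seek in temporary file");
    BufFileWrite(file, buf, sizeof(buf));

    /* ... or discard everything past a remembered (fileno, offset) */
    BufFileTruncateShared(file, 0, 0);
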
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 86 +++++++++++++++++++++++----
 src/backend/storage/file/fd.c             |  9 ++-
 src/backend/storage/file/sharedfileset.c  | 98 ++++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 15f92b6..3804412 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 2d7a082..a9ca5d9 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as
+ * a member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY);
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,22 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The size of the last file gives us the offset of the end of
+			 * that file, which is where we want to seek.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +856,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop backwards over the files, down to the fileno we truncate to. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files beyond the fileno can be deleted outright.  If the offset is
+		 * 0 then the fileno file can be deleted as well, unless it is the
+		 * first file, which we always retain.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..9a3dc10 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but need to be opened and closed multiple times, and the
+ * underlying files need to survive across transactions.  For such cases, the
+ * dsm segment 'seg' should be passed as NULL.  We remove such files on proc
+ * exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * No fileset may have been registered before we register the
+			 * cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
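+		/* Remember the fileset so it can be cleaned up on proc exit. */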
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that is invoked on process exit.  It walks the list of
+ * all registered SharedFileSets and deletes the underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell   *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach(l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is following the dsm-based cleanup then we don't
+	 * maintain the filesetlist, so there is nothing to unregister.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

v47-0003-Add-support-for-streaming-to-built-in-replicatio.patch

From ac3ee7d69fa6279c0534eb4a6950cdfc26ac721f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v47 3/6] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information
(e.g. the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions, by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere
to send the data anyway.
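
On the wire, a streamed transaction is sent as a sequence of chunks, each
bracketed by stream-start ('S') and stream-stop ('E') messages, with a
final stream-commit ('c'); the regular change messages carry the XID first
when streamed. Roughly, the subscriber-side dispatch looks like this (an
illustrative sketch only, not the actual apply worker code; declarations
and the buffering logic are elided):

    switch (action)
    {
        case 'S':               /* stream start */
            xid = logicalrep_read_stream_start(s, &first_segment);
            /* open the per-xid spill file; create it if first_segment */
            break;

        case 'E':               /* stream stop */
            /* close the per-xid spill file until the next chunk arrives */
            break;

        case 'c':               /* stream commit */
            xid = logicalrep_read_stream_commit(s, &commit_data);
            /* replay the spilled changes for xid, then report progress */
            break;

        default:
            /* 'I'/'U'/'D'/'R'/..., preceded by the xid when streamed */
            break;
    }
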
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 946 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 +++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 19 files changed, 2060 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal> and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3804412..bb0f95a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e905723..a6101ac 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -408,6 +408,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..ff25924 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
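
For anyone implementing a receiver outside the backend, here's a minimal
standalone sketch (illustration only, not part of the patch) of decoding the
STREAM START message as serialized above: an action byte 'S', a 4-byte XID
written by pq_sendint32 in network byte order, and a one-byte first-segment
flag:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

/* Parse a STREAM START message: 'S', uint32 xid (big-endian), uint8 flag. */
static uint32_t
parse_stream_start(const unsigned char *msg, int *first_segment)
{
	uint32_t	xid;

	/* msg[0] is the action byte, 'S' for STREAM START */
	memcpy(&xid, msg + 1, sizeof(xid));
	xid = ntohl(xid);			/* pq_sendint32 writes network byte order */

	/* 1 if this is the first streamed segment for this xid */
	*first_segment = (msg[5] == 1);

	return xid;
}

int
main(void)
{
	/* 'S', xid 1234 (0x000004D2), first_segment = 1 */
	const unsigned char msg[] = {'S', 0x00, 0x00, 0x04, 0xD2, 0x01};
	int			first;
	uint32_t	xid = parse_stream_start(msg, &first);

	printf("xid = %u, first_segment = %d\n", xid, first);
	return 0;
}

The other stream messages follow the same pattern, so a receiver only needs
the action byte to decide how to parse the rest.
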
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 2fcf2e6..98e7fd0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires handling aborts of both the toplevel transaction and of
+ * individual subtransactions. This is achieved by tracking offsets for
+ * subtransactions, which are then used to truncate the file with the
+ * serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of ordinary temporary files because (a) the
+ * BufFile infrastructure supports temporary files exceeding the OS file
+ * size limit, (b) it provides automatic cleanup on error, and (c) the files
+ * can survive across local transactions, so we can open and close them at
+ * stream start and stop.  We build on SharedFileSet because a plain BufFile
+ * is deleted as soon as it is closed, while keeping the stream files open
+ * across start/stop would consume a lot of memory (more than 8kB per file).
+ * Moreover, without SharedFileSet we would need a new way to pass filenames
+ * to the BufFile APIs, so that we could reopen the desired file across
+ * multiple stream-open calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid, we create this entry
+ * in the xidhash, create the streaming file, and store the fileset handle.
+ * The subxact file is created iff there is any subxact info under this xid.
+ * The entry is looked up on subsequent streams for the xid to get the
+ * corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table storing the streaming xid information along with the shared
+ * fileset handles for the stream and subxact files.  On every stream start
+ * we need to open the xid's files, and for that we need the shared fileset
+ * handle, so storing it in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the message to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +752,323 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * The ORIGIN message can only come inside a remote transaction or a
+	 * streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		(IsTransactionState() && !am_tablesync_worker() &&
+		 !in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; it will be committed at stream
+	 * stop.  We need the transaction for handling the BufFiles, used for
+	 * serializing the streamed data and the subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, read the existing subxact info */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty subtransaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure the change is applied in the per-message memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update the replication origin state so we can restart streaming from
+	 * the correct position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1081,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1099,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1138,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1256,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1408,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1781,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1922,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2050,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  It is reset at each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2162,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2435,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option was changed.  The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2481,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not exist yet, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * The shared fileset must survive across multiple stream start/stop
+		 * calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info.  There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * that memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need it for the whole stream so that we can keep adding subtransaction
+	 * info to it.  At stream stop we will flush this information to the
+	 * subxact file and reset the logical streaming context, freeing the
+	 * memory.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as the offset
+	 * of this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * BufFile, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context, so that
+	 * they remain available until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * The shared fileset must survive across multiple stream start/stop
+		 * calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with a length (not including
+ * the length field itself), an action code (identifying the message type),
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Clean up the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2145,6 +3080,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
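
The on-disk format of the spool files above is deliberately trivial:
stream_write_change() emits a native-endian int length covering the action
byte plus the remaining message payload, followed by those bytes, and
apply_handle_stream_commit() reads the records back and feeds them to
apply_dispatch(). (The subxact file is similarly just a uint32 count followed
by an array of SubXactInfo structs.) As a rough illustration of the framing
only - not part of the patch - here's a standalone sketch that walks such a
file with plain stdio instead of a BufFile; the filename follows the
subid-xid pattern from changes_filename(), with made-up values:

#include <stdio.h>
#include <stdlib.h>

/* Replay records framed as: int len, then len bytes (action + payload). */
static int
replay_spool_file(FILE *fp)
{
	int			nchanges = 0;
	int			len;

	/* read the length of the next on-disk record; EOF means we're done */
	while (fread(&len, sizeof(len), 1, fp) == 1)
	{
		char	   *buf = malloc(len);

		if (buf == NULL || fread(buf, 1, len, fp) != (size_t) len)
		{
			free(buf);
			return -1;			/* out of memory, or a truncated record */
		}

		/*
		 * buf[0] is the action byte ('I', 'U', 'D', 'T', ...); the rest is
		 * the logical replication message minus the subxact XID -- exactly
		 * what apply_handle_stream_commit() hands to apply_dispatch().
		 */
		free(buf);
		nchanges++;
	}

	return nchanges;
}

int
main(void)
{
	/* hypothetical subscription OID 16399, toplevel XID 1234 */
	FILE	   *fp = fopen("16399-1234.changes", "rb");

	if (fp != NULL)
	{
		printf("replayed %d changes\n", replay_spool_file(fp));
		fclose(fp);
	}
	return 0;
}

Truncating this file at a remembered (fileno, offset) position is what makes
the subxact-abort handling cheap: no record-level bookkeeping is needed.
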
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream side is, however, updated only at
+ * commit time, and with streamed transactions the commit order may differ
+ * from the order in which the transactions are sent.  Also, the
+ * (sub)transactions might get aborted, so we need to send the schema for
+ * each (sub)transaction so that we don't lose the schema information on
+ * abort.  To handle this, we maintain a list of xids (streamed_txns) for
+ * which we have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for this change. We don't
+	 * care whether it's a top-level transaction or not (we have already sent
+	 * that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema?  We track streamed transactions
+	 * separately, because those may not get applied at all (e.g. on abort),
+	 * and even when they are, it happens in an order we don't know at this
+	 * point (and regular transactions won't see their effects until then).
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify the downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions.  The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify the downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions.  The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record in the relation sync entry that we have already sent the schema of
+ * the relation in the given streamed transaction.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
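
Since the callback extension is meant to be usable by output plugins other
than pgoutput, here's a rough sketch of what opting in looks like from a
third-party plugin, assuming the callback signatures pgoutput uses above (the
my_* names are placeholders, and the stub bodies are elided):

#include "postgres.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"
#include "replication/reorderbuffer.h"
#include "utils/rel.h"

static void
my_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* open a streamed chunk: emit a message carrying txn->xid */
}

static void
my_stream_stop(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* close the current streamed chunk */
}

static void
my_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				XLogRecPtr abort_lsn)
{
	/* tell the receiver to discard what was streamed for this (sub)xact */
}

static void
my_stream_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 XLogRecPtr commit_lsn)
{
	/* tell the receiver to apply the previously streamed chunks */
}

static void
my_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)
{
	/* like the regular change callback, but inside a streamed chunk */
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
	/* ... the regular callbacks (startup_cb, change_cb, commit_cb, ...) ... */

	/* transaction streaming */
	cb->stream_start_cb = my_stream_start;
	cb->stream_stop_cb = my_stream_stop;
	cb->stream_abort_cb = my_stream_abort;
	cb->stream_commit_cb = my_stream_commit;
	cb->stream_change_cb = my_stream_change;
	/* stream_truncate_cb works the same way, omitted here */
}

Whether streaming actually happens is then negotiated in the startup
callback via ctx->streaming, as in pgoutput_startup() above.
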
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..655144d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction containing subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v47-0004-Enable-streaming-for-all-subscription-TAP-tests.patch
From 66ce76b712d6d2977c9b03299d4b022e2592c147 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v47 4/6] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318f..6f7bedc 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -78,7 +78,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b5..21410fa 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v47-0005-Add-TAP-test-for-streaming-vs.-DDL.patch
From 739b2d4ba7ea62535012b82d271bd9f75c2c977b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v47 5/6] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v47-0006-Add-streaming-option-in-pg_dump.patch
From 01e7e63bd1dde746bd07d0b9f4068df6e1645f41 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v47 6/6] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 94459b3..f69d64c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char	   *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1

#467Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#466)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 5, 2020 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Can we add a test for incomplete changes (probably with toast
insertion, but we can do it for the spec_insert case as well) in
ReorderBuffer, such that it needs to first serialize the changes and
then stream them? I have manually verified such scenarios, but it is
good to have a test for the same.

I have added a new test for the same in the stream.sql file.

Thanks, I have slightly changed the test so that we can consume DDL
changes separately. I have made a number of other adjustments like
changing a few more comments (to make them consistent with nearby
comments), removing an unnecessary header file inclusion, and running
pgindent.
The next patch (v47-0001-Implement-streaming-mode-in-ReorderBuffer) in
this series looks good to me. I am planning to push it after one more
read-through unless you or anyone else has any comments on the same.
The patch I am talking about has the following functionality:

Implement streaming mode in ReorderBuffer. Instead of serializing the
transaction to disk after reaching the logical_decoding_work_mem limit
in memory, we consume the changes we have in memory and invoke stream
API methods added by commit 45fdc9738b. However, sometimes if we have
incomplete toast or speculative insert we spill to the disk because we
can't stream till we have the complete tuple. And, as soon as we get
the complete tuple we stream the transaction including the serialized
changes. Now that we can stream in-progress transactions, the
concurrent aborts may cause failures when the output plugin consults
catalogs (both system and user-defined). We handle such failures by
returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table
scan APIs to the backend or WALSender decoding a specific uncommitted
transaction. The decoding logic on the receipt of such a sqlerrcode
aborts the decoding of the current transaction and continues with the
decoding of other transactions. We also provide a new option via SQL
APIs to fetch the changes being streamed.

This patch's functionality can be independently verified by SQL APIs
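For instance, the streamed changes can be inspected through the
test_decoding plugin (a minimal sketch; the table name is hypothetical,
and 'stream-changes' is the option added by this patch series):

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
-- a transaction large enough to exceed logical_decoding_work_mem
BEGIN;
INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 500) g(i);
COMMIT;
-- 'stream-changes' makes the SQL API emit the stream start/stop/commit messages
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
	'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');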

Your changes look fine to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#468Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#467)
5 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

..

This patch's functionality can be independently verified by SQL APIs

Your changes look fine to me.

I have pushed that patch last week and attached are the remaining
patches. I have made a few changes in the next patch
0001-Extend-the-BufFile-interface.patch and have some comments on it
which are as below:

1.
  case SEEK_END:
- /* could be implemented, not needed currently */
+
+ /*
+ * Get the file size of the last file to get the last offset of
+ * that file.
+ */
+ newFile = file->numFiles - 1;
+ newOffset = FileSize(file->files[file->numFiles - 1]);
+ if (newOffset < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not determine size of temporary file \"%s\" from
BufFile \"%s\": %m",
+ FilePathName(file->files[file->numFiles - 1]),
+ file->name)));
+ break;
  break;

There is no need for multiple breaks in the above code. I have fixed
this one in the attached patch.
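For reference, the corrected case (the same code as quoted above, with
the duplicate break dropped) reads:

		case SEEK_END:

			/*
			 * Get the file size of the last file to get the last offset of
			 * that file.
			 */
			newFile = file->numFiles - 1;
			newOffset = FileSize(file->files[file->numFiles - 1]);
			if (newOffset < 0)
				ereport(ERROR,
						(errcode_for_file_access(),
						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
								FilePathName(file->files[file->numFiles - 1]),
								file->name)));
			break;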

2.
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+ int newFile = file->numFiles;
+ off_t newOffset = file->curOffset;
+ char segment_name[MAXPGPATH];
+ int i;
+
+ /* Loop over all the files up to the fileno which we want to truncate. */
+ for (i = file->numFiles - 1; i >= fileno; i--)
+ {
+ /*
+ * Except the fileno, we can directly delete other files.  If the
+ * offset is 0 then we can delete the fileno file as well unless it is
+ * the first file.
+ */
+ if ((i != fileno || offset == 0) && fileno != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ newFile--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+ }
+ else
+ {
+ if (FileTruncate(file->files[i], offset,
+ WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not truncate file \"%s\": %m",
+ FilePathName(file->files[i]))));
+
+ newOffset = offset;
+ }
+ }
+
+ file->numFiles = newFile;
+ file->curOffset = newOffset;
+}

In the end, you have only set the 'numFiles' and 'curOffset' members of
BufFile and left the others. I think other members like 'curFile' also
need to be set, especially for the case where we have deleted segments
at the end; also, shouldn't we set 'pos' and 'nbytes' as we do in
BufFileSeek? If there is some reason that we don't need to set these
other members, then maybe it is better to add a comment to make it
clear.

Another thing we need to think about here is whether we need to flush
the buffer data for a dirty buffer. Consider a case where we truncate
the file up to a position that falls within the buffer: part of the
buffer contents becomes invalid, and the next time we flush such a
buffer the file can end up containing garbage. Maybe this will be
handled if we update the position in the buffer appropriately, but all
of this should be explained in comments. If what I said is correct,
then we can still skip the buffer flush in some cases, as we do in
BufFileSeek. Also, consider whether we need other handling (converting
a seek to "start of next seg" into "end of last seg") as we do after
changing the seek position in BufFileSeek.
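To make that concrete, one possible shape of the buffer adjustment (a
sketch only, reusing the member names from the patch, not a definitive
implementation) might be:

	/*
	 * Sketch: after truncating to (newFile, newOffset), keep the in-memory
	 * buffer consistent so that a later flush cannot write stale bytes back
	 * past the new end of file.
	 */
	if (newFile == file->curFile &&
		newOffset >= file->curOffset &&
		newOffset <= file->curOffset + file->nbytes)
	{
		/* Truncate point falls inside the buffered range: clip the buffer. */
		file->nbytes = (int) (newOffset - file->curOffset);
		if (file->pos > file->nbytes)
			file->pos = file->nbytes;
	}
	else if (newFile < file->curFile ||
			 (newFile == file->curFile && newOffset < file->curOffset))
	{
		/* The buffered range no longer exists: discard it and reposition. */
		file->curFile = newFile;
		file->curOffset = newOffset;
		file->pos = 0;
		file->nbytes = 0;
	}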

3.
/*
* Initialize a space for temporary files that can be opened by other backends.
* Other backends must attach to it before accessing it. Associate this
* SharedFileSet with 'seg'. Any contained files will be deleted when the
* last backend detaches.
*
* We can also use this interface if the temporary files are used only by
* single backend but the files need to be opened and closed multiple times
* and also the underlying files need to survive across transactions. For
* such cases, dsm segment 'seg' should be passed as NULL. We remove such
* files on proc exit.
*
* Files will be distributed over the tablespaces configured in
* temp_tablespaces.
*
* Under the covers the set is one or more directories which will eventually
* be deleted when there are no backends attached.
*/
void
SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
{
..

I think we can remove the part of the above comment after 'eventually
be deleted' (the last sentence of the comment), because the files can
now be removed in more than one way and that is already explained
earlier in the comment. If you can rephrase it differently to cover
the other case as well, then that is fine too.
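As an aside, a minimal sketch of the single-backend usage pattern that
comment describes (the file name and truncate arguments here are
hypothetical; O_RDWR comes from <fcntl.h>) would be:

	SharedFileSet fileset;
	BufFile    *file;

	/* seg == NULL: single-backend use; files are removed on proc exit. */
	SharedFileSetInit(&fileset, NULL);

	/* Create, write, and close within one transaction ... */
	file = BufFileCreateShared(&fileset, "xid-513-changes");
	BufFileClose(file);

	/* ... then reopen read-write in a later transaction. */
	file = BufFileOpenShared(&fileset, "xid-513-changes", O_RDWR);

	/* Discard everything from the start, e.g. on subtransaction abort. */
	BufFileTruncateShared(file, 0, 0);
	BufFileClose(file);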

--
With Regards,
Amit Kapila.

Attachments:

v48-0001-Extend-the-BufFile-interface.patch
From 046b7838f2bb63b63d9e385360ce242b10b312b2 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v48 1/5] Extend the BufFile interface.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across
transactions and need to be opened and closed multiple times. Such files
need to be created as a member of a SharedFileSet.

Implement the interface for BufFileTruncate to allow files to be truncated
up to a particular offset. Extend BufFileSeek API to support SEEK_END case.
Add an option to provide a mode while opening the shared BufFiles instead
of always opening in read-only mode.

These enhancements in BufFile interface are required for the upcoming
patch to allow the replication apply worker, to properly handle streamed
in-progress transactions.

Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/postmaster/pgstat.c           |  3 +
 src/backend/storage/file/buffile.c        | 85 +++++++++++++++++---
 src/backend/storage/file/fd.c             |  9 +--
 src/backend/storage/file/sharedfileset.c  | 98 +++++++++++++++++++++--
 src/backend/utils/sort/logtape.c          |  4 +-
 src/backend/utils/sort/sharedtuplestore.c |  2 +-
 src/include/pgstat.h                      |  1 +
 src/include/storage/buffile.h             |  4 +-
 src/include/storage/fd.h                  |  2 +-
 src/include/storage/sharedfileset.h       |  4 +-
 10 files changed, 185 insertions(+), 27 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 73ce944fb1..8116b23614 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 2d7a082320..f15cb4d561 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as
+ * a member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY);	/* can't write if read-only */
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,21 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The file size of the last file gives us the end offset of that
+			 * file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +855,51 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			newFile = file->numFiles;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over the files, from the last one down to the fileno to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files other than fileno can be deleted directly.  If the offset
+		 * is 0, we can delete the fileno file as well, unless it is the
+		 * first file.
+		 */
+		if ((i != fileno || offset == 0) && fileno != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			newFile--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = newFile;
+	file->curOffset = newOffset;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420efb2..f376a97ed6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594756..9a3dc102f5 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can also be used by backends when the temporary files need
+ * to be opened/closed multiple times and the underlying files need to survive
+ * across transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,19 +29,29 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but the files need to be opened and closed multiple times
+ * and the underlying files need to survive across transactions.  For such
+ * cases, dsm segment 'seg' should be passed as NULL.  We remove such files
+ * on proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering
+			 * the cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -222,6 +254,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 		SharedFileSetDeleteAll(fileset);
 }
 
+/*
+ * Callback function invoked on process exit.  This walks the list of
+ * all the registered shared filesets and deletes the underlying
+ * files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is following the dsm-based cleanup, then we don't
+	 * maintain the filesetlist, so simply return.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
 /*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59c50..788815cdab 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a4303b..b83fb50dac 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..807a9c1edf 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752bab0d..fc34c49522 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d7df..e209f047e8 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf077e5..d5edb600af 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
2.28.0.windows.1
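
To make the extended interface easier to review, here is a minimal usage
sketch (not part of the patch; the file name and the exact call sequence
are made up for illustration, the real usage is in the apply worker
changes in 0002):

	SharedFileSet fileset;
	BufFile    *file;
	int			fileno;
	off_t		offset;

	/* backend-private fileset: pass NULL for 'seg', cleaned up on proc exit */
	SharedFileSetInit(&fileset, NULL);
	file = BufFileCreateShared(&fileset, "changes-513");

	/* ... write the first stream of changes ... */
	BufFileTell(file, &fileno, &offset);	/* remember a subxact boundary */
	BufFileClose(file);

	/* later, possibly in another transaction: reopen for writing */
	file = BufFileOpenShared(&fileset, "changes-513", O_RDWR);
	BufFileSeek(file, 0, 0, SEEK_END);		/* append further changes */

	/* on subxact abort, discard everything after the remembered boundary */
	BufFileTruncateShared(file, fileno, offset);
	BufFileClose(file);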

v48-0004-Add-TAP-test-for-streaming-vs.-DDL.patch
From 32447cb24e836804076c8077e0ab87628e6ba469 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v48 4/5] Add TAP test for streaming vs. DDL

---
 .../subscription/t/014_stream_through_ddl.pl  | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000000..b8d78b1972
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.28.0.windows.1
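
For reference, the expected counts in the final check of this test work
out as follows: DDL is not replicated, so the subscriber keeps its
five-column table throughout, and each replicated row carries only the
columns the publisher table had at insert time. That gives 2 preexisting
rows + 998 + 6 * 1000 inserted rows = 7000 in total (the 1000 rows
inserted under the rolled-back savepoint s10 are discarded), of which
6000 were inserted after column 'c' was added, and 4000 each carry
values for 'd' and 'e'.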

v48-0005-Add-streaming-option-in-pg_dump.patch
From 916074be6bb00097fa05454957f1fce253a55521 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v48 5/5] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 9c8436dde6..4c18ea4e2d 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b731b1..cc10c7c1cc 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char       *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
2.28.0.windows.1

v48-0003-Enable-streaming-for-all-subscription-TAP-tests.patch
From aacc0f0a76031714e0de110f850f8e870d5b54d7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v48 3/5] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 0680f44a1a..4c9b48e9c2 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -82,7 +82,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2fbc..94c71f8ae2 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b552b..21410fac1c 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9181..a6fae9c3f1 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17a78..202871a658 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10a19..70c86b22ac 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6d63..f9c8d1d348 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334ed89..cdf9b8e7bb 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bdba9..21f50c7012 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133f69..30561d8f96 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae38592b..9a6bac6822 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bdc35..ed56fbf96c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cba4c..4df1ddef63 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1a8a..c3caff6149 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7c2f..c62eb521e7 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30f59..2be7542831 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0578..2da9607a7d 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9435..96ffc091b0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
2.28.0.windows.1

v48-0002-Add-support-for-streaming-to-built-in-replicatio.patch
From f3e4f3a9643b3b2c026670ce84434782a57154a2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v48 2/5] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying it on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we have nowhere
to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml      |   5 +-
 doc/src/sgml/ref/create_subscription.sgml     |  11 +
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/commands/subscriptioncmds.c       |  49 +-
 src/backend/postmaster/pgstat.c               |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |   4 +
 src/backend/replication/logical/proto.c       | 140 ++-
 src/backend/replication/logical/worker.c      | 946 +++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c   | 348 ++++++-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/pgstat.h                          |   6 +-
 src/include/replication/logicalproto.h        |  46 +-
 src/include/replication/walreceiver.h         |   1 +
 src/test/subscription/t/009_stream_simple.pl  |  86 ++
 src/test/subscription/t/010_stream_subxact.pl | 102 ++
 src/test/subscription/t/011_stream_ddl.pl     |  95 ++
 .../t/012_stream_subxact_abort.pl             |  82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |  84 ++
 src/test/subscription/t/015_stream_binary.pl  |  86 ++
 19 files changed, 2060 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70cdf..a81bd54efc 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal> and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c54fe..b7d7457d00 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf0c6..311d46225a 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377a85..4c58ad8b07 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23614..450346e9c7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 8afa5a29b4..ad574099ff 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -425,6 +425,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097bf5..ff25924e68 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b576e342cb..deaed0f2a6 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions also
+ * requires dealing with aborts of both the toplevel transaction and of
+ * individual subtransactions. This is achieved by tracking offsets for
+ * subtransactions, which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
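+ * For example, with subscription OID 16384 applying a remote toplevel
+ * transaction 123, the spool files are named "16384-123.changes" and
+ * "16384-123.subxacts" (see changes_filename and subxact_filename below).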
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size limit,
+ * (b) it provides automatic cleanup on error, and (c) it allows the files to
+ * survive local transactions, so they can be opened at stream start and
+ * closed at stream stop.  We use the SharedFileSet infrastructure because
+ * without it the files would be deleted as soon as they are closed, while
+ * keeping the stream files open across start/stop would consume a lot of
+ * memory (more than 8kB per file).  Moreover, without SharedFileSet we would
+ * need to invent a new way to pass filenames to the BufFile APIs, so that
+ * the same file could be reopened across multiple stream-open calls for the
+ * same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, and along with it create the streaming file and store the
+ * fileset handle.  The subxact file is created iff there is any subxact info
+ * under this xid.  This entry is used on subsequent streams for the xid to
+ * get the corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream context for streamed transactions */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,62 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with the shared
+ * file sets for the streaming and subxact files.  On every stream start we
+ * need to open the xid's files, and for that we need the shared file set
+ * handles, so storing them in the xid hash makes the lookups fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the currently open streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +291,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the message to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,16 +752,322 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside a remote transaction or inside a
+	 * streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		!in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction to handle the BufFile, used for
+	 * serializing the streamed data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = nsubxacts; i > 0; i--)
+		{
+			if (subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty subtransaction, we will not find the subxid here,
+		 * so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxacts[subidx].fileno, subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -635,6 +1081,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1099,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1138,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1256,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1408,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1781,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1922,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2050,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  The context is reset at each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2162,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2435,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option was changed. The launcher will start a new
+	 * worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2481,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* the entry for the toplevel transaction must exist by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions, there is nothing to do, but if a
+	 * subxact file already exists, delete it.
+	 */
+	if (nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));
+	BufFileWrite(fd, subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * We do, however, free the memory allocated for the subxact info; there
+	 * might be one exceptional transaction with many subxacts, and we don't
+	 * want to keep the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxacts);
+	Assert(nsubxacts == 0);
+	Assert(nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	Assert(found);
+
+	/*
+	 * If subxact_fileset is not valid, it means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	nsubxacts_max = 1 << my_log2(nsubxacts);
+
+	/*
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need this information for the whole duration of the stream, so that we
+	 * can add new subtransaction entries to it.  At stream stop we flush the
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're processing the same subxact as in the previous
+	 * call, so just ignore it (it has already been added to the array).
+	 */
+	if (subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (nsubxacts == nsubxacts_max)
+	{
+		nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts, nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd, &subxacts[nsubxacts].fileno,
+				&subxacts[nsubxacts].offset);
+
+	nsubxacts++;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the buffiles under the logical streaming context so that we
+	 * have those files until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with the length (not counting
+ * the length field itself), the action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxacts)
+		pfree(subxacts);
+
+	subxacts = NULL;
+	subxact_last = InvalidTransactionId;
+	nsubxacts = 0;
+	nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
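
To make the serialization format above concrete, here is an illustrative
layout of the two spool files for one toplevel transaction (all values are
made up):

    /*
     * 16384-123.changes -- each record is an int32 len followed by len
     * bytes (one action character plus the message minus the XID):
     *
     *   offset   0:  len =  57 | 'I' | insert from subxact 1001   (61 bytes)
     *   offset  61:  len = 213 | 'U' | update from subxact 1002   (217 bytes)
     *   offset 278:  end of file
     *
     * 16384-123.subxacts stores nsubxacts = 2 followed by the SubXactInfo
     * array { {1001, fileno 0, offset 0}, {1002, fileno 0, offset 61} }.
     * Aborting subxact 1002 thus truncates the changes file back to
     * offset 61 and leaves nsubxacts = 1.
     */
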
@@ -2151,6 +3086,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc4c1..3360bd5dd0 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream side is, however, updated only at
+ * commit time, and with streamed transactions the commit order may differ
+ * from the order in which the transactions are sent.  Also, the
+ * (sub)transactions might get aborted, so we need to send the schema for
+ * each (sub)transaction so that we don't lose the schema information on
+ * abort.  To handle this, we maintain a list of xids (streamed_txns) for
+ * which the schema has already been sent.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during the slot initialization mode. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be aborted later (and regular
+	 * transactions won't see their effects until then), and may be applied
+	 * in an order we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -605,6 +743,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * Notify the downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify the downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're now streaming a chunk of this transaction */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -641,6 +886,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions, so a simple
+ * linear search of the list is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Remember in the given relation sync entry that we have already sent the
+ * schema of the relation for this (toplevel) xid.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -771,11 +1048,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
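
For a third-party output plugin that wants to opt into streaming, the wiring
mirrors what pgoutput does above. A minimal sketch (the my_* callbacks are
hypothetical; pgoutput itself reuses its regular change/truncate callbacks):

    void
    _PG_output_plugin_init(OutputPluginCallbacks *cb)
    {
        /* ... the regular begin/change/commit callbacks ... */

        /* transaction streaming */
        cb->stream_start_cb = my_stream_start;
        cb->stream_stop_cb = my_stream_stop;
        cb->stream_abort_cb = my_stream_abort;
        cb->stream_commit_cb = my_stream_commit;
        cb->stream_change_cb = my_stream_change;
        cb->stream_truncate_cb = my_stream_truncate;
    }
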
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35000..1d091546bf 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Stream in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1edf..0dfbac46b4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc85c..655144d03a 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
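
To see how these messages interleave on the wire (and why pgoutput has to
track schema_sent per streamed toplevel xid), consider two large transactions
touching the same table. A hypothetical message sequence, using the action
bytes defined above (xids 500 and 501 are made up):

    S(500, first_segment=1)  R  I I ...  E   -- first chunk of xid 500, schema sent
    S(501, first_segment=1)  R  I ...    E   -- xid 501 needs its own schema message
    S(500, first_segment=0)  I I ...     E   -- later chunk of 500, schema already sent
    c(500)                                   -- STREAM COMMIT of xid 500
    A(501, 501)                              -- STREAM ABORT of toplevel xid 501
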
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbee54..6c0a4e30a8 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
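
The walreceiver side that turns this flag into the "streaming" option parsed
by pgoutput is not shown in this excerpt; presumably libpqwalreceiver appends
it to the START_REPLICATION option list along these lines (a hedged sketch,
variable names assumed):

    /* hypothetical: inside libpqwalreceiver's startstreaming code */
    if (options->proto.logical.streaming)
        appendStringInfoString(&cmd, ", streaming 'on'");
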
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check replicated data including columns added by DDL');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rollback to savepoint was reflected on subscriber');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rollback to savepoint with DDL was reflected on subscriber');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000000..fa2362e32b
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.28.0.windows.1

#469Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#468)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

..

This patch's functionality can be independently verified by SQL APIs

Your changes look fine to me.

I have pushed that patch last week and attached are the remaining
patches. I have made a few changes in the next patch
0001-Extend-the-BufFile-interface.patch and have some comments on it
which are as below:

Few more comments on the latest patches:
v48-0002-Add-support-for-streaming-to-built-in-replicatio
1. It appears to me that we don't remove the temporary folders created
by the apply worker. So, we have folders like
pgsql_tmp15324.0.sharedfileset in the base/pgsql_tmp directory even when
the apply worker exits. I think we can remove these by calling
PathNameDeleteTemporaryDir in SharedFileSetUnregister while removing
the fileset from the registered filesetlist.
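
For reference, the v49-0001 patch attached downthread already ends
SharedFileSetUnregister with this kind of cleanup (sketch; the comment
is added here for illustration):

	Assert(found);

	/*
	 * Delete all files in the set; this also removes the underlying
	 * pgsql_tmp*.sharedfileset directories via PathNameDeleteTemporaryDir.
	 */
	SharedFileSetDeleteAll(input_fileset);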

2.
+typedef struct SubXactInfo
+{
+ TransactionId xid; /* XID of the subxact */
+ int fileno; /* file number in the buffile */
+ off_t offset; /* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;

Would it be better if we moved all the subxact-related variables (like
nsubxacts, nsubxacts_max and subxact_last) into a struct alongside
SubXactInfo, since all of this information relates to sub-transactions
anyway?
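
One possible shape for that grouping (a hypothetical sketch; the struct
and field names here are illustrative, not taken from the patch):

typedef struct ApplySubXactData
{
	uint32			nsubxacts;		/* number of sub-transactions seen */
	uint32			nsubxacts_max;	/* allocated size of the subxacts array */
	TransactionId	subxact_last;	/* XID of the last sub-transaction */
	SubXactInfo	   *subxacts;		/* per-subxact file number and offset */
} ApplySubXactData;

static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};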

3.
+ /*
+ * If there is no subtransaction then nothing to do,  but if already have
+ * subxact file then delete that.
+ */

extra space before 'but' in the above sentence is not required.

v48-0001-Extend-the-BufFile-interface
4.
- * SharedFileSets can also be used by backends when the temporary files need
- * to be opened/closed multiple times and the underlying files need to survive
+ * SharedFileSets can be used by backends when the temporary files need to be
+ * opened/closed multiple times and the underlying files need to survive
  * across transactions.
  *

There is no need for 'also' in the above sentence.

--
With Regards,
Amit Kapila.

#470Thomas Munro
thomas.munro@gmail.com
In reply to: Amit Kapila (#468)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Aug 13, 2020 at 6:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have pushed that patch last week and attached are the remaining
patches. I have made a few changes in the next patch
0001-Extend-the-BufFile-interface.patch and have some comments on it
which are as below:

Hi Amit,

I noticed that Konstantin Knizhnik's CF entry 2386 calls
table_scan_XXX() functions from an extension, namely
contrib/auto_explain, and started failing to build on Windows after
commit 7259736a. This seems to be due to the new global variables
CheckXidAlive and bsysscan, which probably need PGDLLIMPORT if they
are accessed from inline functions that are part of the API that we
expect extensions to be allowed to call.
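
For reference, the fix would presumably amount to marking the
declarations in the backend headers like this (a sketch; the exact
header files are not shown in this thread):

/*
 * Without PGDLLIMPORT, extension code built on Windows fails to link
 * against these backend globals.
 */
extern PGDLLIMPORT TransactionId CheckXidAlive;
extern PGDLLIMPORT bool bsysscan;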

#471Amit Kapila
amit.kapila16@gmail.com
In reply to: Thomas Munro (#470)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Aug 14, 2020 at 10:11 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Aug 13, 2020 at 6:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have pushed that patch last week and attached are the remaining
patches. I have made a few changes in the next patch
0001-Extend-the-BufFile-interface.patch and have some comments on it
which are as below:

Hi Amit,

I noticed that Konstantin Knizhnik's CF entry 2386 calls
table_scan_XXX() functions from an extension, namely
contrib/auto_explain, and started failing to build on Windows after
commit 7259736a. This seems to be due to the new global variables
CheckXidAlive and bsysscan, which probably need PGDLLIMPORT if they
are accessed from inline functions that are part of the API that we
expect extensions to be allowed to call.

Yeah, that makes sense. I will take care of that later today or
tomorrow. We had not noticed that because currently none of the
extensions use those functions. BTW, I noticed that after a failure,
the next run is green. Why is that? Is the next run not on Windows?

--
With Regards,
Amit Kapila.

#472Thomas Munro
thomas.munro@gmail.com
In reply to: Amit Kapila (#471)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Aug 14, 2020 at 6:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, that makes sense. I will take care of that later today or
tomorrow. We have not noticed that because currently none of the
extensions is using those functions. BTW, I noticed that after
failure, the next run is green, why so? Is the next run not on
windows?

The three cfbot results are for applying the patch, testing on Windows
and testing on Ubuntu in that order. It's not at all clear and I'll
probably find a better way to display it when I get around to adding
some more operating systems, maybe with some OS icons or something
like that...

#473Amit Kapila
amit.kapila16@gmail.com
In reply to: Thomas Munro (#472)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Aug 15, 2020 at 4:14 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Aug 14, 2020 at 6:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, that makes sense. I will take care of that later today or
tomorrow. We have not noticed that because currently none of the
extensions is using those functions. BTW, I noticed that after
failure, the next run is green, why so? Is the next run not on
windows?

The three cfbot results are for applying the patch, testing on Windows
and testing on Ubuntu in that order. It's not at all clear and I'll
probably find a better way to display it when I get around to adding
some more operating systems, maybe with some OS icons or something
like that...

Good to know. Anyway, I have pushed a patch to mark those variables
with PGDLLIMPORT.

--
With Regards,
Amit Kapila.

#474Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#468)
5 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

..

This patch's functionality can be independently verified by SQL APIs

Your changes look fine to me.

I have pushed that patch last week and attached are the remaining
patches. I have made a few changes in the next patch
0001-Extend-the-BufFile-interface.patch and have some comments on it
which are as below:

1.
case SEEK_END:
- /* could be implemented, not needed currently */
+
+ /*
+ * Get the file size of the last file to get the last offset of
+ * that file.
+ */
+ newFile = file->numFiles - 1;
+ newOffset = FileSize(file->files[file->numFiles - 1]);
+ if (newOffset < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not determine size of temporary file \"%s\" from
BufFile \"%s\": %m",
+ FilePathName(file->files[file->numFiles - 1]),
+ file->name)));
+ break;
break;

There is no need for multiple breaks in the above code. I have fixed
this one in the attached patch.

Ok.

2.
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+ int newFile = file->numFiles;
+ off_t newOffset = file->curOffset;
+ char segment_name[MAXPGPATH];
+ int i;
+
+ /* Loop over all the files upto the fileno which we want to truncate. */
+ for (i = file->numFiles - 1; i >= fileno; i--)
+ {
+ /*
+ * Except the fileno, we can directly delete other files.  If the
+ * offset is 0 then we can delete the fileno file as well unless it is
+ * the first file.
+ */
+ if ((i != fileno || offset == 0) && fileno != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ newFile--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+ }
+ else
+ {
+ if (FileTruncate(file->files[i], offset,
+ WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not truncate file \"%s\": %m",
+ FilePathName(file->files[i]))));
+
+ newOffset = offset;
+ }
+ }
+
+ file->numFiles = newFile;
+ file->curOffset = newOffset;
+}

In the end, you have only set 'numFiles' and 'curOffset' members of
BufFile and left others. I think other members like 'curFile' also
need to be set especially for the case where we have deleted segments
at the end,

Yes, this must be set.

also, shouldn't we need to set 'pos' and 'nbytes' as we do in
BufFileSeek? If there is some reason that we don't set these
other members then maybe it is better to add a comment to make it
clear.

IMHO, we can directly call BufFileFlush; this will reset pos and
nbytes, and we can directly set the absolute location in curOffset.
Next time, BufFileRead/BufFileWrite will re-read the buffer, so
everything will be fine.

Another thing we need to think about here is whether we need to flush
the dirty buffer data. Consider a case where we truncate the file up
to a position that falls within the buffer: we might truncate the file
while part of the buffer contents has become invalid, and if we later
flush such a buffer then the file can end up containing garbage. Maybe
this is handled if we update the position in the buffer appropriately,
but all of this should be explained in comments. If what I said is
correct, then we can still skip the buffer flush in some cases, as we
do in BufFileSeek.

I think in all the cases we can flush the buffer and reset the pos and nbytes.
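
In code form, that resolution matches the tail of BufFileTruncateShared
in the attached v49-0001 patch (sketch):

	/* Must reposition the buffer, so flush any dirty data first */
	BufFileFlush(file);

	/* Adopt the new last segment and offset, and invalidate the buffer */
	file->numFiles = numFiles;
	file->curFile = curFile;
	file->curOffset = curOffset;
	file->pos = 0;
	file->nbytes = 0;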

Also, consider if we need to do other handling (convert seek to "start
of next seg" to "end of last seg") as we do after changing the seek
position in BufFileSeek.

We also do this when we truncate the complete file; see this:
+ if ((i != fileno || offset == 0) && fileno != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ newFile--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+ }

3.
/*
* Initialize a space for temporary files that can be opened by other backends.
* Other backends must attach to it before accessing it. Associate this
* SharedFileSet with 'seg'. Any contained files will be deleted when the
* last backend detaches.
*
* We can also use this interface if the temporary files are used only by
* single backend but the files need to be opened and closed multiple times
* and also the underlying files need to survive across transactions. For
* such cases, dsm segment 'seg' should be passed as NULL. We remove such
* files on proc exit.
*
* Files will be distributed over the tablespaces configured in
* temp_tablespaces.
*
* Under the covers the set is one or more directories which will eventually
* be deleted when there are no backends attached.
*/
void
SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
{
..

I think we can remove the part of the above comment after 'eventually
be deleted' (see last sentence in comment) because now the files can
be removed in more than one way and we have explained that in the
comments before this last sentence of the comment. If you can rephrase
it differently to cover the other case as well, then that is fine too.

I think it makes sense to remove, so I have removed it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v49-0001-Extend-the-BufFile-interface.patch (application/octet-stream)
From 287777a6fe46fb897eff7d77e11bd021ad549c56 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v49 1/5] Extend the BufFile interface.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times. Such files
need to be created as a member of a SharedFileSet.

Implement the interface for BufFileTruncate to allow files to be truncated
up to a particular offset. Extend BufFileSeek API to support SEEK_END case.
Add an option to provide a mode while opening the shared BufFiles instead
of always opening in read-only mode.

These enhancements in the BufFile interface are required for the upcoming
patch to allow the replication apply worker to properly handle streamed
in-progress transactions.

Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/postmaster/pgstat.c           |   3 +
 src/backend/storage/file/buffile.c        |  92 +++++++++++++++++++++++---
 src/backend/storage/file/fd.c             |   9 ++-
 src/backend/storage/file/sharedfileset.c  | 103 +++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |   4 +-
 src/backend/utils/sort/sharedtuplestore.c |   2 +-
 src/include/pgstat.h                      |   1 +
 src/include/storage/buffile.h             |   4 +-
 src/include/storage/fd.h                  |   2 +-
 src/include/storage/sharedfileset.h       |   4 +-
 10 files changed, 196 insertions(+), 28 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 73ce944..8116b23 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 2d7a082..939f092 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created
+ * as a member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,21 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The file size of the last file gives us the end offset of that
+			 * file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +855,58 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			numFiles = file->numFiles;
+	int			curFile = file->curFile;
+	off_t		curOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files up to the fileno which we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * We can directly delete the files other than fileno.  If the
+		 * offset is 0 then we can delete the file at fileno as well, unless
+		 * it is the first file.
+		 */
+		if ((i != fileno || offset == 0) && fileno != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			numFiles--;
+			curFile--;
+			curOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+			curOffset = offset;
+		}
+	}
+
+	/* Otherwise, must reposition buffer, so flush any dirty data */
+	BufFileFlush(file);
+
+	file->numFiles = numFiles;
+	file->curFile = curFile;
+	file->curOffset = curOffset;
+	file->pos = 0;
+	file->nbytes = 0;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..b183805 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can be used by backends when the temporary files need to be
+ * opened/closed multiple times and the underlying files need to survive across
+ * transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,25 +29,35 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by
+ * single backend but the files need to be opened and closed multiple times
+ * and also the underlying files need to survive across transactions.  For
+ * such cases, dsm segment 'seg' should be passed as NULL.  We remove such
+ * files on proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
  *
  * Under the covers the set is one or more directories which will eventually
- * be deleted when there are no backends attached.
+ * be deleted.
  */
 void
 SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering the
+			 * fileset clean up.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,61 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on the process exit.  This will
+ * process the list of all the registered sharedfilesets and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is following the dsm based cleanup then we don't
+	 * maintain the filesetlist so return.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+
+	/* Delete all files in the set */
+	SharedFileSetDeleteAll(input_fileset);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

v49-0005-Add-streaming-option-in-pg_dump.patch (application/octet-stream)
From 3eb2bce77f421276f7d8e0d5b7a82f534d16ff0b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v49 5/5] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 9c8436d..4c18ea4 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char       *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1

v49-0002-Add-support-for-streaming-to-built-in-replicatio.patch (application/octet-stream)
From 0ae2c657b82d6d49a34e20dbdc5eda8ef1c6adbf Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v49 2/5] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so we
don't have anywhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 960 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 src/tools/pgindent/typedefs.list                   |   3 +
 20 files changed, 2077 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal> and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23..450346e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 8afa5a2..ad57409 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -425,6 +425,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
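
(For context, with streaming enabled the START_REPLICATION command
built here comes out roughly as follows -- a sketch, with slot and
publication names as placeholders, and assuming the protocol version
is bumped to 2 for streaming per LOGICALREP_PROTO_STREAM_VERSION_NUM:

    START_REPLICATION SLOT "mysub" LOGICAL 0/0
        (proto_version '2', streaming 'on', publication_names '"mypub"')

The streaming option is only appended when the server is new enough,
per the version check above.)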
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..ff25924 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
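
To summarize the new wire formats added above (derived from the write
functions; integers are sent in network byte order):

    STREAM START   'S' | int32 xid | int8 first_segment (1/0)
    STREAM STOP    'E'
    STREAM COMMIT  'c' | int32 xid | int8 flags (0) | int64 commit_lsn
                       | int64 end_lsn | int64 commit_time
    STREAM ABORT   'A' | int32 xid | int32 subxid

The existing messages (INSERT, UPDATE, DELETE, TRUNCATE, RELATION,
TYPE) additionally gain an optional int32 xid right after the action
byte, present only when sent inside a streaming block.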
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b576e34..8d87556 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions has
+ * to deal with aborts of both the toplevel transaction and subtransactions.
+ * This is achieved by tracking offsets of subtransactions, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides automatic cleanup on error, and (c) it allows the
+ * files to survive across local transactions, so that they can be opened
+ * and closed at each stream start and stop.  We decided to use the
+ * SharedFileSet infrastructure because without it the files are deleted as
+ * soon as they are closed, and keeping the stream files open across
+ * start/stop would consume a lot of memory (more than 8kB).  Moreover,
+ * without SharedFileSet we would also need to invent a new way to pass
+ * filenames to the BufFile APIs, so that we could reopen the desired file
+ * across multiple stream-open calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid, we create this entry in
+ * the xidhash, create the streaming file, and store the fileset handle.  The
+ * subxact file is created iff there is any subxact info under this xid.  On
+ * subsequent streams for the same xid, this entry is used to look up the
+ * corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,68 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table storing the streaming xid information along with the shared
+ * filesets for the streaming and subxact files.  On every stream start we
+ * need to open the xid's files, and for that we need the shared fileset
+ * handles, so storing them in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+/* Sub-transaction data of the current streaming transaction. */
+typedef struct ApplySubXactData
+{
+	uint32		nsubxacts;		/* number of sub-transactions */
+	uint32		nsubxacts_max;	/* current capacity of subxacts */
+	TransactionId subxact_last; /* xid of the last sub-transaction */
+	SubXactInfo *subxacts;		/* sub-xact offset in file */
+} ApplySubXactData;
+
+static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +297,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +758,324 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * An ORIGIN message can only come inside a remote transaction or a
+	 * streamed transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		 !in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the BufFile,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
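
So on the apply side, a large transaction arrives as a sequence of
streaming blocks, each wrapped in a short local transaction that is
used only for the BufFile handling, roughly:

    S(xid, first_segment=1)  change ... change  E
    S(xid, first_segment=0)  change ... change  E
    ...
    c(xid)                   -- or A(xid, subxid) on abort

The changes are merely spooled to the file until the final commit
arrives.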
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
+	 * just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = subxact_data.nsubxacts; i > 0; i--)
+		{
+			if (subxact_data.subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxact_data.subxacts[subidx].fileno,
+							  subxact_data.subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		subxact_data.nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
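
As a concrete (made-up) example of the truncation logic: if the
subxact array contains {(xid=1001, fileno=0, offset=0), (xid=1002,
fileno=0, offset=8192)} and an abort arrives for subxid 1002, we
truncate the changes file back to (fileno=0, offset=8192) and set
nsubxacts to 1, discarding exactly the changes written on behalf of
1002 and any subxacts added after it.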
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1088,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1106,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1145,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1263,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1415,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1788,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1929,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2057,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  The context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2169,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2442,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option was changed.  The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because subscription's streaming option were changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2488,446 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for the top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (subxact_data.nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain shared fileset across multiple stream
+		 * start/stop calls.  So, need to allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &subxact_data.nsubxacts, sizeof(subxact_data.nsubxacts));
+	BufFileWrite(fd, subxact_data.subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Now free the memory allocated for the subxact info.  There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
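
The resulting subxact file layout is thus simply (with sizes as on
the apply worker's platform, including any struct padding):

    uint32 nsubxacts
    SubXactInfo subxacts[nsubxacts]   /* {xid, fileno, offset} each */

which is why the file can be cheaply rewritten as a whole on every
stream stop.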
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxact_data.subxacts);
+	Assert(subxact_data.nsubxacts == 0);
+	Assert(subxact_data.nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &subxact_data.nsubxacts,
+					sizeof(subxact_data.nsubxacts)) !=
+		sizeof(subxact_data.nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	subxact_data.nsubxacts_max = 1 << my_log2(subxact_data.nsubxacts);
+
+	/*
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need this information for the whole duration of the stream, so that we
+	 * can add new subtransaction info to it.  On stream stop we flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxact_data.subxacts = palloc(subxact_data.nsubxacts_max *
+								   sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxact_data.subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	SubXactInfo *subxacts = subxact_data.subxacts;
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make that check cheap by remembering the last XID.
+	 */
+	if (subxact_data.subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_data.subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts.  We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = subxact_data.nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (subxact_data.nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		subxact_data.nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (subxact_data.nsubxacts == subxact_data.nsubxacts_max)
+	{
+		subxact_data.nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts,
+							subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[subxact_data.nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd,
+				&subxacts[subxact_data.nsubxacts].fileno,
+				&subxacts[subxact_data.nsubxacts].offset);
+
+	subxact_data.nsubxacts++;
+	subxact_data.subxacts = subxacts;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER | HASH_FIND,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context so that we
+	 * have those files until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain shared fileset across multiple stream
+		 * start/stop calls.  So, need to allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the end of the file because we always
+		 * append to the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (not counting the
+ * length field itself), an action code (identifying the message type), and
+ * the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
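
So each record in the changes file ends up looking like:

    int32 len      /* sizeof(char) + payload size; excludes len itself */
    char  action   /* 'I', 'U', 'D', 'T', 'R', 'Y', ... */
    payload        /* len - 1 bytes: the message minus the leading XID */

which matches the reader loop in apply_handle_stream_commit above.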
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxact_data.subxacts)
+		pfree(subxact_data.subxacts);
+
+	subxact_data.subxacts = NULL;
+	subxact_data.subxact_last = InvalidTransactionId;
+	subxact_data.nsubxacts = 0;
+	subxact_data.nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2151,6 +3100,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort.  To handle this,
+ * we maintain a list of xids (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming.  It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming while the replication slot is being created. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change.  We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID when we started the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema?  We track streamed transactions
+	 * separately, because they are applied only later, if at all (and the
+	 * regular transactions won't see their effects until then), and in an
+	 * order that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
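+/*
+ * Send the start of a block of streamed changes for a transaction. The
+ * first block for a transaction also carries its replication origin, if any.
+ */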
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
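+/*
+ * Send the end of the current block of streamed changes.
+ */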
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions, so a
+ * simple linear search of the list is cheap enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record (in the rel sync entry) the xid of a streamed transaction in which
+ * we have already sent the schema of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming of in-progress transactions */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..655144d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b4948ac..543332b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -111,6 +111,7 @@ Append
 AppendPath
 AppendRelInfo
 AppendState
+ApplySubXactData
 Archive
 ArchiveEntryPtrType
 ArchiveFormat
@@ -2371,6 +2372,7 @@ StopList
 StopWorkersData
 StrategyNumber
 StreamCtl
+StreamXidHash
 StringInfo
 StringInfoData
 StripnullState
@@ -2381,6 +2383,7 @@ SubPlanState
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
 SubXactEvent
+SubXactInfo
 SubplanResultRelHashElem
 SubqueryScan
-- 
1.8.3.1
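
As a side note for anyone tracing the new protocol messages through the patch above: on the subscriber, the four streaming messages reduce to a dispatch on the action byte. The sketch below is purely illustrative; the action bytes ('S' stream start, 'E' stream stop, 'c' stream commit, 'A' stream abort) and the apply_handle_stream_* handler names follow the conventions of this patch series, but treat them as assumptions rather than code taken from the patch.

static void
apply_dispatch_streaming(StringInfo s)
{
	char		action = pq_getmsgbyte(s);

	switch (action)
	{
		case 'S':				/* start of a streamed chunk for some xid */
			apply_handle_stream_start(s);
			break;
		case 'E':				/* end of the current streamed chunk */
			apply_handle_stream_stop(s);
			break;
		case 'c':				/* commit: replay the buffered changes */
			apply_handle_stream_commit(s);
			break;
		case 'A':				/* abort: discard the (sub)transaction */
			apply_handle_stream_abort(s);
			break;
		default:
			elog(ERROR, "unrecognized streaming message type \"%c\"", action);
	}
}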

v49-0004-Add-TAP-test-for-streaming-vs.-DDL.patch (application/octet-stream)
From dea12a6bf4749a9500386a349c4841ac82465e0a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v49 4/5] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v49-0003-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From 6955a1e82b25127ca279d880bc51a7ab23bbaa49 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v49 3/5] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 0680f44..4c9b48e 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -82,7 +82,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b5..21410fa 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

#475Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#469)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Aug 13, 2020 at 6:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

..

This patch's functionality can be independently verified by SQL APIs

Your changes look fine to me.

I have pushed that patch last week and attached are the remaining
patches. I have made a few changes in the next patch
0001-Extend-the-BufFile-interface.patch and have some comments on it
which are as below:

A few more comments on the latest patches:
v48-0002-Add-support-for-streaming-to-built-in-replicatio
1. It appears to me that we don't remove the temporary folders created
by the apply worker. So, we have folders like
pgsql_tmp15324.0.sharedfileset in the base/pgsql_tmp directory even when
the apply worker exits. I think we can remove these by calling
PathNameDeleteTemporaryDir in SharedFileSetUnregister while removing
the fileset from the registered filesetlist.

I think we need to call SharedFileSetDeleteAll(input_fileset) from
SharedFileSetUnregister, so that all the directories created for this
fileset are removed.
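
For illustration, here is a minimal sketch of the flow being discussed
(the complete version is in the attached v50-0001; error handling and
asserts are omitted):

void
SharedFileSetUnregister(SharedFileSet *input_fileset)
{
	ListCell   *l;

	/* Drop the fileset from the list registered for proc-exit cleanup */
	foreach(l, filesetlist)
	{
		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);

		if (input_fileset == fileset)
		{
			filesetlist = list_delete_cell(filesetlist, l);
			break;
		}
	}

	/* Delete all files (and directories) backing this fileset */
	SharedFileSetDeleteAll(input_fileset);
}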

2.
+typedef struct SubXactInfo
+{
+ TransactionId xid; /* XID of the subxact */
+ int fileno; /* file number in the buffile */
+ off_t offset; /* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;

Would it be better if we moved all the subxact-related variables (like
nsubxacts, nsubxacts_max and subxact_last) inside the SubXactInfo struct,
as all the information anyway is related to sub-transactions?

I have moved them all to a structure.
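
Concretely, the consolidated layout now looks like this (taken from
v50-0002 below):

typedef struct SubXactInfo
{
	TransactionId xid;			/* XID of the subxact */
	int			fileno;			/* file number in the buffile */
	off_t		offset;			/* offset in the file */
} SubXactInfo;

/* Sub-transaction data of the current streaming transaction. */
typedef struct ApplySubXactData
{
	uint32		nsubxacts;		/* number of sub-transactions */
	uint32		nsubxacts_max;	/* current capacity of subxacts */
	TransactionId subxact_last; /* xid of the last sub-transaction */
	SubXactInfo *subxacts;		/* sub-xact offsets in file */
} ApplySubXactData;

static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};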

3.
+ /*
+ * If there is no subtransaction then nothing to do,  but if already have
+ * subxact file then delete that.
+ */

extra space before 'but' in the above sentence is not required.

Fixed

v48-0001-Extend-the-BufFile-interface
4.
- * SharedFileSets can also be used by backends when the temporary files need
- * to be opened/closed multiple times and the underlying files need to survive
+ * SharedFileSets can be used by backends when the temporary files need to be
+ * opened/closed multiple times and the underlying files need to survive
* across transactions.
*

No need of 'also' in the above sentence.

Fixed

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#476Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#475)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

In the last patch, v49-0001, there is one issue: I have called
BufFileFlush in all the cases. But, ideally, we cannot call this if
the underlying files are deleted/truncated, because those files/blocks
might no longer exist. So if the truncate position is within
the same buffer we just need to adjust the buffer; otherwise we need
to set curFile and curOffset to the absolute position and set
pos and nbytes to 0. The attached patch fixes this issue.

+ errmsg("could not truncate file \"%s\": %m",
+ FilePathName(file->files[i]))));
+ curOffset = offset;
+ }
+ }
+
+ /* Otherwise, must reposition buffer, so flush any dirty data */
+ BufFileFlush(file);
+
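
The relevant tail of BufFileTruncateShared in the attached version then
reads roughly as follows (curFile/curOffset hold the position reached by
the truncation loop):

	/*
	 * If the truncate point is within the existing buffer, just adjust the
	 * position within the buffer; the underlying blocks past that point may
	 * no longer exist, so no flush is attempted.
	 */
	if (curFile == file->curFile &&
		curOffset >= file->curOffset &&
		curOffset <= file->curOffset + file->nbytes)
	{
		file->pos = (int) (curOffset - file->curOffset);
		return;
	}

	/* Otherwise reposition to the new end and discard the buffer contents */
	file->numFiles = numFiles;
	file->curFile = curFile;
	file->curOffset = curOffset;
	file->pos = 0;
	file->nbytes = 0;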

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v50.tar (application/x-tar)

v50/v50-0001-Extend-the-BufFile-interface.patch:

From c56299ec739be4797d378bb24505d1c55bafe027 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v50 1/5] Extend the BufFile interface.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times. Such files need to be
created as a member of a SharedFileSet.

Implement the interface for BufFileTruncate to allow files to be truncated
up to a particular offset. Extend BufFileSeek API to support SEEK_END case.
Add an option to provide a mode while opening the shared BufFiles instead
of always opening in read-only mode.

These enhancements to the BufFile interface are required for the upcoming
patch to allow the replication apply worker to properly handle streamed
in-progress transactions.

Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/postmaster/pgstat.c           |   3 +
 src/backend/storage/file/buffile.c        | 103 +++++++++++++++++++++++++++---
 src/backend/storage/file/fd.c             |   9 ++-
 src/backend/storage/file/sharedfileset.c  | 103 +++++++++++++++++++++++++++---
 src/backend/utils/sort/logtape.c          |   4 +-
 src/backend/utils/sort/sharedtuplestore.c |   2 +-
 src/include/pgstat.h                      |   1 +
 src/include/storage/buffile.h             |   4 +-
 src/include/storage/fd.h                  |   2 +-
 src/include/storage/sharedfileset.h       |   4 +-
 10 files changed, 207 insertions(+), 28 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 73ce944..8116b23 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 2d7a082..7e03d8a 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as
+ * a member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -364,6 +368,9 @@ BufFileDeleteShared(SharedFileSet *fileset, const char *name)
 
 	if (!found)
 		elog(ERROR, "could not delete unknown shared BufFile \"%s\"", name);
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -666,11 +673,21 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The file size of the last file gives us the end offset of that
+			 * file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +855,69 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			numFiles = file->numFiles;
+	int			curFile = file->curFile;
+	off_t		curOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files up to the fileno which we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * Files other than the fileno can be deleted directly.  If the
+		 * offset is 0 then the fileno file can be deleted as well, unless
+		 * it is the first file.
+		 */
+		if ((i != fileno || offset == 0) && fileno != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			numFiles--;
+			curFile--;
+			curOffset = MAX_PHYSICAL_FILESIZE;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+			curOffset = offset;
+		}
+	}
+
+	/*
+	 * If the truncate point is within the existing buffer then we can just
+	 * adjust the position within the buffer, without flushing it.  Otherwise,
+	 * we reposition to the new end and discard the buffer contents; no flush
+	 * is needed, because the underlying files were already deleted/truncated.
+	 */
+	if (curFile == file->curFile &&
+		curOffset >= file->curOffset &&
+		curOffset <= file->curOffset + file->nbytes)
+	{
+		file->pos = (int) (curOffset - file->curOffset);
+		return;
+	}
+
+	file->numFiles = numFiles;
+	file->curFile = curFile;
+	file->curOffset = curOffset;
+	file->pos = 0;
+	file->nbytes = 0;
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..b183805 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can be used by backends when the temporary files need to be
+ * opened/closed multiple times and the underlying files need to survive across
+ * transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,25 +29,35 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but the files need to be opened and closed multiple times
+ * and the underlying files need to survive across transactions.  For such
+ * cases, the dsm segment 'seg' should be passed as NULL.  We remove such
+ * files on proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
  *
  * Under the covers the set is one or more directories which will eventually
- * be deleted when there are no backends attached.
+ * be deleted.
  */
 void
 SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering the
+			 * fileset cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -223,6 +255,61 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on process exit.  It walks the
+ * list of all the registered sharedfilesets and deletes the underlying
+ * files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is following the dsm-based cleanup then we don't
+	 * maintain the filesetlist, so there is nothing to do.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+
+	/* Delete all files in the set */
+	SharedFileSetDeleteAll(input_fileset);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1
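
As a side note, here is a minimal usage sketch of the extended interface
above (not part of the patch; spool_demo and the file name "changes" are
arbitrary names chosen for illustration, and error handling is omitted):

#include "postgres.h"

#include <fcntl.h>

#include "storage/buffile.h"
#include "storage/sharedfileset.h"

static SharedFileSet fileset;

static void
spool_demo(void)
{
	BufFile    *file;
	int			fileno;
	off_t		offset;
	char		data[] = "some change";

	/* seg = NULL: single-backend use, files removed on proc exit */
	SharedFileSetInit(&fileset, NULL);

	file = BufFileCreateShared(&fileset, "changes");
	BufFileWrite(file, data, sizeof(data));
	BufFileClose(file);			/* the underlying file survives the close */

	/* later, possibly in another transaction: reopen for read-write */
	file = BufFileOpenShared(&fileset, "changes", O_RDWR);

	/* new SEEK_END support: position at the current end of the file */
	BufFileSeek(file, 0, 0, SEEK_END);
	BufFileTell(file, &fileno, &offset);

	/* append more data, then roll it back with the new truncate API */
	BufFileWrite(file, data, sizeof(data));
	BufFileTruncateShared(file, fileno, offset);

	BufFileClose(file);

	/* remove the file segments and unregister the fileset */
	BufFileDeleteShared(&fileset, "changes");
}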

v50/v50-0002-Add-support-for-streaming-to-built-in-replicatio.patch:

From 8ccbca96517bc59401656c0453005fb12d6f501d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v50 2/5] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so we
would have nowhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 960 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 src/tools/pgindent/typedefs.list                   |   3 +
 20 files changed, 2077 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal> and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23..450346e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 8afa5a2..ad57409 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -425,6 +425,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..ff25924 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b576e34..8d87556 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions has
+ * to deal with aborts of both the toplevel transaction and subtransactions.
+ * This is achieved by tracking offsets for subtransactions, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of normal temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size limit,
+ * (b) it provides automatic cleanup on error, and (c) it allows these files
+ * to survive across local transactions so that they can be opened and closed
+ * at each stream start and stop.  We decided to use the SharedFileSet
+ * infrastructure because without it the files are deleted as soon as they
+ * are closed, and keeping the stream files open across stream start/stop
+ * would consume a lot of memory (more than 8K for each open file).  Moreover,
+ * without SharedFileSet we would also need to invent a new way to pass
+ * filenames to the BufFile APIs, so that the desired file can be opened
+ * across multiple stream open calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, and along with it create the streaming file and store the
+ * fileset handle.  The subxact file is created iff there is any subxact info
+ * under this xid.  This entry is used on subsequent streams for the xid to
+ * get the corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,68 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with shared file
+ * set for streaming and subxact files.  On every stream start we need to open
+ * the xid's files, and for that we need the shared fileset handle.  Storing
+ * it in the xid hash makes the lookup faster.
+ */
+static HTAB *xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+/* Sub-transaction data of the current streaming transaction. */
+typedef struct ApplySubXactData
+{
+	uint32		nsubxacts;		/* number of sub-transactions */
+	uint32		nsubxacts_max;	/* current capacity of subxacts */
+	TransactionId subxact_last; /* xid of the last sub-transaction */
+	SubXactInfo *subxacts;		/* sub-xact offset in file */
+} ApplySubXactData;
+
+static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +297,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +758,324 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside remote transaction or inside
+	 * streaming transaction and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		 !in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the BufFile,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = subxact_data.nsubxacts; i > 0; i--)
+		{
+			if (subxact_data.subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxact_data.subxacts[subidx].fileno,
+							  subxact_data.subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		subxact_data.nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the handle apply_dispatch methods are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1088,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1106,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1145,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1263,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1415,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1788,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1929,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2057,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  It is reset at each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2169,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2442,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option has changed. The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2488,446 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* the entry for the top-level transaction must exist by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions there's nothing to do, but if a
+	 * subxact file already exists, delete it.
+	 */
+	if (subxact_data.nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not exist yet, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &subxact_data.nsubxacts, sizeof(subxact_data.nsubxacts));
+	BufFileWrite(fd, subxact_data.subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Now free the memory allocated for the subxact info. There might be the
+	 * occasional transaction with many subxacts, and we don't want to keep
+	 * that memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
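
To illustrate the file layout subxact_info_write produces, here is a minimal stdio-based sketch (not part of the patch): the file holds the subxact count followed by the fixed-size SubXactInfo records, and is always rewritten as a whole. The field types here are assumptions for the sketch; the real code stores BufFile (fileno, offset) positions and goes through the shared fileset machinery.

#include <stdint.h>
#include <stdio.h>

typedef struct SubXactInfo
{
    uint32_t    xid;        /* subtransaction XID */
    int         fileno;     /* spool-file segment of its first change */
    long        offset;     /* offset within that segment */
} SubXactInfo;

/* Overwrite the whole subxact file: the count first, then the records. */
int
write_subxact_file(const char *path, const SubXactInfo *subxacts,
                   uint32_t nsubxacts)
{
    FILE       *fp = fopen(path, "wb");

    if (fp == NULL)
        return -1;

    if (fwrite(&nsubxacts, sizeof(nsubxacts), 1, fp) != 1 ||
        (nsubxacts > 0 &&
         fwrite(subxacts, sizeof(SubXactInfo), nsubxacts, fp) != nsubxacts))
    {
        fclose(fp);
        return -1;
    }

    return fclose(fp);      /* 0 on success */
}
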
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxact_data.subxacts);
+	Assert(subxact_data.nsubxacts == 0);
+	Assert(subxact_data.nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, it means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &subxact_data.nsubxacts,
+					sizeof(subxact_data.nsubxacts)) !=
+		sizeof(subxact_data.nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	subxact_data.nsubxacts_max = 1 << my_log2(subxact_data.nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context.  We need
+	 * this information for the whole duration of the stream so that we can
+	 * add subtransaction info to it.  At stream stop we will flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxact_data.subxacts = palloc(subxact_data.nsubxacts_max *
+								   sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxact_data.subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	SubXactInfo *subxacts = subxact_data.subxacts;
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_data.subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_data.subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = subxact_data.nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (subxact_data.nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		subxact_data.nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (subxact_data.nsubxacts == subxact_data.nsubxacts_max)
+	{
+		subxact_data.nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts,
+							subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[subxact_data.nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd,
+				&subxacts[subxact_data.nsubxacts].fileno,
+				&subxacts[subxact_data.nsubxacts].offset);
+
+	subxact_data.nsubxacts++;
+	subxact_data.subxacts = subxacts;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	BufFileDeleteShared(ent->stream_fileset, path);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		BufFileDeleteShared(ent->subxact_fileset, path);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context so that
+	 * we keep those files around until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
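
The create-versus-append distinction above is the crux of stream_open_file. A tiny stdio sketch of the same idea, as illustration only (the real code uses BufFileCreateShared/BufFileOpenShared and BufFileSeek):

#include <stdio.h>

/*
 * Open the per-transaction changes file.  The first streamed segment must
 * create it; later segments reopen it and append at the end.
 */
FILE *
open_changes_file(const char *path, int first_segment)
{
    FILE       *fp;

    if (first_segment)
        fp = fopen(path, "wbx");    /* "x" (C11): fail if it already exists */
    else
    {
        fp = fopen(path, "r+b");    /* reopen the file from earlier segments */
        if (fp != NULL)
            fseek(fp, 0, SEEK_END); /* we always append further changes */
    }

    return fp;
}
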
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (which does not
+ * count the length field itself), an action code (identifying the message
+ * type) and the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
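
To make the record layout concrete, here is a stdio-based sketch of this format together with the matching read loop used by apply_handle_stream_commit (illustration only; the patch uses BufFile throughout):

#include <stdio.h>

/*
 * Append one change record: an int length (the action byte plus payload,
 * not counting the length field itself), the action byte, the payload.
 */
int
write_change(FILE *fp, char action, const char *data, int datalen)
{
    int         len = datalen + 1;  /* payload plus the action byte */

    if (fwrite(&len, sizeof(len), 1, fp) != 1 ||
        fwrite(&action, 1, 1, fp) != 1 ||
        fwrite(data, 1, datalen, fp) != (size_t) datalen)
        return -1;
    return 0;
}

/*
 * Read the next record into buf (of bufsize bytes); returns the record
 * length, 0 at a clean end of file, -1 on a short or corrupt read.
 */
int
read_change(FILE *fp, char *buf, int bufsize)
{
    int         len;

    if (fread(&len, sizeof(len), 1, fp) == 0)
        return 0;               /* reached end of the file */

    if (len <= 0 || len > bufsize ||
        fread(buf, 1, len, fp) != (size_t) len)
        return -1;              /* truncated or corrupt record */

    return len;                 /* buf[0] is the action byte */
}
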
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info()
+{
+	if (subxact_data.subxacts)
+		pfree(subxact_data.subxacts);
+
+	subxact_data.subxacts = NULL;
+	subxact_data.subxact_last = InvalidTransactionId;
+	subxact_data.nsubxacts = 0;
+	subxact_data.nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2151,6 +3100,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit time,
+ * and with streamed transactions the commit order may differ from the order
+ * the transactions are sent in.  Also, the (sub)transactions might get
+ * aborted, so we need to send the schema for each (sub)transaction so that
+ * we don't lose the schema information on abort.  To handle this, we
+ * maintain a list of XIDs (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient protocol version, and when
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember XID of the (sub)transaction for the change. We don't care if
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may not be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions, so a simple
+ * list search is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid in the rel sync entry for which we have already sent the schema
+ * of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
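
In effect, each relation entry now answers "was the schema already sent within this streamed top-level transaction?" separately from the plain schema_sent flag. A small self-contained sketch of that bookkeeping (a fixed-size array instead of a List; all names here are invented for the sketch):

#include <stdint.h>

#define MAX_STREAMED_TXNS 64

typedef struct RelSyncEntry
{
    int         schema_sent;    /* schema known to be on the subscriber */
    uint32_t    streamed_txns[MAX_STREAMED_TXNS];   /* XIDs schema was sent in */
    int         nstreamed;
} RelSyncEntry;

int
schema_sent_for_xid(const RelSyncEntry *e, uint32_t xid)
{
    for (int i = 0; i < e->nstreamed; i++)
        if (e->streamed_txns[i] == xid)
            return 1;
    return 0;
}

/*
 * On stream commit/abort, forget the XID; on commit the subscriber has
 * applied the schema messages, so the plain flag can be set as well.
 */
void
cleanup_streamed_xid(RelSyncEntry *e, uint32_t xid, int is_commit)
{
    for (int i = 0; i < e->nstreamed; i++)
    {
        if (e->streamed_txns[i] == xid)
        {
            /* swap in the last element to keep the array compact */
            e->streamed_txns[i] = e->streamed_txns[--e->nstreamed];
            break;
        }
    }
    if (is_commit)
        e->schema_sent = 1;
}
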
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;                 /* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..655144d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
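
The new stream messages follow the existing protocol convention of a single-byte tag ('S', 'E', 'A' and 'c', as dispatched in the apply worker) followed by binary fields. As a hedged illustration only (the actual field layout is defined by the logicalrep_write_stream_* routines in the patch, not by this sketch), a STREAM START message might be assembled like this:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>          /* htonl */

/*
 * Build a hypothetical STREAM START message: tag byte 'S', the toplevel
 * XID, and a flag saying whether this is the transaction's first segment.
 * Returns the number of bytes written into buf (must hold >= 6 bytes).
 */
int
build_stream_start(char *buf, uint32_t xid, int first_segment)
{
    uint32_t    n_xid = htonl(xid);

    buf[0] = 'S';               /* message tag, as dispatched in the worker */
    memcpy(buf + 1, &n_xid, 4); /* XID in network byte order */
    buf[5] = first_segment ? 1 : 0;

    return 6;
}
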
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
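
For reference, the new flag ends up as just another pgoutput option. A hypothetical rendering of the option list the walreceiver would pass to the output plugin (the exact format is assumed here for illustration; parse_output_parameters above accepts streaming as 'on'/'off'):

#include <stdio.h>

/*
 * Hypothetical rendering of the plugin options implied by the new
 * WalRcvStreamOptions field, mirroring how pgoutput parses them.
 */
int
format_plugin_options(char *buf, size_t buflen, unsigned proto_version,
                      const char *publication_names, int streaming)
{
    return snprintf(buf, buflen,
                    "proto_version '%u', publication_names '%s', streaming '%s'",
                    proto_version, publication_names,
                    streaming ? "on" : "off");
}
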
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check data replicated after DDL and subtransaction rollbacks');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b4948ac..543332b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -111,6 +111,7 @@ Append
 AppendPath
 AppendRelInfo
 AppendState
+ApplySubXactData
 Archive
 ArchiveEntryPtrType
 ArchiveFormat
@@ -2371,6 +2372,7 @@ StopList
 StopWorkersData
 StrategyNumber
 StreamCtl
+StreamXidHash
 StringInfo
 StringInfoData
 StripnullState
@@ -2381,6 +2383,7 @@ SubPlanState
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
 SubXactEvent
+SubXactInfo
 SubplanResultRelHashElem
 SubqueryScan
-- 
1.8.3.1

v50/v50-0003-Enable-streaming-for-all-subscription-TAP-tests.patch

From f52b4b5d94fe83a7f8e54593e80492ee96ced77c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v50 3/5] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 0680f44..4c9b48e 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -82,7 +82,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b5..21410fa 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v50/v50-0004-Add-TAP-test-for-streaming-vs.-DDL.patch

From 094c085d381b4e1d793814a97e3ada175f5ff402 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v50 4/5] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check data replicated through DDL, subtransactions and rollbacks');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v50/v50-0005-Add-streaming-option-in-pg_dump.patch

From 674f1f336d764984d4f92537682292224bec2c43 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v50 5/5] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 9c8436d..4c18ea4 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4201,6 +4201,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4240,10 +4241,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4263,6 +4271,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4286,6 +4295,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4357,6 +4368,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char       *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1

#477Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#476)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In the last patch, v49-0001, there is one issue. Basically, I have
called BufFileFlush in all cases, but ideally we cannot call it if
the underlying files are deleted/truncated, because those files/blocks
might not exist anymore. So I think if the truncate position is within
the same buffer we just need to adjust the buffer; otherwise we just
need to set curFile and curOffset to the absolute position and set
pos and nbytes to 0. The attached patch fixes this issue.

A few comments on the latest patch, v50-0001-Extend-the-BufFile-interface:
1.
+
+ /*
+ * If the truncate point is within existing buffer then we can just
+ * adjust pos-within-buffer, without flushing buffer.  Otherwise,
+ * we don't need to do anything because we have already deleted/truncated
+ * the underlying files.
+ */
+ if (curFile == file->curFile &&
+ curOffset >= file->curOffset &&
+ curOffset <= file->curOffset + file->nbytes)
+ {
+ file->pos = (int) (curOffset - file->curOffset);
+ return;
+ }

I think in this case you have set the position correctly but what
about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
because the contents of the buffer are still valid but I don't think
the same is true here.

2.
+ int curFile = file->curFile;
+ off_t curOffset = file->curOffset;

I find the previous naming (newFile, newOffset) better, as it
distinguishes them from the BufFile variables.

3.
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
..
+ /* Delete all files in the set */
+ SharedFileSetDeleteAll(input_fileset);
..
}

I am not sure this is completely correct, because we call this
function (SharedFileSetUnregister) from BufFileDeleteShared, which
would have already removed all the required files. This raises the
question whether it is correct, from an API perspective, to call
SharedFileSetUnregister from BufFileDeleteShared, as one might not
want to remove the entire fileset at that point. It will work for
your use case (where, while removing the buffile, you also want to
remove the entire fileset), but I am not sure it is generic enough.
For your case, I wonder if we can call SharedFileSetDeleteAll
directly, and have a call like SharedFileSetUnregister that is
called from it.
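
For illustration, that arrangement could look roughly like this (a
sketch only; the tablespace loop is what SharedFileSetDeleteAll
already does today, and the unregister call is the new function from
your patch):

void
SharedFileSetDeleteAll(SharedFileSet *fileset)
{
	char		dirpath[MAXPGPATH];

	/* Delete the directory we created in each tablespace. */
	for (int i = 0; i < fileset->ntablespaces; ++i)
	{
		SharedFileSetPath(dirpath, fileset, fileset->tablespaces[i]);
		PathNameDeleteTemporaryDir(dirpath);
	}

	/* The whole set is gone, so drop its cleanup registration too. */
	SharedFileSetUnregister(fileset);
}

That way, deleting a single buffile would not force unregistering the
entire fileset.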

--
With Regards,
Amit Kapila.

#478Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#477)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In the last patch, v49-0001, there is one issue. Basically, I have
called BufFileFlush in all cases, but ideally we cannot call it if
the underlying files are deleted/truncated, because those files/blocks
might not exist anymore. So I think if the truncate position is within
the same buffer we just need to adjust the buffer; otherwise we just
need to set curFile and curOffset to the absolute position and set
pos and nbytes to 0. The attached patch fixes this issue.

A few comments on the latest patch, v50-0001-Extend-the-BufFile-interface:
1.
+
+ /*
+ * If the truncate point is within existing buffer then we can just
+ * adjust pos-within-buffer, without flushing buffer.  Otherwise,
+ * we don't need to do anything because we have already deleted/truncated
+ * the underlying files.
+ */
+ if (curFile == file->curFile &&
+ curOffset >= file->curOffset &&
+ curOffset <= file->curOffset + file->nbytes)
+ {
+ file->pos = (int) (curOffset - file->curOffset);
+ return;
+ }

I think in this case you have set the position correctly but what
about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
because the contents of the buffer are still valid but I don't think
the same is true here.

I think you need to set 'nbytes' to curOffset, as per your current
patch, since that is the new size of the file.
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
off_t offset)
                curOffset <= file->curOffset + file->nbytes)
        {
                file->pos = (int) (curOffset - file->curOffset);
+               file->nbytes = (int) curOffset;
                return;
        }

Also, what about 'numFiles'? That can also change due to the
removal of certain files; shouldn't it also be set in this case?

--
With Regards,
Amit Kapila.

#479Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#477)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Aug 19, 2020 at 10:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In the last patch, v49-0001, there is one issue. Basically, I have
called BufFileFlush in all cases, but ideally we cannot call it if
the underlying files are deleted/truncated, because those files/blocks
might not exist anymore. So I think if the truncate position is within
the same buffer we just need to adjust the buffer; otherwise we just
need to set curFile and curOffset to the absolute position and set
pos and nbytes to 0. The attached patch fixes this issue.

A few comments on the latest patch, v50-0001-Extend-the-BufFile-interface:
1.
+
+ /*
+ * If the truncate point is within existing buffer then we can just
+ * adjust pos-within-buffer, without flushing buffer.  Otherwise,
+ * we don't need to do anything because we have already deleted/truncated
+ * the underlying files.
+ */
+ if (curFile == file->curFile &&
+ curOffset >= file->curOffset &&
+ curOffset <= file->curOffset + file->nbytes)
+ {
+ file->pos = (int) (curOffset - file->curOffset);
+ return;
+ }

I think in this case you have set the position correctly but what
about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
because the contents of the buffer are still valid but I don't think
the same is true here.

Right, I think we need to set nbytes to the new file->pos, as shown below:

+ file->pos = (int) (curOffset - file->curOffset);
+ file->nbytes = file->pos;
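
(nbytes counts the valid bytes in the in-memory buffer, relative to
curOffset, so the buffer-relative file->pos is the right value here,
rather than the absolute curOffset.)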

2.
+ int curFile = file->curFile;
+ off_t curOffset = file->curOffset;

I find the previous naming (newFile, newOffset) better, as it
distinguishes them from the BufFile variables.

Ok

3.
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
..
+ /* Delete all files in the set */
+ SharedFileSetDeleteAll(input_fileset);
..
}

I am not sure this is completely correct, because we call this
function (SharedFileSetUnregister) from BufFileDeleteShared, which
would have already removed all the required files. This raises the
question whether it is correct, from an API perspective, to call
SharedFileSetUnregister from BufFileDeleteShared, as one might not
want to remove the entire fileset at that point. It will work for
your use case (where, while removing the buffile, you also want to
remove the entire fileset), but I am not sure it is generic enough.
For your case, I wonder if we can call SharedFileSetDeleteAll
directly, and have a call like SharedFileSetUnregister that is
called from it.

Yeah, it makes more sense to me to call SharedFileSetDeleteAll
directly instead of calling BufFileDeleteShared, and to call
SharedFileSetUnregister from SharedFileSetDeleteAll.
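That way, BufFileDeleteShared stays a per-file operation, and only an
explicit SharedFileSetDeleteAll tears down the fileset registration
along with its directories.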

I will make these changes and send the patch after some testing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#480Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#478)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In the last patch, v49-0001, there is one issue. Basically, I have
called BufFileFlush in all cases, but ideally we cannot call it if
the underlying files are deleted/truncated, because those files/blocks
might not exist anymore. So I think if the truncate position is within
the same buffer we just need to adjust the buffer; otherwise we just
need to set curFile and curOffset to the absolute position and set
pos and nbytes to 0. The attached patch fixes this issue.

A few comments on the latest patch, v50-0001-Extend-the-BufFile-interface:
1.
+
+ /*
+ * If the truncate point is within existing buffer then we can just
+ * adjust pos-within-buffer, without flushing buffer.  Otherwise,
+ * we don't need to do anything because we have already deleted/truncated
+ * the underlying files.
+ */
+ if (curFile == file->curFile &&
+ curOffset >= file->curOffset &&
+ curOffset <= file->curOffset + file->nbytes)
+ {
+ file->pos = (int) (curOffset - file->curOffset);
+ return;
+ }

I think in this case you have set the position correctly but what
about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
because the contents of the buffer are still valid but I don't think
the same is true here.

I think you need to set 'nbytes' to curOffset, as per your current
patch, since that is the new size of the file.
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
off_t offset)
curOffset <= file->curOffset + file->nbytes)
{
file->pos = (int) (curOffset - file->curOffset);
+               file->nbytes = (int) curOffset;
return;
}

Also, what about 'numFiles'? That can also change due to the
removal of certain files; shouldn't it also be set in this case?

Right, we need to set numFiles too. I will fix this as well.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#481Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#480)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In the last patch, v49-0001, there is one issue. Basically, I have
called BufFileFlush in all cases, but ideally we cannot call it if
the underlying files are deleted/truncated, because those files/blocks
might not exist anymore. So I think if the truncate position is within
the same buffer we just need to adjust the buffer; otherwise we just
need to set curFile and curOffset to the absolute position and set
pos and nbytes to 0. The attached patch fixes this issue.

A few comments on the latest patch, v50-0001-Extend-the-BufFile-interface:
1.
+
+ /*
+ * If the truncate point is within existing buffer then we can just
+ * adjust pos-within-buffer, without flushing buffer.  Otherwise,
+ * we don't need to do anything because we have already deleted/truncated
+ * the underlying files.
+ */
+ if (curFile == file->curFile &&
+ curOffset >= file->curOffset &&
+ curOffset <= file->curOffset + file->nbytes)
+ {
+ file->pos = (int) (curOffset - file->curOffset);
+ return;
+ }

I think in this case you have set the position correctly but what
about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
because the contents of the buffer are still valid but I don't think
the same is true here.

I think you need to set 'nbytes' to curOffset, as per your current
patch, since that is the new size of the file.
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
off_t offset)
curOffset <= file->curOffset + file->nbytes)
{
file->pos = (int) (curOffset - file->curOffset);
+               file->nbytes = (int) curOffset;
return;
}

Also, what about 'numFiles'? That can also change due to the
removal of certain files; shouldn't it also be set in this case?

Right, we need to set numFiles too. I will fix this as well.

I think there are a couple more problems in the truncate API.
Basically, if curFile and curOffset are already smaller than the
truncate location, the truncate should not change them. So the
truncate should only change curFile and curOffset if it is truncating
the part of the file that curFile or curOffset points into. I will
work on those along with your other comments and submit the updated
patch.
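
To make those cases concrete before I code it up, the position
adjustment after the underlying files have been removed/truncated
could be structured roughly like this (a sketch of the idea, not the
final patch; it assumes whole trailing segment files were already
removed and numFiles updated):

	if (newFile == file->curFile &&
		newOffset >= file->curOffset &&
		newOffset <= file->curOffset + file->nbytes)
	{
		/* Truncate point is inside the buffered data: shrink the buffer. */
		if (newOffset <= file->curOffset + file->pos)
			file->pos = (int) (newOffset - file->curOffset);
		file->nbytes = (int) (newOffset - file->curOffset);
	}
	else if (newFile < file->curFile ||
			 (newFile == file->curFile && newOffset < file->curOffset))
	{
		/* Truncate point is before the current position: move back to it. */
		file->curFile = newFile;
		file->curOffset = newOffset;
		file->pos = 0;
		file->nbytes = 0;
	}
	/* Otherwise the current position is before the truncate point: leave it. */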

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#482Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#481)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Aug 20, 2020 at 1:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In the last patch, v49-0001, there is one issue. Basically, I have
called BufFileFlush in all cases, but ideally we cannot call it if
the underlying files are deleted/truncated, because those files/blocks
might not exist anymore. So I think if the truncate position is within
the same buffer we just need to adjust the buffer; otherwise we just
need to set curFile and curOffset to the absolute position and set
pos and nbytes to 0. The attached patch fixes this issue.

A few comments on the latest patch, v50-0001-Extend-the-BufFile-interface:
1.
+
+ /*
+ * If the truncate point is within existing buffer then we can just
+ * adjust pos-within-buffer, without flushing buffer.  Otherwise,
+ * we don't need to do anything because we have already deleted/truncated
+ * the underlying files.
+ */
+ if (curFile == file->curFile &&
+ curOffset >= file->curOffset &&
+ curOffset <= file->curOffset + file->nbytes)
+ {
+ file->pos = (int) (curOffset - file->curOffset);
+ return;
+ }

I think in this case you have set the position correctly but what
about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
because the contents of the buffer are still valid but I don't think
the same is true here.

I think you need to set 'nbytes' to curOffset, as per your current
patch, since that is the new size of the file.
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
off_t offset)
curOffset <= file->curOffset + file->nbytes)
{
file->pos = (int) (curOffset - file->curOffset);
+               file->nbytes = (int) curOffset;
return;
}

Also, what about 'numFiles'? That can also change due to the
removal of certain files; shouldn't it also be set in this case?

Right, we need to set numFiles too. I will fix this as well.

I think there are a couple more problems in the truncate API.
Basically, if curFile and curOffset are already smaller than the
truncate location, the truncate should not change them. So the
truncate should only change curFile and curOffset if it is truncating
the part of the file that curFile or curOffset points into.

Right, I think this can happen if one has changed them via BufFileSeek
before doing the truncate. We should fix that case as well.

I will work on those along with your other comments and
submit the updated patch.

Thanks.

--
With Regards,
Amit Kapila.

#483Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#482)
2 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 20, 2020 at 1:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In the last patch, v49-0001, there is one issue. Basically, I have
called BufFileFlush in all cases, but ideally we cannot call it if
the underlying files are deleted/truncated, because those files/blocks
might not exist anymore. So I think if the truncate position is within
the same buffer we just need to adjust the buffer; otherwise we just
need to set curFile and curOffset to the absolute position and set
pos and nbytes to 0. The attached patch fixes this issue.

A few comments on the latest patch, v50-0001-Extend-the-BufFile-interface:
1.
+
+ /*
+ * If the truncate point is within existing buffer then we can just
+ * adjust pos-within-buffer, without flushing buffer.  Otherwise,
+ * we don't need to do anything because we have already deleted/truncated
+ * the underlying files.
+ */
+ if (curFile == file->curFile &&
+ curOffset >= file->curOffset &&
+ curOffset <= file->curOffset + file->nbytes)
+ {
+ file->pos = (int) (curOffset - file->curOffset);
+ return;
+ }

I think in this case you have set the position correctly but what
about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
because the contents of the buffer are still valid but I don't think
the same is true here.

I think you need to set 'nbytes' to curOffset, as per your current
patch, since that is the new size of the file.
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
off_t offset)
curOffset <= file->curOffset + file->nbytes)
{
file->pos = (int) (curOffset - file->curOffset);
+               file->nbytes = (int) curOffset;
return;
}

Also, what about 'numFiles'? That can also change due to the
removal of certain files; shouldn't it also be set in this case?

Right, we need to set numFiles too. I will fix this as well.

I think there are a couple more problems in the truncate API.
Basically, if curFile and curOffset are already smaller than the
truncate location, the truncate should not change them. So the
truncate should only change curFile and curOffset if it is truncating
the part of the file that curFile or curOffset points into.

Right, I think this can happen if one has changed them via BufFileSeek
before doing the truncate. We should fix that case as well.

Right.

I will work on those along with your other comments and
submit the updated patch.

I have fixed this in the attached patch, along with addressing your
other comments. I have also attached a contrib module that is used
just for testing the truncate API.
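
For reference, the kind of sequence such a module needs to exercise is
roughly the following (an illustrative sketch, not the attached module
itself; it assumes the patch's support for backend-local filesets via
SharedFileSetInit with a NULL segment):

#include "postgres.h"
#include "storage/buffile.h"
#include "storage/sharedfileset.h"

static void
test_buffile_truncate(void)
{
	SharedFileSet fileset;
	BufFile    *file;
	char		buf[8192] = {0};

	/* Backend-local fileset (NULL dsm segment), cleaned up at proc exit. */
	SharedFileSetInit(&fileset, NULL);
	file = BufFileCreateShared(&fileset, "truncate_test");

	/* Write enough data to span several internal buffers. */
	for (int i = 0; i < 100; i++)
		BufFileWrite(file, buf, sizeof(buf));

	/* Truncate back into the middle of the written data. */
	BufFileTruncateShared(file, 0, 4 * 8192);

	/* A read at the new EOF should now return no data. */
	if (BufFileSeek(file, 0, 4 * 8192, SEEK_SET) != 0)
		elog(ERROR, "BufFileSeek failed");
	if (BufFileRead(file, buf, sizeof(buf)) != 0)
		elog(ERROR, "unexpected data after truncate");

	BufFileClose(file);
	SharedFileSetDeleteAll(&fileset);
}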

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v51.tar (application/x-tar)

v51/v51-0004-Add-TAP-test-for-streaming-vs.-DDL.patch

From 9af9ff8c7b1efe4d5540338e76b5478f3d0883b5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v51 4/5] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check data replicated through DDL, subtransactions and rollbacks');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v51/v51-0002-Add-support-for-streaming-to-built-in-replicatio.patch

From 62bf90800bb3cd598c0ec1ede67b7b54e53e2364 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v51 2/5] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transaction by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so we
would have nowhere to send the data anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 960 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 src/tools/pgindent/typedefs.list                   |   3 +
 20 files changed, 2077 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal> and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23..450346e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 8afa5a2..ad57409 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -425,6 +425,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
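
The effect of this hunk can be modeled in isolation: the streaming option is
appended to the option list only when the subscription requested it and the
publisher is new enough. The following is a hypothetical standalone sketch
(build_options is a made-up stand-in for the appendStringInfo calls above,
and the protocol version value 2 is assumed for illustration):

    #include <stdio.h>

    static void
    build_options(char *buf, size_t buflen, int proto_version,
                  int server_version, int streaming)
    {
        int len = snprintf(buf, buflen, "proto_version '%d'", proto_version);

        /* append the streaming option only if requested and supported */
        if (streaming && server_version >= 140000)
            snprintf(buf + len, buflen - len, ", streaming 'on'");
    }

    int
    main(void)
    {
        char opts[128];

        build_options(opts, sizeof(opts), 2, 140000, 1);
        printf("(%s)\n", opts);  /* -> (proto_version '2', streaming 'on') */
        return 0;
    }
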
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..ff25924 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
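
To make the new framing concrete, here is a self-contained model of the bytes
logicalrep_write_stream_start() emits: the action byte 'S', the 32-bit
toplevel XID in network byte order (pq_sendint32 writes big-endian), and a
one-byte first-segment flag. This is an illustrative sketch, not the actual
pqformat code:

    #include <arpa/inet.h>      /* htonl */
    #include <stdint.h>
    #include <string.h>

    static size_t
    write_stream_start(unsigned char *buf, uint32_t xid, int first_segment)
    {
        uint32_t net_xid = htonl(xid);
        size_t   off = 0;

        buf[off++] = 'S';                    /* action STREAM START */
        memcpy(buf + off, &net_xid, 4);      /* toplevel transaction ID */
        off += 4;
        buf[off++] = first_segment ? 1 : 0;  /* 1 on the first segment */

        return off;                          /* 6 bytes total */
    }

    int
    main(void)
    {
        unsigned char frame[6];

        return write_stream_start(frame, 1234, 1) == 6 ? 0 : 1;
    }
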
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b576e34..1347031 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, applying streamed transactions
+ * has to handle aborts of both the toplevel transaction and of individual
+ * subtransactions. This is achieved by tracking the file offsets of the
+ * subtransactions, which are then used to truncate the file with
+ * serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere with each other.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides automatic cleanup on error, and (c) it allows the
+ * files to survive across local transactions, so that we can open and close
+ * them at stream start and stop.  We use the SharedFileSet infrastructure
+ * because without it the files would be deleted as soon as they are closed,
+ * and keeping the stream files open across start/stop would consume a lot
+ * of memory (more than 8kB per file).  Moreover, without SharedFileSet we
+ * would need to invent a new way to pass filenames to the BufFile APIs, so
+ * that we could reopen the desired file across multiple stream-open calls
+ * for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid, we create this entry
+ * in xidhash, and along with it create the streaming file and store the
+ * fileset handle.  The subxact file is created iff there is any subxact
+ * info under this xid.  This entry is used on subsequent streams for the
+ * xid to get the corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,68 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with the
+ * shared filesets for the streaming and subxact files.  On every stream
+ * start we need to open the xid's files, and for that we need the shared
+ * fileset handles, so storing them in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+/* Sub-transaction data of the current streaming transaction. */
+typedef struct ApplySubXactData
+{
+	uint32		nsubxacts;		/* number of sub-transactions */
+	uint32		nsubxacts_max;	/* current capacity of subxact_last */
+	TransactionId subxact_last; /* xid of the last sub-transaction */
+	SubXactInfo *subxacts;		/* sub-xact offset in file */
+} ApplySubXactData;
+
+static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +297,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the change to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +758,324 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside remote transaction or inside
+	 * streaming transaction and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		 !in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be
+	 * committed on stream stop.  We need the transaction for handling the
+	 * BufFile, used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = subxact_data.nsubxacts; i > 0; i--)
+		{
+			if (subxact_data.subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxact_data.subxacts[subidx].fileno,
+							  subxact_data.subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		subxact_data.nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware that we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1088,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1106,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1145,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1263,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1415,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1788,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1929,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2057,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode
+	 * is enabled.  This context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2169,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2442,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option has changed.  The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2488,446 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* We must have found the entry for the top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists, delete it.
+	 */
+	if (subxact_data.nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain shared fileset across multiple stream
+		 * start/stop calls.  So, need to allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &subxact_data.nsubxacts, sizeof(subxact_data.nsubxacts));
+	BufFileWrite(fd, subxact_data.subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info.  There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * the memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxact_data.subxacts);
+	Assert(subxact_data.nsubxacts == 0);
+	Assert(subxact_data.nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &subxact_data.nsubxacts,
+					sizeof(subxact_data.nsubxacts)) !=
+		sizeof(subxact_data.nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	subxact_data.nsubxacts_max = 1 << my_log2(subxact_data.nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context.  We
+	 * need this information for the whole stream, so that we can keep
+	 * adding subtransaction info to it.  On stream stop we flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxact_data.subxacts = palloc(subxact_data.nsubxacts_max *
+								   sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxact_data.subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	SubXactInfo *subxacts = subxact_data.subxacts;
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_data.subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_data.subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts.  We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = subxact_data.nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (subxact_data.nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		subxact_data.nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (subxact_data.nsubxacts == subxact_data.nsubxacts_max)
+	{
+		subxact_data.nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts,
+							subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[subxact_data.nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as the offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd,
+				&subxacts[subxact_data.nsubxacts].fileno,
+				&subxacts[subxact_data.nsubxacts].offset);
+
+	subxact_data.nsubxacts++;
+	subxact_data.subxacts = subxacts;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	SharedFileSetDeleteAll(ent->stream_fileset);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		SharedFileSetDeleteAll(ent->subxact_fileset);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER | HASH_FIND,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context, so that
+	 * we have those files until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain shared fileset across multiple stream
+		 * start/stop calls.  So, need to allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the end of the file because we always
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxact_data.subxacts)
+		pfree(subxact_data.subxacts);
+
+	subxact_data.subxacts = NULL;
+	subxact_data.subxact_last = InvalidTransactionId;
+	subxact_data.nsubxacts = 0;
+	subxact_data.nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2151,6 +3100,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
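
The spool-file record format produced by stream_write_change() and consumed
in apply_handle_stream_commit() can be modeled with plain stdio (the real
code uses BufFile on a SharedFileSet; this is a sketch of the format only):

    #include <stdio.h>
    #include <stdlib.h>

    /* one record: int length (action + payload), action byte, payload */
    static void
    write_change(FILE *f, char action, const char *payload, int plen)
    {
        int len = plen + (int) sizeof(char);    /* includes the action byte */

        fwrite(&len, sizeof(len), 1, f);
        fwrite(&action, sizeof(action), 1, f);
        fwrite(payload, 1, plen, f);
    }

    int
    main(void)
    {
        FILE *f = tmpfile();
        int   len;

        if (!f)
            return 1;

        write_change(f, 'I', "fake insert payload", 19);
        rewind(f);

        /* read the records back, as done at STREAM COMMIT */
        while (fread(&len, sizeof(len), 1, f) == 1)
        {
            char *buf = malloc(len);

            if (fread(buf, 1, len, f) != (size_t) len)
                return 1;
            printf("action '%c', %d payload bytes\n", buf[0], len - 1);
            free(buf);
        }
        return 0;
    }
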
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may be different
+ * from the order in which the transactions are sent.  Also, the (sub)
+ * transactions might get aborted, so we need to send the schema for each
+ * (sub) transaction so that we don't lose the schema information on abort.
+ * To handle this, we maintain a list of xids (streamed_txns) for which we
+ * have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't
+	 * care whether it's a top-level transaction or not (we have already
+	 * sent that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside of a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside of a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema of the relation has already been sent in the
+ * given streamed transaction. We expect a relatively small number of
+ * streamed transactions, so a simple list search is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid in the rel sync entry for which we have already sent the schema
+ * of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming of in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..655144d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
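
For review purposes, the message flow these additions produce for a single
streamed transaction looks as follows (a sketch using the function names
above; a large transaction is typically split into many start/stop blocks):

    stream_start(xid, first_segment = true)
        relation/type messages, each carrying the (sub)transaction's xid
        insert/update/delete/truncate messages, likewise carrying the xid
    stream_stop()
    stream_start(xid, first_segment = false)
        ... more changes ...
    stream_stop()
    stream_commit(txn, commit_lsn)    -- or stream_abort(xid, subxid)
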
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of a large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of a large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d99046..500623e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -111,6 +111,7 @@ Append
 AppendPath
 AppendRelInfo
 AppendState
+ApplySubXactData
 Archive
 ArchiveEntryPtrType
 ArchiveFormat
@@ -2370,6 +2371,7 @@ StopList
 StopWorkersData
 StrategyNumber
 StreamCtl
+StreamXidHash
 StringInfo
 StringInfoData
 StripnullState
@@ -2380,6 +2382,7 @@ SubPlanState
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
 SubXactEvent
+SubXactInfo
 SubplanResultRelHashElem
 SubqueryScan
-- 
1.8.3.1

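To see how the pieces fit together from the subscriber side, here is a
minimal sketch (connection values are illustrative): the subscription is
created with the new option,

    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION tap_pub
        WITH (streaming = on);

and the apply worker then passes it through in START_REPLICATION:

    START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
        (proto_version '2', publication_names '"tap_pub"', streaming 'on')

Requesting streaming with proto_version '1' is rejected by pgoutput with
"requested proto_version=1 does not support streaming, need 2 or higher".
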
From 3c13f5eb6104568ef897a5b709629222310a704d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v51 5/5] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)
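
With this, a subscription that has substream set is dumped with the new
option, roughly as follows (connection string and names are illustrative):

    CREATE SUBSCRIPTION sub1 CONNECTION '...' PUBLICATION pub1
        WITH (connect = false, slot_name = 'sub1', streaming = on);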

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 2cb3f9b..ca9d1fb 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4202,6 +4202,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4241,10 +4242,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4264,6 +4272,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4287,6 +4296,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char	   *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1

From 29bb8a1a330c1a50e57ee3e597ad0416d7e781b2 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v51 1/5] Extend the BufFile interface.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times. Such files need to be
created as a member of a SharedFileSet.

Implement the interface for BufFileTruncate to allow files to be truncated
up to a particular offset. Extend the BufFileSeek API to support the SEEK_END case.
Add an option to provide a mode while opening the shared BufFiles instead
of always opening in read-only mode.

These enhancements to the BufFile interface are required for an upcoming
patch that allows the replication apply worker to properly handle streamed
in-progress transactions.

Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/backend/postmaster/pgstat.c           |   3 +
 src/backend/storage/file/buffile.c        | 117 ++++++++++++++++++++--
 src/backend/storage/file/fd.c             |   9 +-
 src/backend/storage/file/sharedfileset.c  | 103 +++++++++++++++++--
 src/backend/utils/sort/logtape.c          |   4 +-
 src/backend/utils/sort/sharedtuplestore.c |   2 +-
 src/include/pgstat.h                      |   1 +
 src/include/storage/buffile.h             |   4 +-
 src/include/storage/fd.h                  |   2 +-
 src/include/storage/sharedfileset.h       |   4 +-
 10 files changed, 221 insertions(+), 28 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 73ce944fb1..8116b23614 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 2d7a082320..df9b9dcb96 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as
+ * a member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -666,11 +670,21 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The file size of the last file gives us the end offset of that
+			 * file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +852,86 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate the file up to the given fileno and offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			numFiles = file->numFiles;
+	int			newFile = fileno;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/* Loop over all the files up to the fileno that we want to truncate. */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		/*
+		 * The files beyond the fileno can be deleted directly.  The fileno
+		 * file itself can also be deleted if the offset is 0, unless it is
+		 * the first file.
+		 */
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			numFiles--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+
+			if (i == fileno)
+				newFile--;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = numFiles;
+
+	/*
+	 * If the truncate point is within the existing buffer, we can just
+	 * adjust the position within the buffer.
+	 */
+	if (newFile == file->curFile &&
+		newOffset >= file->curOffset &&
+		newOffset <= file->curOffset + file->nbytes)
+	{
+		/*
+		 * If the new position is smaller than the current position, adjust
+		 * the file position; otherwise it remains the same.
+		 */
+		if (newOffset <= file->curOffset + file->pos)
+			file->pos = (int) (newOffset - file->curOffset);
+
+		/* Adjust the nbytes for the current buffer. */
+		file->nbytes = (int) (newOffset - file->curOffset);
+	}
+
+	/*
+	 * If the new location is smaller than the current location in the file,
+	 * we need to set curFile and curOffset to the new values and also reset
+	 * pos and nbytes.  Otherwise there is nothing to do.
+	 */
+	else if ((newFile < file->curFile) ||
+			 newOffset < file->curOffset + file->pos)
+	{
+		file->curFile = newFile;
+		file->curOffset = newOffset;
+		file->pos = 0;
+		file->nbytes = 0;
+	}
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420efb2..f376a97ed6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594756..2a90446930 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can be used by backends when the temporary files need to be
+ * opened/closed multiple times and the underlying files need to survive across
+ * transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,25 +29,35 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but the files need to be opened and closed multiple times
+ * and also the underlying files need to survive across transactions.  For
+ * such cases, dsm segment 'seg' should be passed as NULL.  We remove such
+ * files on proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
  *
  * Under the covers the set is one or more directories which will eventually
- * be deleted when there are no backends attached.
+ * be deleted.
  */
 void
 SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering the
+			 * fileset clean up.
+			 * fileset cleanup.
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -192,6 +224,9 @@ SharedFileSetDeleteAll(SharedFileSet *fileset)
 		SharedFileSetPath(dirpath, fileset, fileset->tablespaces[i]);
 		PathNameDeleteTemporaryDir(dirpath);
 	}
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -222,6 +257,58 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 		SharedFileSetDeleteAll(fileset);
 }
 
+/*
+ * Callback function that will be invoked on the process exit.  This will
+ * process the list of all the registered sharedfilesets and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool	found = false;
+	ListCell *l;
+
+	/*
+	 * If the caller is using the dsm-based cleanup, we don't maintain the
+	 * filesetlist, so there is nothing to do.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach (l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
 /*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59c50..788815cdab 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a4303b..b83fb50dac 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..807a9c1edf 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752bab0d..fc34c49522 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d7df..e209f047e8 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf077e5..d5edb600af 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
2.23.0
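To make the intended single-backend usage of the interface extended above concrete, here is a minimal sketch; the function name, fileset name, and call sequence are illustrative assumptions (backend context, patched headers), not code from the patch:

#include "postgres.h"

#include <fcntl.h>

#include "storage/buffile.h"
#include "storage/sharedfileset.h"
#include "utils/memutils.h"

/*
 * Minimal sketch: a buffered temporary file owned by a single backend that
 * survives across transactions.  Passing seg = NULL to SharedFileSetInit
 * registers the fileset for cleanup at proc exit instead of at DSM detach.
 */
static void
single_backend_fileset_example(void)
{
	SharedFileSet *fileset;
	BufFile    *file;

	/* Allocate in a long-lived context so it survives the transaction. */
	fileset = MemoryContextAlloc(TopMemoryContext, sizeof(SharedFileSet));
	SharedFileSetInit(fileset, NULL);	/* NULL => single-backend usage */

	file = BufFileCreateShared(fileset, "xid-12345");
	BufFileWrite(file, "some change data", 16);
	BufFileClose(file);

	/* ...possibly in a later transaction: reopen for read/write... */
	file = BufFileOpenShared(fileset, "xid-12345", O_RDWR);
	BufFileTruncateShared(file, 0, 0);	/* discard the buffered changes */
	BufFileClose(file);

	/* Remove the files and unregister from the proc-exit cleanup list. */
	SharedFileSetDeleteAll(fileset);
}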

v51-0003-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From e1a11ad2821cbe3cdb77e92fb5335eab1f1cc026 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v51 3/5] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 0680f44..4c9b48e 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -82,7 +82,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b5..21410fa 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v1-0001-bufile_test.patch (application/octet-stream)
From 8c12f5d6c8fa572eb7eb17721aa2f09589897f53 Mon Sep 17 00:00:00 2001
From: dilip kumar <dilipbalaut@localhost.localdomain>
Date: Tue, 18 Aug 2020 13:44:53 +0530
Subject: [PATCH v1] bufile_test

---
 contrib/buffile_test/.gitignore            |   4 +
 contrib/buffile_test/Makefile              |  22 +++++
 contrib/buffile_test/buffile_test--1.0.sql |  13 +++
 contrib/buffile_test/buffile_test.c        | 109 +++++++++++++++++++++
 contrib/buffile_test/buffile_test.control  |   5 +
 5 files changed, 153 insertions(+)
 create mode 100644 contrib/buffile_test/.gitignore
 create mode 100644 contrib/buffile_test/Makefile
 create mode 100644 contrib/buffile_test/buffile_test--1.0.sql
 create mode 100644 contrib/buffile_test/buffile_test.c
 create mode 100644 contrib/buffile_test/buffile_test.control

diff --git a/contrib/buffile_test/.gitignore b/contrib/buffile_test/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/contrib/buffile_test/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/contrib/buffile_test/Makefile b/contrib/buffile_test/Makefile
new file mode 100644
index 0000000000..96da1928fa
--- /dev/null
+++ b/contrib/buffile_test/Makefile
@@ -0,0 +1,22 @@
+# contrib/buffile_test/Makefile
+
+MODULE_big	= buffile_test
+OBJS = \
+	$(WIN32RES) \
+	buffile_test.o
+
+EXTENSION = buffile_test
+DATA = buffile_test--1.0.sql
+PGFILEDESC = "buffile_test"
+
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/buffile_test
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/buffile_test/buffile_test--1.0.sql b/contrib/buffile_test/buffile_test--1.0.sql
new file mode 100644
index 0000000000..6305f3eef1
--- /dev/null
+++ b/contrib/buffile_test/buffile_test--1.0.sql
@@ -0,0 +1,13 @@
+/* contrib/buffile_test/buffile_test--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION buffile_test" to load this file. \quit
+
+--
+-- buffile_test()
+--
+CREATE FUNCTION buffile_test()
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'buffile_test'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
diff --git a/contrib/buffile_test/buffile_test.c b/contrib/buffile_test/buffile_test.c
new file mode 100644
index 0000000000..47eb0b5c1b
--- /dev/null
+++ b/contrib/buffile_test/buffile_test.c
@@ -0,0 +1,109 @@
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/nbtree.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "catalog/index.h"
+#include "catalog/pg_am.h"
+#include "commands/tablecmds.h"
+#include "lib/bloomfilter.h"
+#include "miscadmin.h"
+#include "storage/buf_internals.h"
+#include "storage/buffile.h"
+#include "storage/fd.h"
+#include "utils/memutils.h"
+#include "utils/snapmgr.h"
+
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(buffile_test);
+
+/* test truncate */
+static void
+buffile_test1(SharedFileSet *fileset)
+{
+	BufFile    *fd;
+	int			fileno = 0;
+	off_t			offset = 0;
+	size_t		nread = 0;
+	char		readbuf[100];
+
+	fd = BufFileCreateShared(fileset, "test_file");
+	BufFileWrite(fd, "aaaaaaaaaa", 10);
+	BufFileTell(fd, &fileno, &offset);
+	BufFileWrite(fd, "bbbbbbbbbb", 10);
+	BufFileTruncateShared(fd, fileno, offset);
+	BufFileWrite(fd, "ccccc", 5);
+	BufFileSeek(fd, 0, 0, SEEK_SET);
+	nread = BufFileRead(fd, readbuf, 20);
+
+	if (nread != 15)
+		elog(ERROR, "FAILED: unexpected bytes read");
+	else if (strncmp(readbuf, "aaaaaaaaaaccccc", 15) != 0)
+		elog(ERROR, "FAILED: unexpected data read");
+	else
+		elog(WARNING, "PASSED: expected bytes read");
+	BufFileClose(fd);
+
+	BufFileDeleteShared(fileset, "test_file");
+}
+
+#define MAX_PHYSICAL_FILESIZE	0x40000000
+#define BUFFILE_SEG_SIZE		(MAX_PHYSICAL_FILESIZE / BLCKSZ)
+
+/* test truncate across multiple files */
+static void
+buffile_test2(SharedFileSet *fileset)
+{
+	BufFile    *fd;
+	int			fileno = 0;
+	off_t		offset = 0;
+	size_t  	size = 0;
+	char		buf[BLCKSZ] = {'b'};
+	int			i;
+
+	fd = BufFileCreateShared(fileset, "test_file");
+	BufFileWrite(fd, "aaaaaaaaaaaaaaaaaaaa", 20);
+	BufFileTell(fd, &fileno, &offset);
+
+	/* write enough data to spill into three additional segment files */
+	for (i = 0; i < 3 * BUFFILE_SEG_SIZE; i++)
+	{
+		BufFileWrite(fd, buf, BLCKSZ);
+	}
+
+	/* seek to some location in the first file */
+	BufFileSeek(fd, 0, 10, SEEK_SET);
+
+	/* truncate within the first file and in same buffer */
+	BufFileTruncateShared(fd, fileno, 15);
+	size = BufFileSize(fd);
+	if (size == 15)
+		elog(WARNING, "PASSED: expected file size");
+	else
+		elog(WARNING, "FAILED: unexpected file size");
+
+	BufFileClose(fd);
+
+	BufFileDeleteShared(fileset, "test_file");
+}
+
+Datum
+buffile_test(PG_FUNCTION_ARGS)
+{
+	SharedFileSet *fileset;
+
+	fileset = palloc(sizeof(SharedFileSet));
+	SharedFileSetInit(fileset, NULL);
+
+	buffile_test1(fileset);
+	buffile_test2(fileset);
+
+	SharedFileSetDeleteAll(fileset);
+
+	PG_RETURN_VOID();
+}
diff --git a/contrib/buffile_test/buffile_test.control b/contrib/buffile_test/buffile_test.control
new file mode 100644
index 0000000000..a7c6fa280c
--- /dev/null
+++ b/contrib/buffile_test/buffile_test.control
@@ -0,0 +1,5 @@
+# buffile_test extension
+comment = 'test buffile'
+default_version = '1.0'
+module_pathname = '$libdir/buffile_test'
+relocatable = true
-- 
2.23.0
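To try the module: assuming it builds like any other contrib extension (run make install under contrib/buffile_test, or with USE_PGXS=1 against pg_config), CREATE EXTENSION buffile_test; followed by SELECT buffile_test(); should emit the PASSED warnings from both truncate tests.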

#484Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#483)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, I think this can happen if one has changed those by BufFileSeek
before doing truncate. We should fix that case as well.

Right.

I will work on those along with your other comments and
submit the updated patch.

I have fixed this in the attached patch along with your other
comments. I have also attached a contrib module that is just used for
testing the truncate API.

Few comments:
==============
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
{
..
+ if ((i != fileno || offset == 0) && i != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ numFiles--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+
+ if (i == fileno)
+ newFile--;
+ }

Here, shouldn't it be i <= fileno? We need to move curFile back to
newFile whenever curFile is greater than newFile.

2.
+ /*
+ * If the new location is smaller then the current location in file then
+ * we need to set the curFile and the curOffset to the new values and also
+ * reset the pos and nbytes.  Otherwise nothing to do.
+ */
+ else if ((newFile < file->curFile) ||
+ newOffset < file->curOffset + file->pos)
+ {
+ file->curFile = newFile;
+ file->curOffset = newOffset;
+ file->pos = 0;
+ file->nbytes = 0;
+ }

Shouldn't there be && instead of ||? If newFile is greater than
curFile, there is no point in updating it.

--
With Regards,
Amit Kapila.

#485Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#484)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, I think this can happen if one has changed those by BufFileSeek
before doing truncate. We should fix that case as well.

Right.

I will work on those along with your other comments and
submit the updated patch.

I have fixed this in the attached patch along with your other
comments. I have also attached a contrib module that is just used for
testing the truncate API.

Few comments:
==============
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
{
..
+ if ((i != fileno || offset == 0) && i != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ numFiles--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+
+ if (i == fileno)
+ newFile--;
+ }

Here, shouldn't it be i <= fileno? We need to move curFile back to
newFile whenever curFile is greater than newFile.

2.
+ /*
+ * If the new location is smaller then the current location in file then
+ * we need to set the curFile and the curOffset to the new values and also
+ * reset the pos and nbytes.  Otherwise nothing to do.
+ */
+ else if ((newFile < file->curFile) ||
+ newOffset < file->curOffset + file->pos)
+ {
+ file->curFile = newFile;
+ file->curOffset = newOffset;
+ file->pos = 0;
+ file->nbytes = 0;
+ }

Shouldn't there be && instead of ||? If newFile is greater than
curFile, there is no point in updating it.

Wait, actually, it is not clear to me which case the second condition
(newOffset < file->curOffset + file->pos) is trying to cover, so I
can't recommend anything here. Can you please explain why you have
added the second condition to the above check?

--
With Regards,
Amit Kapila.

#486Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#484)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, I think this can happen if one has changed those by BufFileSeek
before doing truncate. We should fix that case as well.

Right.

I will work on those along with your other comments and
submit the updated patch.

I have fixed this in the attached patch along with your other
comments. I have also attached a contrib module that is just used for
testing the truncate API.

Few comments:
==============
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
{
..
+ if ((i != fileno || offset == 0) && i != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ numFiles--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+
+ if (i == fileno)
+ newFile--;
+ }

Here, shouldn't it be i <= fileno? We need to move curFile back to
newFile whenever curFile is greater than newFile.

I think I now understand why you have added this condition, but a
comment along the lines of "This is required to indicate that we have
removed the given fileno" would be better for future readers.
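
To be concrete, something along these lines is what I have in mind
(placement only; the braces are added just to carry the comment):

if (i == fileno)
{
	/*
	 * This is required to indicate that we have removed the given
	 * fileno.
	 */
	newFile--;
}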

--
With Regards,
Amit Kapila.

#487Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#484)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, I think this can happen if one has changed those by BufFileSeek
before doing truncate. We should fix that case as well.

Right.

I will work on those along with your other comments and
submit the updated patch.

I have fixed this in the attached patch along with your other
comments. I have also attached a contrib module that is just used for
testing the truncate API.

Few comments:
==============
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
{
..
+ if ((i != fileno || offset == 0) && i != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ numFiles--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+
+ if (i == fileno)
+ newFile--;
+ }

Here, shouldn't it be i <= fileno? We need to move curFile back to
newFile whenever curFile is greater than newFile.

+/* Loop over all the files up to the fileno which we want to truncate. */
+for (i = file->numFiles - 1; i >= fileno; i--)

Because the above loop only goes down to fileno, I feel there is no
point in that check or any assert.

2.
+ /*
+ * If the new location is smaller than the current location in the file then
+ * we need to set the curFile and the curOffset to the new values and also
+ * reset the pos and nbytes.  Otherwise nothing to do.
+ */
+ else if ((newFile < file->curFile) ||
+ newOffset < file->curOffset + file->pos)
+ {
+ file->curFile = newFile;
+ file->curOffset = newOffset;
+ file->pos = 0;
+ file->nbytes = 0;
+ }

Shouldn't there be && instead of ||? If newFile is greater than
curFile, there is no point in updating it.

I think this condition is wrong; it should be:

else if ((newFile < file->curFile) || ((newFile == file->curFile) &&
(newOffset < file->curOffset + file->pos)))

Basically, either the new file is smaller, or if it is the same, then
the new offset should be smaller.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#488Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#486)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Aug 21, 2020 at 10:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, I think this can happen if one has changed those by BufFileSeek
before doing truncate. We should fix that case as well.

Right.

I will work on those along with your other comments and
submit the updated patch.

I have fixed this in the attached patch along with your other
comments. I have also attached a contrib module that is just used for
testing the truncate API.

Few comments:
==============
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
{
..
+ if ((i != fileno || offset == 0) && i != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ numFiles--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+
+ if (i == fileno)
+ newFile--;
+ }

Here, shouldn't it be i <= fileno? We need to move curFile back to
newFile whenever curFile is greater than newFile.

I think I now understand why you have added this condition, but a
comment along the lines of "This is required to indicate that we have
removed the given fileno" would be better for future readers.

Okay.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#489Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#487)
5 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Aug 21, 2020 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

2.
+ /*
+ * If the new location is smaller than the current location in the file then
+ * we need to set the curFile and the curOffset to the new values and also
+ * reset the pos and nbytes.  Otherwise nothing to do.
+ */
+ else if ((newFile < file->curFile) ||
+ newOffset < file->curOffset + file->pos)
+ {
+ file->curFile = newFile;
+ file->curOffset = newOffset;
+ file->pos = 0;
+ file->nbytes = 0;
+ }

Shouldn't there be && instead of ||? If newFile is greater than
curFile, there is no point in updating it.

I think this condition is wrong; it should be:

else if ((newFile < file->curFile) || ((newFile == file->curFile) &&
(newOffset < file->curOffset + file->pos)))

Basically, either the new file is smaller, or if it is the same, then
the new offset should be smaller.

I think we don't need to use file->pos for that, as it is relevant
only to the current buffer; without it, such a condition should
suffice. However, I was not happy with the way the code and conditions
were arranged in BufFileTruncateShared, so I have rearranged them and
changed quite a few comments in that API. Apart from that, I have
updated the docs and ran pgindent for the first patch. Do let me know
if you have any more comments on the first patch.
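
To be clear about the shape I mean, here is a rough sketch, not the
exact v52 code; the helper name is hypothetical, it assumes it lives in
buffile.c where the BufFile struct is visible, and it glosses over the
case where the truncate point falls inside the current in-memory buffer
(which needs separate handling):

/*
 * Sketch: pull the current read/write position back if it now lies
 * beyond the truncation point (newFile, newOffset).  file->pos and
 * file->nbytes describe only the in-memory buffer, so they are reset
 * instead of taking part in the comparison.
 */
static void
BufFileAdjustPosAfterTruncate(BufFile *file, int newFile, off_t newOffset)
{
	if (newFile < file->curFile ||
		(newFile == file->curFile && newOffset < file->curOffset))
	{
		file->curFile = newFile;
		file->curOffset = newOffset;
		file->pos = 0;
		file->nbytes = 0;
	}
}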

--
With Regards,
Amit Kapila.

Attachments:

v52-0004-Add-TAP-test-for-streaming-vs.-DDL.patch (application/octet-stream)
From c9e614fc38e5116d5bc520355dca416b2014b047 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v52 4/5] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v52-0003-Enable-streaming-for-all-subscription-TAP-tests.patch (application/octet-stream)
From 2c4fe92ac8881605349ac6f8854ebcfac55d16e6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v52 3/5] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 0680f44..4c9b48e 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -82,7 +82,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b5..21410fa 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

v52-0005-Add-streaming-option-in-pg_dump.patch (application/octet-stream)
From 3eaba1be4f9e7a0ea3353667ac2b702a10aa39d4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v52 5/5] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 2cb3f9b..ca9d1fb 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4202,6 +4202,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4241,10 +4242,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4264,6 +4272,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4287,6 +4296,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char       *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1

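A quick review note on the getSubscriptions() hunk above: the patch adds a
second version check identical to the one already used for subbinary. The two
branches could be folded into a single gate for all options added in v14; a
minimal sketch (mine, not part of the patch):

    if (fout->remoteVersion >= 140000)
        appendPQExpBufferStr(query,
                             " s.subbinary,\n"
                             " s.substream\n");
    else
        appendPQExpBufferStr(query,
                             " false AS subbinary,\n"
                             " false AS substream\n");

Using appendPQExpBufferStr() also avoids pushing constant strings through the
printf-style formatting path.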
Attachment: v52-0001-Extend-the-BufFile-interface.patch (application/octet-stream)
From e6d6aa0d98675469f59ce547fd08314b74bb4b0d Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v52 1/5] Extend the BufFile interface.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across transactions
and need to be opened and closed multiple times.  Such files need to be
created as a member of a SharedFileSet.

Implement a BufFileTruncate interface to allow files to be truncated up
to a particular offset.  Extend the BufFileSeek API to support the
SEEK_END case.  Add an option to provide a mode when opening shared
BufFiles, instead of always opening them in read-only mode.

These enhancements to the BufFile interface are required by the upcoming
patch to allow the replication apply worker to properly handle streamed
in-progress transactions.

Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/monitoring.sgml              |   4 +
 src/backend/postmaster/pgstat.c           |   3 +
 src/backend/storage/file/buffile.c        | 129 +++++++++++++++++++++++++++---
 src/backend/storage/file/fd.c             |   9 +--
 src/backend/storage/file/sharedfileset.c  | 104 ++++++++++++++++++++++--
 src/backend/utils/sort/logtape.c          |   4 +-
 src/backend/utils/sort/sharedtuplestore.c |   2 +-
 src/include/pgstat.h                      |   1 +
 src/include/storage/buffile.h             |   4 +-
 src/include/storage/fd.h                  |   2 +-
 src/include/storage/sharedfileset.h       |   4 +-
 11 files changed, 238 insertions(+), 28 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 304c49f..c3dba5c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1203,6 +1203,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Waiting for a write to a buffered file.</entry>
      </row>
      <row>
+      <entry><literal>BufFileTruncate</literal></entry>
+      <entry>Waiting for a buffered file to be truncated.</entry>
+     </row>
+     <row>
       <entry><literal>ControlFileRead</literal></entry>
       <entry>Waiting for a read from the <filename>pg_control</filename>
        file.</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 73ce944..8116b23 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 2d7a082..d581f96 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across transactions and need
+ * to be opened and closed multiple times.  Such files need to be created as
+ * a member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY);
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -666,11 +670,21 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The file size of the last file gives us the end offset of that
+			 * file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +852,98 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate a BufFile created by BufFileCreateShared up to the given fileno and
+ * the offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			numFiles = file->numFiles;
+	int			newFile = fileno;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/*
+	 * Loop over the files from the tail: remove the files whose number is
+	 * greater than the given fileno, and truncate the given file up to the
+	 * offset.  Note that we also remove the file with the given fileno if
+	 * the offset is 0, unless it is the first file, in which case we
+	 * truncate it instead.
+	 */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			numFiles--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+
+			/*
+			 * This is required to indicate that we have deleted the given
+			 * fileno.
+			 */
+			if (i == fileno)
+				newFile--;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = numFiles;
+
+	/*
+	 * If the truncate point is within the existing buffer, we can just
+	 * adjust the position within the buffer.
+	 */
+	if (newFile == file->curFile &&
+		newOffset >= file->curOffset &&
+		newOffset <= file->curOffset + file->nbytes)
+	{
+		/* No need to reset the current pos if the new pos is greater. */
+		if (newOffset <= file->curOffset + file->pos)
+			file->pos = (int) (newOffset - file->curOffset);
+
+		/* Adjust the nbytes for the current buffer. */
+		file->nbytes = (int) (newOffset - file->curOffset);
+	}
+	else if (newFile == file->curFile &&
+			 newOffset < file->curOffset)
+	{
+		/*
+		 * The truncate point is within the existing file but prior to the
+		 * current position, so we can forget the current buffer and reset the
+		 * current position.
+		 */
+		file->curOffset = newOffset;
+		file->pos = 0;
+		file->nbytes = 0;
+	}
+	else if (newFile < file->curFile)
+	{
+		/*
+		 * The truncate point is prior to the current file, so we need to reset
+		 * the current position accordingly.
+		 */
+		file->curFile = newFile;
+		file->curOffset = newOffset;
+		file->pos = 0;
+		file->nbytes = 0;
+	}
+	/* Nothing to do if the truncate point is beyond the current file. */
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..ac58344 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can be used by backends when the temporary files need to be
+ * opened/closed multiple times and the underlying files need to survive across
+ * transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,25 +29,35 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * This interface can also be used if the temporary files are used only by a
+ * single backend but need to be opened and closed multiple times, and the
+ * underlying files need to survive across transactions.  In such cases, the
+ * dsm segment 'seg' should be passed as NULL.  We remove such files on proc
+ * exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
  *
  * Under the covers the set is one or more directories which will eventually
- * be deleted when there are no backends attached.
+ * be deleted.
  */
 void
 SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
@@ -84,7 +98,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering the
+			 * cleanup callback.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +179,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -192,6 +224,9 @@ SharedFileSetDeleteAll(SharedFileSet *fileset)
 		SharedFileSetPath(dirpath, fileset, fileset->tablespaces[i]);
 		PathNameDeleteTemporaryDir(dirpath);
 	}
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -223,6 +258,59 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function invoked on process exit.  This walks the list of all
+ * registered SharedFileSets and deletes the underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell   *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach(l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool		found = false;
+	ListCell   *l;
+
+	/*
+	 * If the caller is using dsm-based cleanup then we don't maintain the
+	 * filesetlist, so there is nothing to do.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach(l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

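To make the 0001 interface changes easier to review in one place, here is a
minimal usage sketch (mine, not from the patch) of how a single backend can
keep a spool file alive across transactions, append to it via the new
SEEK_END support, and truncate it back to a remembered position. The function
and file names are made up for illustration:

    #include "postgres.h"

    #include <fcntl.h>

    #include "storage/buffile.h"
    #include "storage/sharedfileset.h"

    static void
    spool_changes_example(void)
    {
        /* in real use, allocate the fileset in a long-lived memory context */
        SharedFileSet *fileset = palloc(sizeof(SharedFileSet));
        BufFile    *fd;
        int         fileno;
        off_t       offset;
        char        data[] = "change record";

        /* seg = NULL: single-backend use, files removed on proc exit */
        SharedFileSetInit(fileset, NULL);

        /* create the spool file and write an initial record */
        fd = BufFileCreateShared(fileset, "changes-xid-1234");
        BufFileWrite(fd, data, sizeof(data));

        /* remember the position, e.g. at a subxact boundary */
        BufFileTell(fd, &fileno, &offset);
        BufFileClose(fd);

        /* later, possibly in another transaction: reopen read-write */
        fd = BufFileOpenShared(fileset, "changes-xid-1234", O_RDWR);

        /* append at the end, using the new SEEK_END support */
        BufFileSeek(fd, 0, 0, SEEK_END);
        BufFileWrite(fd, data, sizeof(data));

        /* on subxact abort, discard everything past the saved position */
        BufFileTruncateShared(fd, fileno, offset);
        BufFileClose(fd);
    }

This is essentially the life cycle the apply worker in 0002 follows for its
per-transaction spool files.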
Attachment: v52-0002-Add-support-for-streaming-to-built-in-replicatio.patch (application/octet-stream)
From bfc118516bc8b5f3924278254d934baac1ff6c48 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v52 2/5] Add support for streaming to built-in replication

To add support for streaming of in-progress transactions to the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

However, we must explicitly disable streaming during replication slot
creation, even if the plugin supports it. We don't need to replicate
the changes accumulated during this phase, and moreover we don't have
a replication connection open, so we have nowhere to send the data
anyway.
---
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  49 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 140 ++-
 src/backend/replication/logical/worker.c           | 960 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 src/tools/pgindent/typedefs.list                   |   3 +
 20 files changed, 2077 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a81bd54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..4c58ad8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,8 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+		*streaming_given = false;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +197,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +350,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +375,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -439,6 +455,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subpublications - 1] =
 		publicationListToArray(publications);
 
+	if (streaming_given)
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(streaming);
+	else
+		values[Anum_pg_subscription_substream - 1] =
+			BoolGetDatum(false);
+
 	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);
 
 	/* Insert tuple into catalog. */
@@ -698,6 +721,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +732,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +765,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +789,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +834,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +877,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23..450346e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "ReorderLogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "ReorderLogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "ReorderLogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "ReorderLogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 8afa5a2..ad57409 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -425,6 +425,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..ff25924 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,104 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b576e34..1347031 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires handling aborts of both the toplevel transaction and of
+ * subtransactions.  This is achieved by tracking offsets for
+ * subtransactions, which are then used to truncate the file with the
+ * serialized changes.
+ *
+ * The files are placed in the temporary files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides automatic cleanup on error, and (c) it allows the
+ * files to survive across local transactions, so they can be opened and
+ * closed at each stream start and stop.  We use the SharedFileSet
+ * infrastructure because without it the files would be deleted as soon as
+ * they are closed, while keeping the stream files open across stream
+ * start/stop would consume a lot of memory (more than 8kB per file).
+ * Moreover, without SharedFileSet we would need to invent a new way to pass
+ * filenames to the BufFile APIs, so that the desired file could be reopened
+ * across multiple stream-open calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid we create this entry in
+ * the xidhash, and along with it create the streaming file and store the
+ * fileset handle.  The subxact file is created iff there is any subxact info
+ * under this xid.  This entry is used on subsequent streams for the xid to
+ * get the corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per-stream context for streaming transactions */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,68 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table storing the streaming xid information along with the shared
+ * filesets for the streaming and subxact files.  On every stream start we
+ * need to open the xid's files, and for that we need the shared fileset
+ * handle; storing it in the xid hash makes that lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* Buf file handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+/* Sub-transaction data of the current streaming transaction. */
+typedef struct ApplySubXactData
+{
+	uint32		nsubxacts;		/* number of sub-transactions */
+	uint32		nsubxacts_max;	/* current capacity of subxact_last */
+	TransactionId subxact_last; /* xid of the last sub-transaction */
+	SubXactInfo *subxacts;		/* sub-xact offset in file */
+} ApplySubXactData;
+
+static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +297,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the message to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +758,324 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside remote transaction or inside
+	 * streaming transaction and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		 !in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop.  We need the transaction for handling the BufFile,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
+	 * just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = subxact_data.nsubxacts; i > 0; i--)
+		{
+			if (subxact_data.subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxact_data.subxacts[subidx].fileno,
+							  subxact_data.subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		subxact_data.nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handlers are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1088,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1106,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1145,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1263,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1415,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1788,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1929,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2057,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled.  It is reset at each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2169,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2442,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option has changed. The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because the subscription's streaming option was changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2488,446 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* by now we must have created the entry for this top-level transaction */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions there is nothing to do, but if a
+	 * subxact file already exists, delete it.
+	 */
+	if (subxact_data.nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not exist yet, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain shared fileset across multiple stream
+		 * start/stop calls.  So, need to allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &subxact_data.nsubxacts, sizeof(subxact_data.nsubxacts));
+	BufFileWrite(fd, subxact_data.subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info. There might be an
+	 * occasional transaction with a large number of subxacts, and we don't
+	 * want to keep that memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxact_data.subxacts);
+	Assert(subxact_data.nsubxacts == 0);
+	Assert(subxact_data.nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/* By this time we must have created the transaction entry */
+	Assert(found);
+
+	/*
+	 * If subxact_fileset is not valid, we don't have any subxact info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &subxact_data.nsubxacts,
+					sizeof(subxact_data.nsubxacts)) !=
+		sizeof(subxact_data.nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	subxact_data.nsubxacts_max = 1 << my_log2(subxact_data.nsubxacts);
+
+	/*
+	 * Allocate the subxact information in the logical streaming context.  We
+	 * need this information for the whole duration of the stream, so that we
+	 * can add new subtransactions to it.  At stream stop we flush it back to
+	 * the subxact file and reset the logical streaming context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxact_data.subxacts = palloc(subxact_data.nsubxacts_max *
+								   sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxact_data.subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	SubXactInfo *subxacts = subxact_data.subxacts;
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_data.subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_data.subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = subxact_data.nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (subxact_data.nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		subxact_data.nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (subxact_data.nsubxacts == subxact_data.nsubxacts_max)
+	{
+		subxact_data.nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts,
+							subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[subxact_data.nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd,
+				&subxacts[subxact_data.nsubxacts].fileno,
+				&subxacts[subxact_data.nsubxacts].offset);
+
+	subxact_data.nsubxacts++;
+	subxact_data.subxacts = subxacts;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	SharedFileSetDeleteAll(ent->stream_fileset);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		SharedFileSetDeleteAll(ent->subxact_fileset);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buf file, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the buffiles under the logical streaming context, so that
+	 * they stay around until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain shared fileset across multiple stream
+		 * start/stop calls.  So, need to allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (not counting the
+ * length field itself), an action code (identifying the message type) and
+ * the message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes a CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxact_data.subxacts)
+		pfree(subxact_data.subxacts);
+
+	subxact_data.subxacts = NULL;
+	subxact_data.subxact_last = InvalidTransactionId;
+	subxact_data.nsubxacts = 0;
+	subxact_data.nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2151,6 +3100,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
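
To make the worker changes easier to review, here is a standalone sketch of
the two on-disk formats used for streamed transactions, as I read them from
stream_write_change(), subxact_info_write() and the BufFileTell() call
above. The struct layouts shown here are inferred from the code, not quoted
from the patch (TransactionId is uint32 per c.h, off_t needs <sys/types.h>):

    /*
     * One record in the <subid>-<xid>.changes file, as serialized by
     * stream_write_change(): a length word (counting the action byte but
     * not the length field itself), the action byte, and the message body
     * with the subxact XID already stripped.
     */
    typedef struct ChangesFileRecord
    {
        int         len;        /* sizeof(char) + size of body */
        char        action;     /* 'I', 'U', 'D', 'T', 'R', 'Y', ... */
        /* char     body[FLEXIBLE_ARRAY_MEMBER];   the message body */
    } ChangesFileRecord;

    /*
     * The <subid>-<xid>.subxacts file: nsubxacts, followed by one entry
     * per subtransaction recording where its first change starts in the
     * changes file, so an aborted subxact can simply be truncated away.
     */
    typedef struct SubXactInfo
    {
        TransactionId   xid;    /* XID of the subxact */
        int             fileno; /* BufFile segment number */
        off_t           offset; /* offset within that segment */
    } SubXactInfo;
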
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order in which the transactions are sent.  Also, the (sub)
+ * transactions might get aborted, so we need to send the schema for each
+ * (sub)transaction so that we don't lose the schema information on abort.
+ * To handle this, we maintain the list of xids (streamed_txns) for which
+ * we have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value: \"%s\"",
+								strVal(defel->arg))));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * if it's a top-level transaction or not (we have already sent that XID
+	 * at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
+ * Send the stream start message. The replication origin, if any, is sent
+ * only with the first streamed segment of the transaction.
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we have already sent the first stream for this transaction, don't
+	 * send the origin id in subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * Send the stream stop message, marking the end of a streamed block.
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check if the schema of the given relation was already sent in the given
+ * streamed transaction.  We expect a relatively small number of streamed
+ * transactions, so a simple linear search of the list is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid in the rel sync entry for which we have already sent the schema
+ * of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
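
A note to make the new xid parameters easier to review: inside a streamed
block every message is prefixed with the XID of the (sub)transaction it
belongs to, while outside of streaming the callers pass InvalidTransactionId
and the prefix is simply omitted. A minimal sketch of that convention (my
illustration only - the actual proto.c changes are in an earlier part of
this series; pq_sendint32 is the existing pqformat helper):

    /* Write the XID prefix only when valid, i.e. only in a streamed block. */
    static void
    write_xid_prefix(StringInfo out, TransactionId xid)
    {
        if (TransactionIdIsValid(xid))
            pq_sendint32(out, xid);
    }
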
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..655144d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
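
One semantic detail of the abort message worth spelling out: stream abort
carries both the toplevel XID and the XID of the aborted (sub)transaction,
and the two are equal when the whole transaction is rolled back. A sketch of
how a subscriber can interpret that (an illustration of the semantics, not
the patch's actual apply_handle_stream_abort):

    /* Illustration only: interpreting a stream abort on the subscriber. */
    static void
    example_handle_stream_abort(TransactionId xid, TransactionId subxid)
    {
        if (subxid == xid)
        {
            /* The whole toplevel transaction aborted: discard its changes
             * file and subxact file entirely. */
        }
        else
        {
            /* A single subtransaction aborted: truncate the changes file
             * at the offset recorded for subxid in the subxact file. */
        }
    }
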
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of a large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of a large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check streamed transaction was applied, with aborted subtransactions rolled back');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a large transaction with binary mode enabled
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d99046..500623e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -111,6 +111,7 @@ Append
 AppendPath
 AppendRelInfo
 AppendState
+ApplySubXactData
 Archive
 ArchiveEntryPtrType
 ArchiveFormat
@@ -2370,6 +2371,7 @@ StopList
 StopWorkersData
 StrategyNumber
 StreamCtl
+StreamXidHash
 StringInfo
 StringInfoData
 StripnullState
@@ -2380,6 +2382,7 @@ SubPlanState
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
 SubXactEvent
+SubXactInfo
 SubplanResultRelHashElem
 SubqueryScan
-- 
1.8.3.1

#490Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#489)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Aug 21, 2020 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 21, 2020 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

2.
+ /*
+ * If the new location is smaller than the current location in the file, then
+ * we need to set the curFile and the curOffset to the new values and also
+ * reset the pos and nbytes.  Otherwise nothing to do.
+ */
+ else if ((newFile < file->curFile) ||
+ newOffset < file->curOffset + file->pos)
+ {
+ file->curFile = newFile;
+ file->curOffset = newOffset;
+ file->pos = 0;
+ file->nbytes = 0;
+ }

Shouldn't there be && instead of ||? Because if newFile is greater
than curFile, then there is no point in updating it.

I think this condition is wrong; it should be:

else if ((newFile < file->curFile) || ((newFile == file->curFile) &&
(newOffset < file->curOffset + file->pos)))

Basically, either the new file is smaller, or, if it is the same file,
then the new offset should be smaller.

I think we don't need to use file->pos for that, as it is relevant
only for the current buffer; otherwise, such a condition should
suffice. However, I was not happy with the way the code and conditions
were arranged in BufFileTruncateShared, so I have re-arranged them and
changed quite a few comments in that API. Apart from that, I have
updated the docs and run pgindent for the first patch. Do let me know
if you have any more comments on the first patch.
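
For reference, a minimal sketch of the check without file->pos, as
described above (an illustration, not the committed code):

	/*
	 * Sketch: reposition only when the truncate point precedes the
	 * current position; file->pos matters only within the current
	 * buffer, so comparing newOffset against curOffset is enough.
	 */
	else if (newFile < file->curFile ||
			 (newFile == file->curFile && newOffset < file->curOffset))
	{
		file->curFile = newFile;
		file->curOffset = newOffset;
		file->pos = 0;
		file->nbytes = 0;
	}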

I have reviewed and tested the patch and the changes look fine to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#491Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#490)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have reviewed and tested the patch and the changes look fine to me.

Thanks, I will push the next patch early next week (by Tuesday) unless
you or someone else has any more comments on it. The summary of the
patch (v52-0001-Extend-the-BufFile-interface, attached with my
previous email) I am planning to push is: "It extends the BufFile
interface to support temporary files that can be used by a single
backend when the corresponding files need to survive across the
transaction and need to be opened and closed multiple times. Such
files need to be created as a member of a SharedFileSet. We have
implemented the interface for BufFileTruncate to allow files to be
truncated up to a particular offset and extended the BufFileSeek API
to support the SEEK_END case. We have also added an option to provide
a mode while opening the shared BufFiles instead of always opening in
read-only mode. These enhancements in the BufFile interface are
required for the upcoming patch to allow the replication apply worker
to properly handle streamed in-progress transactions."
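
To illustrate how the apply worker is expected to use these
extensions, here is a rough sketch (declarations omitted; fileset,
path, and the subxact fileno/offset are placeholders, not actual
code):

	/* Create a transaction's spool file under a SharedFileSet. */
	SharedFileSetInit(fileset, NULL);	/* NULL: single-backend usage */
	fd = BufFileCreateShared(fileset, path);
	BufFileClose(fd);

	/* A later stream segment reopens it writable, appending at the end. */
	fd = BufFileOpenShared(fileset, path, O_RDWR);
	BufFileSeek(fd, 0, 0, SEEK_END);

	/* On a subtransaction abort, discard its part of the spool file. */
	BufFileTruncateShared(fd, subxact_fileno, subxact_offset);
	BufFileClose(fd);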

--
With Regards,
Amit Kapila.

#492Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#491)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have reviewed and tested the patch and the changes look fine to me.

Thanks, I will push the next patch early next week (by Tuesday) unless
you or someone else has any more comments on it. The summary of the
patch (v52-0001-Extend-the-BufFile-interface, attached with my
previous email) I am planning to push is: "It extends the BufFile
interface to support temporary files that can be used by a single
backend when the corresponding files need to survive across the
transaction and need to be opened and closed multiple times. Such
files need to be created as a member of a SharedFileSet. We have
implemented the interface for BufFileTruncate to allow files to be
truncated up to a particular offset and extended the BufFileSeek API
to support the SEEK_END case. We have also added an option to provide
a mode while opening the shared BufFiles instead of always opening in
read-only mode. These enhancements in the BufFile interface are
required for the upcoming patch to allow the replication apply worker
to properly handle streamed in-progress transactions."

While reviewing 0002, I realized that instead of using an individual
shared fileset for each transaction, we can use just one common shared
fileset. We can create each transaction's buffile under that one
shared fileset, and whenever a transaction commits/aborts we can just
delete its buffile while the shared fileset itself stays.

I have attached a POC patch for this idea; if we agree with this
approach, then I will prepare a final patch in a couple of days.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

buffile_changes.patch (application/octet-stream)
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 4c58ad8b07..bf5fdda672 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -101,7 +101,10 @@ parse_subscription_options(List *options,
 		*binary = false;
 	}
 	if (streaming)
+	{
 		*streaming_given = false;
+		*streaming = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1347031d01..9f46b0b34f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -136,20 +136,6 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
-/*
- * Stream xid hash entry.  Whenever we see a new xid we create this entry in the
- * xidhash and along with it create the streaming file and store the fileset handle.
- * The subxact file is created iff there is any suxact info under this xid. This
- * entry is used on the subsequent streams for the xid to get the corresponding
- * fileset handles.
- */
-typedef struct StreamXidHash
-{
-	TransactionId xid;			/* xid is the hash key and must be first */
-	SharedFileSet *stream_fileset;	/* shared file set for stream data */
-	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
-} StreamXidHash;
-
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
@@ -169,14 +155,6 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
-/*
- * Hash table for storing the streaming xid information along with shared file
- * set for streaming and subxact files.  On every stream start we need to open
- * the xid's files and for that we need the shared file set handle.  So storing
- * it in xid hash make it faster to search.
- */
-static HTAB *xidhash = NULL;
-
 /* Buf file handle of the current streaming file. */
 static BufFile *stream_fd = NULL;
 
@@ -196,6 +174,8 @@ typedef struct ApplySubXactData
 	SubXactInfo *subxacts;		/* sub-xact offset in file */
 } ApplySubXactData;
 
+SharedFileSet *fileset;
+
 static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
 
 static void subxact_filename(char *path, Oid subid, TransactionId xid);
@@ -776,7 +756,6 @@ static void
 apply_handle_stream_start(StringInfo s)
 {
 	bool		first_segment;
-	HASHCTL		hash_ctl;
 
 	Assert(!in_streamed_transaction);
 
@@ -793,16 +772,6 @@ apply_handle_stream_start(StringInfo s)
 	/* extract XID of the top-level transaction */
 	stream_xid = logicalrep_read_stream_start(s, &first_segment);
 
-	/* Initialize the xidhash table if we haven't yet */
-	if (xidhash == NULL)
-	{
-		hash_ctl.keysize = sizeof(TransactionId);
-		hash_ctl.entrysize = sizeof(StreamXidHash);
-		hash_ctl.hcxt = ApplyContext;
-		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
-							  HASH_ELEM | HASH_CONTEXT);
-	}
-
 	/* open the spool file for this transaction */
 	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
 
@@ -885,7 +854,6 @@ apply_handle_stream_abort(StringInfo s)
 		BufFile    *fd;
 		bool		found = false;
 		char		path[MAXPGPATH];
-		StreamXidHash *ent;
 
 		subidx = -1;
 		ensure_transaction();
@@ -916,15 +884,9 @@ apply_handle_stream_abort(StringInfo s)
 
 		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
 
-		ent = (StreamXidHash *) hash_search(xidhash,
-											(void *) &xid,
-											HASH_FIND,
-											&found);
-		Assert(found);
-
 		/* open the changes file */
 		changes_filename(path, MyLogicalRepWorker->subid, xid);
-		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		fd = BufFileOpenShared(fileset, path, O_RDWR);
 
 		/* OK, truncate the file at the right offset */
 		BufFileTruncateShared(fd, subxact_data.subxacts[subidx].fileno,
@@ -951,9 +913,7 @@ apply_handle_stream_commit(StringInfo s)
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
-	bool		found;
 	LogicalRepCommitData commit_data;
-	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
@@ -969,12 +929,8 @@ apply_handle_stream_commit(StringInfo s)
 	/* open the spool file for the committed transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file '%s'", path);
-	ent = (StreamXidHash *) hash_search(xidhash,
-										(void *) &xid,
-										HASH_FIND,
-										&found);
-	Assert(found);
-	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	fd = BufFileOpenShared(fileset, path, O_RDONLY);
 
 	buffer = palloc(BLCKSZ);
 	initStringInfo(&s2);
@@ -2501,61 +2457,14 @@ static void
 subxact_info_write(Oid subid, TransactionId xid)
 {
 	char		path[MAXPGPATH];
-	bool		found;
 	Size		len;
-	StreamXidHash *ent;
 	BufFile    *fd;
 
 	Assert(TransactionIdIsValid(xid));
 
 	subxact_filename(path, subid, xid);
 
-	/* find the xid entry in the xidhash */
-	ent = (StreamXidHash *) hash_search(xidhash,
-										(void *) &xid,
-										HASH_FIND,
-										&found);
-	/* we must found the entry for its top transaction by this time */
-	Assert(found);
-
-	/*
-	 * If there is no subtransaction then nothing to do, but if already have
-	 * subxact file then delete that.
-	 */
-	if (subxact_data.nsubxacts == 0)
-	{
-		if (ent->subxact_fileset)
-		{
-			cleanup_subxact_info();
-			BufFileDeleteShared(ent->subxact_fileset, path);
-			pfree(ent->subxact_fileset);
-			ent->subxact_fileset = NULL;
-		}
-
-		return;
-	}
-
-	/*
-	 * Create the subxact file if it not already created, otherwise open the
-	 * existing file.
-	 */
-	if (ent->subxact_fileset == NULL)
-	{
-		MemoryContext oldctx;
-
-		/*
-		 * We need to maintain shared fileset across multiple stream
-		 * start/stop calls.  So, need to allocate it in a persistent context.
-		 */
-		oldctx = MemoryContextSwitchTo(ApplyContext);
-		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
-		SharedFileSetInit(ent->subxact_fileset, NULL);
-		MemoryContextSwitchTo(oldctx);
-
-		fd = BufFileCreateShared(ent->subxact_fileset, path);
-	}
-	else
-		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+	fd = BufFileOpenShared(fileset, path, O_RDWR);
 
 	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
 
@@ -2583,10 +2492,8 @@ static void
 subxact_info_read(Oid subid, TransactionId xid)
 {
 	char		path[MAXPGPATH];
-	bool		found;
 	Size		len;
 	BufFile    *fd;
-	StreamXidHash *ent;
 	MemoryContext oldctx;
 
 	Assert(TransactionIdIsValid(xid));
@@ -2594,22 +2501,9 @@ subxact_info_read(Oid subid, TransactionId xid)
 	Assert(subxact_data.nsubxacts == 0);
 	Assert(subxact_data.nsubxacts_max == 0);
 
-	/* Find the stream xid entry in the xidhash */
-	ent = (StreamXidHash *) hash_search(xidhash,
-										(void *) &xid,
-										HASH_FIND,
-										&found);
-
-	/*
-	 * If subxact_fileset is not valid that mean we don't have any subxact
-	 * info
-	 */
-	if (ent->subxact_fileset == NULL)
-		return;
-
 	subxact_filename(path, subid, xid);
 
-	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+	fd = BufFileOpenShared(fileset, path, O_RDONLY);
 
 	/* read number of subxact items */
 	if (BufFileRead(fd, &subxact_data.nsubxacts,
@@ -2753,27 +2647,18 @@ static void
 stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
 {
 	char		path[MAXPGPATH];
-	StreamXidHash *ent;
 
-	/* Remove the xid entry from the stream xid hash */
-	ent = (StreamXidHash *) hash_search(xidhash,
-										(void *) &xid,
-										HASH_REMOVE,
-										NULL);
-	/* By this time we must have created the transaction entry */
-	Assert(ent != NULL);
+	Assert(fileset != NULL);
 
 	/* Delete the change file and release the stream fileset memory */
 	changes_filename(path, subid, xid);
-	SharedFileSetDeleteAll(ent->stream_fileset);
-	pfree(ent->stream_fileset);
-
+	BufFileDeleteShared(fileset, path);
+	
 	/* Delete the subxact file and release the memory, if it exist */
-	if (ent->subxact_fileset)
+	if (subxact_data.nsubxacts > 0)
 	{
 		subxact_filename(path, subid, xid);
-		SharedFileSetDeleteAll(ent->subxact_fileset);
-		pfree(ent->subxact_fileset);
+		BufFileDeleteShared(fileset, path);
 	}
 }
 
@@ -2793,23 +2678,33 @@ static void
 stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 {
 	char		path[MAXPGPATH];
-	bool		found;
+	char		subxact_path[MAXPGPATH];
 	MemoryContext oldcxt;
-	StreamXidHash *ent;
 
 	Assert(in_streamed_transaction);
 	Assert(OidIsValid(subid));
 	Assert(TransactionIdIsValid(xid));
 	Assert(stream_fd == NULL);
 
-	/* create or find the xid entry in the xidhash */
-	ent = (StreamXidHash *) hash_search(xidhash,
-										(void *) &xid,
-										HASH_ENTER | HASH_FIND,
-										&found);
-	Assert(first_segment || found);
+	/*
+	 * If the shared fileset is not initialized yet, do it now.  We need to
+	 * maintain the shared fileset across multiple stream start/stop calls,
+	 * so allocate it in a persistent context.
+	 */
+	if (fileset == NULL)
+	{
+		MemoryContext savectx;
+
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+	}
+
 	changes_filename(path, subid, xid);
-	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+	subxact_filename(subxact_path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);	
 
 	/*
 	 * Create/open the buffiles under the logical streaming context so that we
@@ -2824,25 +2719,11 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 	 */
 	if (first_segment)
 	{
-		MemoryContext savectx;
-		SharedFileSet *fileset;
-
-		/*
-		 * We need to maintain shared fileset across multiple stream
-		 * start/stop calls.  So, need to allocate it in a persistent context.
-		 */
-		savectx = MemoryContextSwitchTo(ApplyContext);
-		fileset = palloc(sizeof(SharedFileSet));
-
-		SharedFileSetInit(fileset, NULL);
-		MemoryContextSwitchTo(savectx);
+		BufFile	*subxact_fd;
 
 		stream_fd = BufFileCreateShared(fileset, path);
-
-		/* Remember the fileset for the next stream of the same transaction */
-		ent->xid = xid;
-		ent->stream_fileset = fileset;
-		ent->subxact_fileset = NULL;
+		subxact_fd = BufFileCreateShared(fileset, subxact_path);
+		BufFileClose(subxact_fd);
 	}
 	else
 	{
@@ -2850,7 +2731,7 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
 		 * Open the file and seek to the end of the file because we always
 		 * append the changes file.
 		 */
-		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		stream_fd = BufFileOpenShared(fileset, path, O_RDWR);
 		BufFileSeek(stream_fd, 0, 0, SEEK_END);
 	}
 
#493Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#492)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Aug 24, 2020 at 9:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have reviewed and tested the patch and the changes look fine to me.

Thanks, I will push the next patch early next week (by Tuesday) unless
you or someone else has any more comments on it. The summary of the
patch (v52-0001-Extend-the-BufFile-interface, attached with my
previous email) I am planning to push is: "It extends the BufFile
interface to support temporary files that can be used by a single
backend when the corresponding files need to survive across the
transaction and need to be opened and closed multiple times. Such
files need to be created as a member of a SharedFileSet. We have
implemented the interface for BufFileTruncate to allow files to be
truncated up to a particular offset and extended the BufFileSeek API
to support the SEEK_END case. We have also added an option to provide
a mode while opening the shared BufFiles instead of always opening in
read-only mode. These enhancements in the BufFile interface are
required for the upcoming patch to allow the replication apply worker
to properly handle streamed in-progress transactions."

While reviewing 0002, I realized that instead of using an individual
shared fileset for each transaction, we can use just one common shared
fileset. We can create each transaction's buffile under that one
shared fileset, and whenever a transaction commits/aborts we can just
delete its buffile while the shared fileset itself stays.

I think the existing design is superior, as it allows the flexibility
to create transaction files in different temp_tablespaces, which is
quite important given that the files will be created only for large
transactions. Once we fix the sharedfileset for a worker, all the
files will be created in the temp_tablespaces chosen at the time the
apply worker first creates it, even if the setting changes at some
later point (the user can change its value and then reload the config,
which I think will affect the worker settings as well). This all
happens because we set the tablespaces at the time of
SharedFileSetInit.
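
For reference, this is roughly the relevant part of SharedFileSetInit
(a sketch with elisions; see sharedfileset.c for the real code):

	void
	SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
	{
		...
		/* Capture the current temp_tablespaces setting once, up front. */
		PrepareTempTablespaces();
		fileset->ntablespaces =
			GetTempTablespaces(&fileset->tablespaces[0],
							   lengthof(fileset->tablespaces));
		if (fileset->ntablespaces == 0)
		{
			/* Empty GUC: fall back to the default tablespace. */
			fileset->tablespaces[0] = DEFAULTTABLESPACE_OID;
			fileset->ntablespaces = 1;
		}
		...
	}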

The other, relatively smaller, thing I don't like is that we always
need to create a buffile for the subxact info even when we don't need
it. We might be able to find some solution for this, but I guess the
previous point is what bothers me more.

--
With Regards,
Amit Kapila.

#494Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#493)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 24, 2020 at 9:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have reviewed and tested the patch and the changes look fine to me.

Thanks, I will push the next patch early next week (by Tuesday) unless
you or someone else has any more comments on it. The summary of the
patch (v52-0001-Extend-the-BufFile-interface, attached with my
previous email) I am planning to push is: "It extends the BufFile
interface to support temporary files that can be used by a single
backend when the corresponding files need to survive across the
transaction and need to be opened and closed multiple times. Such
files need to be created as a member of a SharedFileSet. We have
implemented the interface for BufFileTruncate to allow files to be
truncated up to a particular offset and extended the BufFileSeek API
to support the SEEK_END case. We have also added an option to provide
a mode while opening the shared BufFiles instead of always opening in
read-only mode. These enhancements in the BufFile interface are
required for the upcoming patch to allow the replication apply worker
to properly handle streamed in-progress transactions."

While reviewing 0002, I realized that instead of using an individual
shared fileset for each transaction, we can use just one common shared
fileset. We can create each transaction's buffile under that one
shared fileset, and whenever a transaction commits/aborts we can just
delete its buffile while the shared fileset itself stays.

I think the existing design is superior, as it allows the flexibility
to create transaction files in different temp_tablespaces, which is
quite important given that the files will be created only for large
transactions. Once we fix the sharedfileset for a worker, all the
files will be created in the temp_tablespaces chosen at the time the
apply worker first creates it, even if the setting changes at some
later point (the user can change its value and then reload the config,
which I think will affect the worker settings as well). This all
happens because we set the tablespaces at the time of
SharedFileSetInit.

Yeah, I agree with this point: if we use a single shared fileset then
it will always use the same tablespace for all the streaming
transactions. And we might get the benefit of concurrent I/O if we use
different tablespaces, as we are not immediately flushing the files to
disk.

The other, relatively smaller, thing I don't like is that we always
need to create a buffile for the subxact info even when we don't need
it. We might be able to find some solution for this, but I guess the
previous point is what bothers me more.

Yeah, if we go this way we might need to find some solution to this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#495Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#494)
5 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Aug 25, 2020 at 10:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the existing design is superior, as it allows the flexibility
to create transaction files in different temp_tablespaces, which is
quite important given that the files will be created only for large
transactions. Once we fix the sharedfileset for a worker, all the
files will be created in the temp_tablespaces chosen at the time the
apply worker first creates it, even if the setting changes at some
later point (the user can change its value and then reload the config,
which I think will affect the worker settings as well). This all
happens because we set the tablespaces at the time of
SharedFileSetInit.

Yeah, I agree with this point: if we use a single shared fileset then
it will always use the same tablespace for all the streaming
transactions. And we might get the benefit of concurrent I/O if we use
different tablespaces, as we are not immediately flushing the files to
disk.

Okay, so let's retain the original approach then. I have made a few
cosmetic modifications in the first two patches, which include
updating docs and comments, slightly modifying the commit message, and
changing the code to match the nearby code. One change on which you
might have a different opinion is below:

+ case WAIT_EVENT_LOGICAL_CHANGES_READ:
+ event_name = "ReorderLogicalChangesRead";
+ break;
+ case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+ event_name = "ReorderLogicalChangesWrite";
+ break;
+ case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+ event_name = "ReorderLogicalSubxactRead";
+ break;
+ case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+ event_name = "ReorderLogicalSubxactWrite";
+ break;

Why do we want to name these events starting with Reorder*? I think
these are used on the subscriber side, so there is no need to use the
word Reorder, and I have removed it in the attached patch. I am
planning to push the first patch (v53-0001-Extend-the-BufFile-interface)
in this series tomorrow unless you have any comments on it.

--
With Regards,
Amit Kapila.

Attachments:

v53-0001-Extend-the-BufFile-interface.patch (application/octet-stream)
From dee13cfb791937edef715c935f4cb61ced11b098 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 14 Jul 2020 10:56:51 +0530
Subject: [PATCH v53 1/5] Extend the BufFile interface.

Allow BufFile to support temporary files that can be used by a single
backend when the corresponding files need to survive across the
transaction and need to be opened and closed multiple times. Such files
need to be created as a member of a SharedFileSet.

This commit implements the interface for BufFileTruncate to allow files to
be truncated up to a particular offset and extends the BufFileSeek API to
support the SEEK_END case. This also adds an option to provide a mode
while opening the shared BufFiles instead of always opening in read-only
mode.

These enhancements in the BufFile interface are required for the upcoming
patch to allow the replication apply worker to handle streamed
in-progress transactions.

Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/monitoring.sgml              |   4 +
 src/backend/postmaster/pgstat.c           |   3 +
 src/backend/storage/file/buffile.c        | 129 +++++++++++++++++++++++++++---
 src/backend/storage/file/fd.c             |   9 +--
 src/backend/storage/file/sharedfileset.c  | 105 ++++++++++++++++++++++--
 src/backend/utils/sort/logtape.c          |   4 +-
 src/backend/utils/sort/sharedtuplestore.c |   2 +-
 src/include/pgstat.h                      |   1 +
 src/include/storage/buffile.h             |   4 +-
 src/include/storage/fd.h                  |   2 +-
 src/include/storage/sharedfileset.h       |   4 +-
 11 files changed, 239 insertions(+), 28 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0f11375..17a0df6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1203,6 +1203,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Waiting for a write to a buffered file.</entry>
      </row>
      <row>
+      <entry><literal>BufFileTruncate</literal></entry>
+      <entry>Waiting for a buffered file to be truncated.</entry>
+     </row>
+     <row>
       <entry><literal>ControlFileRead</literal></entry>
       <entry>Waiting for a read from the <filename>pg_control</filename>
        file.</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 73ce944..8116b23 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3940,6 +3940,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_BUFFILE_WRITE:
 			event_name = "BufFileWrite";
 			break;
+		case WAIT_EVENT_BUFFILE_TRUNCATE:
+			event_name = "BufFileTruncate";
+			break;
 		case WAIT_EVENT_CONTROL_FILE_READ:
 			event_name = "ControlFileRead";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 2d7a082..d581f96 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -32,10 +32,14 @@
  * (by opening multiple fd.c temporary files).  This is an essential feature
  * for sorts and hashjoins on large amounts of data.
  *
- * BufFile supports temporary files that can be made read-only and shared with
- * other backends, as infrastructure for parallel execution.  Such files need
- * to be created as a member of a SharedFileSet that all participants are
- * attached to.
+ * BufFile supports temporary files that can be shared with other backends, as
+ * infrastructure for parallel execution.  Such files need to be created as a
+ * member of a SharedFileSet that all participants are attached to.
+ *
+ * BufFile also supports temporary files that can be used by a single backend
+ * when the corresponding files need to survive across the transaction and
+ * need to be opened and closed multiple times.  Such files need to be created
+ * as a member of a SharedFileSet.
  *-------------------------------------------------------------------------
  */
 
@@ -277,7 +281,7 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
  * backends and render it read-only.
  */
 BufFile *
-BufFileOpenShared(SharedFileSet *fileset, const char *name)
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode)
 {
 	BufFile    *file;
 	char		segment_name[MAXPGPATH];
@@ -301,7 +305,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 		}
 		/* Try to load a segment. */
 		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name, mode);
 		if (files[nfiles] <= 0)
 			break;
 		++nfiles;
@@ -321,7 +325,7 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 
 	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
+	file->readOnly = (mode == O_RDONLY) ? true : false;
 	file->fileset = fileset;
 	file->name = pstrdup(name);
 
@@ -666,11 +670,21 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			newFile = file->curFile;
 			newOffset = (file->curOffset + file->pos) + offset;
 			break;
-#ifdef NOT_USED
 		case SEEK_END:
-			/* could be implemented, not needed currently */
+
+			/*
+			 * The file size of the last file gives us the end offset of that
+			 * file.
+			 */
+			newFile = file->numFiles - 1;
+			newOffset = FileSize(file->files[file->numFiles - 1]);
+			if (newOffset < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m",
+								FilePathName(file->files[file->numFiles - 1]),
+								file->name)));
 			break;
-#endif
 		default:
 			elog(ERROR, "invalid whence: %d", whence);
 			return EOF;
@@ -838,3 +852,98 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+/*
+ * Truncate a BufFile created by BufFileCreateShared up to the given fileno and
+ * the offset.
+ */
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+	int			numFiles = file->numFiles;
+	int			newFile = fileno;
+	off_t		newOffset = file->curOffset;
+	char		segment_name[MAXPGPATH];
+	int			i;
+
+	/*
+	 * Loop over all the files up to the given fileno and remove the files
+	 * that are greater than the fileno and truncate the given file up to the
+	 * offset. Note that we also remove the given fileno if the offset is 0
+	 * provided it is not the first file in which we truncate it.
+	 */
+	for (i = file->numFiles - 1; i >= fileno; i--)
+	{
+		if ((i != fileno || offset == 0) && i != 0)
+		{
+			SharedSegmentName(segment_name, file->name, i);
+			FileClose(file->files[i]);
+			if (!SharedFileSetDelete(file->fileset, segment_name, true))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not delete shared fileset \"%s\": %m",
+								segment_name)));
+			numFiles--;
+			newOffset = MAX_PHYSICAL_FILESIZE;
+
+			/*
+			 * This is required to indicate that we have deleted the given
+			 * fileno.
+			 */
+			if (i == fileno)
+				newFile--;
+		}
+		else
+		{
+			if (FileTruncate(file->files[i], offset,
+							 WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								FilePathName(file->files[i]))));
+			newOffset = offset;
+		}
+	}
+
+	file->numFiles = numFiles;
+
+	/*
+	 * If the truncate point is within existing buffer then we can just adjust
+	 * pos within buffer.
+	 */
+	if (newFile == file->curFile &&
+		newOffset >= file->curOffset &&
+		newOffset <= file->curOffset + file->nbytes)
+	{
+		/* No need to reset the current pos if the new pos is greater. */
+		if (newOffset <= file->curOffset + file->pos)
+			file->pos = (int) (newOffset - file->curOffset);
+
+		/* Adjust the nbytes for the current buffer. */
+		file->nbytes = (int) (newOffset - file->curOffset);
+	}
+	else if (newFile == file->curFile &&
+			 newOffset < file->curOffset)
+	{
+		/*
+		 * The truncate point is within the existing file but prior to the
+		 * current position, so we can forget the current buffer and reset the
+		 * current position.
+		 */
+		file->curOffset = newOffset;
+		file->pos = 0;
+		file->nbytes = 0;
+	}
+	else if (newFile < file->curFile)
+	{
+		/*
+		 * The truncate point is prior to the current file, so need to reset
+		 * the current position accordingly.
+		 */
+		file->curFile = newFile;
+		file->curOffset = newOffset;
+		file->pos = 0;
+		file->nbytes = 0;
+	}
+	/* Nothing to do, if the truncate point is beyond current file. */
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420e..f376a97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1743,18 +1743,17 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 /*
  * Open a file that was created with PathNameCreateTemporaryFile, possibly in
  * another backend.  Files opened this way don't count against the
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are automatically closed at the end of the
+ * transaction but are not deleted on close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)
 {
 	File		file;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
-	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFile(path, mode | PG_BINARY);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 16b7594..65fd8ff 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -13,6 +13,10 @@
  * files can be discovered by name, and a shared ownership semantics so that
  * shared files survive until the last user detaches.
  *
+ * SharedFileSets can be used by backends when the temporary files need to be
+ * opened/closed multiple times and the underlying files need to survive across
+ * transactions.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -25,25 +29,36 @@
 #include "common/hashfn.h"
 #include "miscadmin.h"
 #include "storage/dsm.h"
+#include "storage/ipc.h"
 #include "storage/sharedfileset.h"
 #include "utils/builtins.h"
 
+static List *filesetlist = NIL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetDeleteOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
 static Oid	ChooseTablespace(const SharedFileSet *fileset, const char *name);
 
 /*
- * Initialize a space for temporary files that can be opened for read-only
- * access by other backends.  Other backends must attach to it before
- * accessing it.  Associate this SharedFileSet with 'seg'.  Any contained
- * files will be deleted when the last backend detaches.
+ * Initialize a space for temporary files that can be opened by other backends.
+ * Other backends must attach to it before accessing it.  Associate this
+ * SharedFileSet with 'seg'.  Any contained files will be deleted when the
+ * last backend detaches.
+ *
+ * We can also use this interface if the temporary files are used only by a
+ * single backend but the files need to be opened and closed multiple times
+ * and also the underlying files need to survive across transactions.  For
+ * such cases, dsm segment 'seg' should be passed as NULL.  Callers are
+ * expected to explicitly remove such files by using SharedFileSetDelete/
+ * SharedFileSetDeleteAll or we remove such files on proc exit.
  *
  * Files will be distributed over the tablespaces configured in
  * temp_tablespaces.
  *
  * Under the covers the set is one or more directories which will eventually
- * be deleted when there are no backends attached.
+ * be deleted.
  */
 void
 SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
@@ -84,7 +99,25 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
 	}
 
 	/* Register our cleanup callback. */
-	on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	if (seg)
+		on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+	else
+	{
+		static bool registered_cleanup = false;
+
+		if (!registered_cleanup)
+		{
+			/*
+			 * We must not have registered any fileset before registering the
+			 * fileset clean up.
+			 */
+			Assert(filesetlist == NIL);
+			on_proc_exit(SharedFileSetDeleteOnProcExit, 0);
+			registered_cleanup = true;
+		}
+
+		filesetlist = lcons((void *) fileset, filesetlist);
+	}
 }
 
 /*
@@ -147,13 +180,13 @@ SharedFileSetCreate(SharedFileSet *fileset, const char *name)
  * another backend.
  */
 File
-SharedFileSetOpen(SharedFileSet *fileset, const char *name)
+SharedFileSetOpen(SharedFileSet *fileset, const char *name, int mode)
 {
 	char		path[MAXPGPATH];
 	File		file;
 
 	SharedFilePath(path, fileset, name);
-	file = PathNameOpenTemporaryFile(path);
+	file = PathNameOpenTemporaryFile(path, mode);
 
 	return file;
 }
@@ -192,6 +225,9 @@ SharedFileSetDeleteAll(SharedFileSet *fileset)
 		SharedFileSetPath(dirpath, fileset, fileset->tablespaces[i]);
 		PathNameDeleteTemporaryDir(dirpath);
 	}
+
+	/* Unregister the shared fileset */
+	SharedFileSetUnregister(fileset);
 }
 
 /*
@@ -223,6 +259,59 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 }
 
 /*
+ * Callback function that will be invoked on the process exit.  This will
+ * process the list of all the registered sharedfilesets and delete the
+ * underlying files.
+ */
+static void
+SharedFileSetDeleteOnProcExit(int status, Datum arg)
+{
+	ListCell   *l;
+
+	/* Loop over all the pending shared fileset entries */
+	foreach(l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		SharedFileSetDeleteAll(fileset);
+	}
+
+	filesetlist = NIL;
+}
+
+/*
+ * Unregister the shared fileset entry registered for cleanup on proc exit.
+ */
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+	bool		found = false;
+	ListCell   *l;
+
+	/*
+	 * If the caller is following the dsm based cleanup then we don't maintain
+	 * the filesetlist so return.
+	 */
+	if (filesetlist == NIL)
+		return;
+
+	foreach(l, filesetlist)
+	{
+		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+		/* Remove the entry from the list */
+		if (input_fileset == fileset)
+		{
+			filesetlist = list_delete_cell(filesetlist, l);
+			found = true;
+			break;
+		}
+	}
+
+	Assert(found);
+}
+
+/*
  * Build the path for the directory holding the files backing a SharedFileSet
  * in a given tablespace.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 5517e59..788815c 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -78,6 +78,8 @@
 
 #include "postgres.h"
 
+#include <fcntl.h>
+
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
@@ -551,7 +553,7 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
-		file = BufFileOpenShared(fileset, filename);
+		file = BufFileOpenShared(fileset, filename, O_RDONLY);
 		filesize = BufFileSize(file);
 
 		/*
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a43..b83fb50 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -559,7 +559,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 
 				sts_filename(name, accessor, accessor->read_participant);
 				accessor->read_file =
-					BufFileOpenShared(accessor->fileset, name);
+					BufFileOpenShared(accessor->fileset, name, O_RDONLY);
 			}
 
 			/* Seek and load the chunk header. */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201..807a9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -916,6 +916,7 @@ typedef enum
 	WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
 	WAIT_EVENT_BUFFILE_READ,
 	WAIT_EVENT_BUFFILE_WRITE,
+	WAIT_EVENT_BUFFILE_TRUNCATE,
 	WAIT_EVENT_CONTROL_FILE_READ,
 	WAIT_EVENT_CONTROL_FILE_SYNC,
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f4752ba..fc34c49 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,9 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
+extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
+								  int mode);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
+extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);
 
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d..e209f04 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -94,7 +94,7 @@ extern mode_t FileGetRawMode(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
-extern File PathNameOpenTemporaryFile(const char *name);
+extern File PathNameOpenTemporaryFile(const char *path, int mode);
 extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
 extern void PathNameCreateTemporaryDir(const char *base, const char *name);
 extern void PathNameDeleteTemporaryDir(const char *name);
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index 2d6cf07..d5edb60 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -37,9 +37,11 @@ typedef struct SharedFileSet
 extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
 extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
 extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
-extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);
+extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name,
+							  int mode);
 extern bool SharedFileSetDelete(SharedFileSet *fileset, const char *name,
 								bool error_on_failure);
 extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetUnregister(SharedFileSet *input_fileset);
 
 #endif
-- 
1.8.3.1

v53-0002-Add-support-for-streaming-to-built-in-logical-re.patch (application/octet-stream)
From cbdb201303519392a09c4ca0b1179a1f78e14400 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v53 2/5] Add support for streaming to built-in logical
 replication.

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We don't
need to replicate the changes accumulated during this phase, and
moreover we don't have a replication connection open, so we have
nowhere to send the data anyway.

Author: Tomas Vondra, Dilip Kumar and Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/monitoring.sgml                       |  16 +
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  46 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 162 +++-
 src/backend/replication/logical/worker.c           | 960 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 ++
 .../subscription/t/012_stream_subxact_abort.pl     |  82 ++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 src/tools/pgindent/typedefs.list                   |   3 +
 21 files changed, 2112 insertions(+), 47 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 17a0df6..7fa1d79 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1509,6 +1509,22 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>WALWrite</literal></entry>
       <entry>Waiting for a write to a WAL file.</entry>
      </row>
+     <row>
+      <entry><literal>LogicalChangesRead</literal></entry>
+      <entry>Waiting for a read from a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalChangesWrite</literal></entry>
+      <entry>Waiting for a write to a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactRead</literal></entry>
+      <entry>Waiting for a read from a logical subxact file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactWrite</literal></entry>
+      <entry>Waiting for a write to a logical subxact file.</entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a1666b3 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..9426e1d 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,11 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+	{
+		*streaming_given = false;
+		*streaming = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +200,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +353,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +378,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -427,6 +446,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
+	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -698,6 +718,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +729,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +762,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +786,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +831,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23..5f4b168 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "LogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "LogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "LogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "LogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 8afa5a2..ad57409 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -425,6 +425,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
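For illustration, a minimal standalone sketch of how the option list comes out
when streaming is requested and the server is new enough; it uses plain
snprintf instead of the StringInfo machinery, and the proto_version and
version values are assumptions:

#include <stdio.h>

int
main(void)
{
	char		cmd[128];
	int			len;
	int			server_version = 140000;	/* assumed PQserverVersion() result */
	int			streaming = 1;				/* subscription has streaming = on */

	len = snprintf(cmd, sizeof(cmd), "proto_version '%u'", 2U);
	if (streaming && server_version >= 140000)
		snprintf(cmd + len, sizeof(cmd) - len, ", streaming 'on'");

	printf("%s\n", cmd);		/* prints: proto_version '2', streaming 'on' */
	return 0;
}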
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..f82236e 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,126 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+/*
+ * Write the information for the start stream message to the output stream.
+ */
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+/*
+ * Read the information about the start stream message from the output stream.
+ */
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+/*
+ * Write the stop stream message to the output stream.
+ */
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+/*
+ * Write STREAM COMMIT to the output stream.
+ */
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read STREAM COMMIT from the output stream.
+ */
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+/*
+ * Write STREAM ABORT to the output stream.  Note that xid and subxid will
+ * be the same for a top-level transaction abort.
+ */
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+/*
+ * Read STREAM ABORT from the output stream.
+ */
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
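To make the new message framing concrete, here is a minimal standalone sketch
(an illustration only, not part of the patch) that encodes a STREAM START
message the same way logicalrep_write_stream_start() does: the action byte
'S', the 32-bit XID in network byte order (which is what pq_sendint32 emits),
and the first-segment flag byte:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

static size_t
encode_stream_start(uint8_t *buf, uint32_t xid, int first_segment)
{
	uint32_t	nxid = htonl(xid);	/* network byte order, as pq_sendint32 */

	buf[0] = 'S';					/* action STREAM START */
	memcpy(&buf[1], &nxid, sizeof(nxid));
	buf[5] = first_segment ? 1 : 0;
	return 6;
}

int
main(void)
{
	uint8_t		buf[6];
	size_t		n = encode_stream_start(buf, 741, 1);

	for (size_t i = 0; i < n; i++)
		printf("%02x ", buf[i]);
	printf("\n");					/* prints: 53 00 00 02 e5 01 */
	return 0;
}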
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b576e34..1347031 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,43 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ *
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires dealing with aborts of both the toplevel transaction and its
+ * subtransactions.  This is achieved by tracking the offset of each
+ * subtransaction's first change, which is later used to truncate the file
+ * with serialized changes on a subtransaction abort.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription.  This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides automatic cleanup on error, and (c) it allows the
+ * files to survive local transactions, so they can be opened at stream
+ * start and closed at stream stop.  We rely on the SharedFileSet
+ * infrastructure because a plain BufFile is deleted as soon as it is
+ * closed, while keeping the stream files open across stream start/stop
+ * would consume a lot of memory (more than 8kB per file).  Moreover,
+ * without SharedFileSet we would need to invent a new way to pass filenames
+ * to the BufFile APIs, so that the same file could be reopened across
+ * multiple stream-open calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +65,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +97,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +107,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +136,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry.  Whenever we see a new xid, we create this entry
+ * in the xidhash, create the streaming file, and store the fileset handle.
+ * The subxact file is created iff there is any subxact info under this xid.
+ * This entry is used on subsequent streams for the xid to look up the
+ * corresponding fileset handles.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* Per-stream memory context for streaming transactions. */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +164,68 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table storing the streaming xid information along with the shared
+ * filesets for the streaming and subxact files.  On every stream start we
+ * need to open the xid's files, and for that we need the shared fileset
+ * handle, so storing it in the xid hash makes the lookup fast.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the current streaming file. */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+/* Sub-transaction data of the current streaming transaction. */
+typedef struct ApplySubXactData
+{
+	uint32		nsubxacts;		/* number of sub-transactions */
+	uint32		nsubxacts_max;	/* current capacity of subxact_last */
+	TransactionId subxact_last; /* xid of the last sub-transaction */
+	SubXactInfo *subxacts;		/* sub-xact offset in file */
+} ApplySubXactData;
+
+static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +297,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the change to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +758,324 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside remote transaction or inside
+	 * streaming transaction and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if ((!in_remote_transaction && !in_streamed_transaction) ||
+		((IsTransactionState() && !am_tablesync_worker()) &&
+		 !in_streamed_transaction))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be
+	 * committed on stream stop.  We need the transaction for handling the
+	 * BufFile, used for serializing the streamed data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/* Initialize the xidhash table if we haven't yet */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, read the existing subxact info */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+		 * would allow us to use binary search here.
+		 *
+		 * XXX Or perhaps we can rely on the aborts to arrive in the reverse
+		 * order, i.e. from the inner-most subxact (when nested)? In which
+		 * case we could simply check the last element.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		/* XXX optimize the search by bsearch on sorted data */
+		for (i = subxact_data.nsubxacts; i > 0; i--)
+		{
+			if (subxact_data.subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxact_data.subxacts[subidx].fileno,
+							  subxact_data.subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		subxact_data.nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
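/*
 * A minimal standalone sketch (an illustration only, not from the patch) of
 * the truncation rule applied by apply_handle_stream_abort() above, using a
 * plain array instead of BufFile: aborting a subxact keeps everything
 * spooled before its first change and discards it together with all later
 * subxacts.  The Demo* names are hypothetical.
 */
typedef struct DemoSubXact
{
	unsigned int xid;			/* subtransaction XID */
	long		offset;			/* offset of its first change in the file */
} DemoSubXact;

static long
demo_abort_offset(DemoSubXact *subxacts, unsigned int *nsubxacts,
				  unsigned int subxid)
{
	/* scan from the tail, like the loop above */
	for (unsigned int i = *nsubxacts; i > 0; i--)
	{
		if (subxacts[i - 1].xid == subxid)
		{
			*nsubxacts = i - 1;	/* drop this and all later subxacts */
			return subxacts[i - 1].offset;	/* truncate the file here */
		}
	}
	return -1;					/* empty subxact: nothing was spooled */
}

/*
 * E.g. with subxacts {741 at offset 96, 742 at offset 250}, aborting 741
 * truncates the changes file at offset 96 and leaves zero tracked subxacts.
 */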
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file '%s'", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file: %m")));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file '%s'",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid, false);
+}
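/*
 * A minimal standalone sketch (an illustration only, not from the patch) of
 * the replay loop above, reading the spool-file record format with stdio
 * instead of BufFile: each record is [int len][char action][payload], where
 * len covers the action byte and payload but not the length field itself.
 */
#include <stdio.h>
#include <stdlib.h>

static int
demo_replay(FILE *fp)
{
	int			len;
	int			nchanges = 0;

	while (fread(&len, sizeof(len), 1, fp) == 1)
	{
		char	   *buf;

		if (len <= 0)
			return -1;			/* corrupt record */

		buf = malloc(len);
		if (buf == NULL || fread(buf, 1, len, fp) != (size_t) len)
		{
			free(buf);
			return -1;			/* out of memory, or truncated file */
		}
		/* buf[0] is the action ('I', 'U', 'D', ...), rest is the message */
		free(buf);
		nchanges++;
	}
	return nchanges;
}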
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1088,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1106,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1145,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1263,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1415,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1788,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1929,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2057,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode
+	 * is enabled.  It is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2169,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1947,6 +2442,20 @@ maybe_reread_subscription(void)
 		proc_exit(0);
 	}
 
+	/*
+	 * Exit if the streaming option has changed.  The launcher will start a
+	 * new worker.
+	 */
+	if (newsub->stream != MySubscription->stream)
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will "
+						"restart because subscription's streaming option were changed",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
 	/* Check for other changes that should never happen too. */
 	if (newsub->dbid != MySubscription->dbid)
 	{
@@ -1979,6 +2488,446 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	subxact_filename(path, subid, xid);
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for the top-level transaction by now */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists, delete it.
+	 */
+	if (subxact_data.nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			BufFileDeleteShared(ent->subxact_fileset, path);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so we allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &subxact_data.nsubxacts, sizeof(subxact_data.nsubxacts));
+	BufFileWrite(fd, subxact_data.subxacts, len);
+
+	BufFileClose(fd);
+
+	/*
+	 * Free the memory allocated for the subxact info.  There might be one
+	 * exceptional transaction with many subxacts, and we don't want to keep
+	 * that memory allocated forever.
+	 */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the global variables.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxact_data.subxacts);
+	Assert(subxact_data.nsubxacts == 0);
+	Assert(subxact_data.nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &subxact_data.nsubxacts,
+					sizeof(subxact_data.nsubxacts)) !=
+		sizeof(subxact_data.nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	subxact_data.nsubxacts_max = 1 << my_log2(subxact_data.nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context.  We
+	 * need this information for the whole stream, so that we can keep
+	 * adding subtransaction info to it.  On stream stop we flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxact_data.subxacts = palloc(subxact_data.nsubxacts_max *
+								   sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxact_data.subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	SubXactInfo *subxacts = subxact_data.subxacts;
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_data.subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_data.subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts.
+	 * We intentionally scan the array from the tail, because we're likely
+	 * adding a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = subxact_data.nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (subxact_data.nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		subxact_data.nsubxacts_max = 128;
+
+		/* Allocate this in per-stream context */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (subxact_data.nsubxacts == subxact_data.nsubxacts_max)
+	{
+		subxact_data.nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts,
+							subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[subxact_data.nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as the offset
+	 * of this subxact.
+	 */
+	BufFileTell(stream_fd,
+				&subxacts[subxact_data.nsubxacts].fileno,
+				&subxacts[subxact_data.nsubxacts].offset);
+
+	subxact_data.nsubxacts++;
+	subxact_data.subxacts = subxacts;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ *
+ * Note: The files may not exist, so handle ENOENT as a non-error.
+ *
+ * missing_ok - if true, don't report an error for a missing file.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	SharedFileSetDeleteAll(ent->stream_fileset);
+	pfree(ent->stream_fileset);
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		SharedFileSetDeleteAll(ent->subxact_fileset);
+		pfree(ent->subxact_fileset);
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open file we'll use to serialize changes for a toplevel transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create
+ * the BufFile, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file '%s' for streamed changes", path);
+
+	/*
+	 * Create/open the BufFile under the logical streaming context, so that
+	 * it stays open until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so we allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length field itself), action code (identifying the message type) and
+ * message contents (without the subxact TransactionId value).
+ *
+ * XXX The subxact file includes CRC32C of the contents. Maybe we should
+ * include something like that here too, but doing so will not be as
+ * straightforward, because we write the file in chunks.
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
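/*
 * Writer counterpart of the record format described above, as a standalone
 * sketch (an illustration only, not from the patch) with stdio instead of
 * BufFile.  Note how the stored length covers the action byte plus payload,
 * but not the length field itself.
 */
#include <stdio.h>

static void
demo_write_change(FILE *fp, char action, const char *payload, int plen)
{
	int			len = plen + (int) sizeof(char);	/* action + payload */

	fwrite(&len, sizeof(len), 1, fp);
	fwrite(&action, sizeof(action), 1, fp);
	fwrite(payload, 1, (size_t) plen, fp);
}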
+
+/*
+ * Clean up the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxact_data.subxacts)
+		pfree(subxact_data.subxacts);
+
+	subxact_data.subxacts = NULL;
+	subxact_data.subxact_last = InvalidTransactionId;
+	subxact_data.nsubxacts = 0;
+	subxact_data.nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2151,6 +3100,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..3360bd5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,29 +47,57 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
+
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream side, however, is updated only at
+ * commit time, and with streamed transactions the commit order may differ
+ * from the order the transactions are sent in.  Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each
+ * (sub)transaction so that we don't lose the schema information on abort.
+ * To handle this, we maintain a list of xids (streamed_txns) for which we
+ * have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
 	Oid			relid;			/* relation oid */
-
+	TransactionId	xid;		/* transaction that created the record */
 	/*
 	 * Did we send the schema?  If ancestor relid is set, its schema must also
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +123,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +149,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +226,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +255,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +279,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +300,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version,
+		 * and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +331,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +394,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't
+	 * care whether it's the top-level transaction or not (we have already
+	 * sent that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +444,25 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+	relentry->xid = change->txn->xid;
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +486,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +507,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,113 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside the streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside the streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
+ * Notify downstream to open a block of streamed changes for this
+ * (sub)transaction. Sent before the changes of each streamed chunk.
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * Notify downstream to close the current block of streamed changes.
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +887,38 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema of the relation was already sent within the
+ * given streamed (toplevel) transaction. We expect a relatively small
+ * number of streamed transactions, so a simple list search is good enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record that the schema of the relation was already sent within the
+ * streamed transaction with the given (toplevel) xid.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1048,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..655144d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..402df30
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with subtransactions, DDL and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d99046..500623e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -111,6 +111,7 @@ Append
 AppendPath
 AppendRelInfo
 AppendState
+ApplySubXactData
 Archive
 ArchiveEntryPtrType
 ArchiveFormat
@@ -2370,6 +2371,7 @@ StopList
 StopWorkersData
 StrategyNumber
 StreamCtl
+StreamXidHash
 StringInfo
 StringInfoData
 StripnullState
@@ -2380,6 +2382,7 @@ SubPlanState
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
 SubXactEvent
+SubXactInfo
 SubplanResultRelHashElem
 SubqueryScan
-- 
1.8.3.1
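
For orientation, the new pgoutput callbacks above produce the following
per-chunk message flow. This is a minimal sketch with a hypothetical
driver (in reality the reorderbuffer invokes the callbacks whenever
logical_decoding_work_mem is exceeded), but the callback order is as
implemented in the patch:

	/* Hypothetical driver, illustrating only the callback order. */
	static void
	stream_one_chunk(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
	{
		pgoutput_stream_start(ctx, txn);	/* opens the chunk; sets in_streaming */

		/*
		 * ... pgoutput_change(ctx, txn, relation, change) for each decoded
		 * change, each message tagged with the (sub)transaction's xid ...
		 */

		pgoutput_stream_stop(ctx, txn);		/* closes the chunk; clears in_streaming */
	}

	/*
	 * At end of transaction, exactly one of these follows, outside any chunk:
	 *
	 *	pgoutput_stream_commit(ctx, txn, commit_lsn);
	 *	pgoutput_stream_abort(ctx, txn, abort_lsn);
	 */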

Attachment: v53-0003-Enable-streaming-for-all-subscription-TAP-tests.patch
From 0a9f193afac484e7d22a334a384f804590801cf7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 20 Nov 2019 16:41:13 +0530
Subject: [PATCH v53 3/5] Enable streaming for all subscription TAP tests

---
 src/test/subscription/t/001_rep_changes.pl              | 2 +-
 src/test/subscription/t/002_types.pl                    | 2 +-
 src/test/subscription/t/003_constraints.pl              | 2 +-
 src/test/subscription/t/004_sync.pl                     | 8 ++++----
 src/test/subscription/t/005_encoding.pl                 | 2 +-
 src/test/subscription/t/006_rewrite.pl                  | 2 +-
 src/test/subscription/t/007_ddl.pl                      | 2 +-
 src/test/subscription/t/008_diff_schema.pl              | 2 +-
 src/test/subscription/t/009_matviews.pl                 | 2 +-
 src/test/subscription/t/009_stream_simple.pl            | 2 +-
 src/test/subscription/t/010_stream_subxact.pl           | 2 +-
 src/test/subscription/t/010_truncate.pl                 | 6 +++---
 src/test/subscription/t/011_generated.pl                | 2 +-
 src/test/subscription/t/011_stream_ddl.pl               | 2 +-
 src/test/subscription/t/012_collation.pl                | 2 +-
 src/test/subscription/t/012_stream_subxact_abort.pl     | 2 +-
 src/test/subscription/t/013_stream_subxact_ddl_abort.pl | 2 +-
 src/test/subscription/t/100_bugs.pl                     | 2 +-
 18 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 0680f44..4c9b48e 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -82,7 +82,7 @@ $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub, tap_pub_ins_only WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/002_types.pl b/src/test/subscription/t/002_types.pl
index aedcab2..94c71f8 100644
--- a/src/test/subscription/t/002_types.pl
+++ b/src/test/subscription/t/002_types.pl
@@ -108,7 +108,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
index 9f140b5..21410fa 100644
--- a/src/test/subscription/t/003_constraints.pl
+++ b/src/test/subscription/t/003_constraints.pl
@@ -35,7 +35,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES;");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..a6fae9c 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
@@ -56,7 +56,7 @@ $node_publisher->safe_psql('postgres',
 
 # recreate the subscription, it will try to do initial copy
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # but it will be stuck on data copy as it will fail on constraint
@@ -78,7 +78,7 @@ is($result, qq(20), 'initial data synced for second sub');
 
 # now check another subscription for the same node pair
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false)"
+	"CREATE SUBSCRIPTION tap_sub2 CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (copy_data = false, streaming = on)"
 );
 
 # wait for it to start
@@ -100,7 +100,7 @@ $node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
 
 # recreate the subscription again
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 # and wait for data sync to finish again
diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
index aec7a17..202871a 100644
--- a/src/test/subscription/t/005_encoding.pl
+++ b/src/test/subscription/t/005_encoding.pl
@@ -26,7 +26,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/006_rewrite.pl b/src/test/subscription/t/006_rewrite.pl
index c6cda10..70c86b2 100644
--- a/src/test/subscription/t/006_rewrite.pl
+++ b/src/test/subscription/t/006_rewrite.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index 7fe6cc6..f9c8d1d 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -22,7 +22,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->wait_for_catchup('mysub');
diff --git a/src/test/subscription/t/008_diff_schema.pl b/src/test/subscription/t/008_diff_schema.pl
index 963334e..cdf9b8e 100644
--- a/src/test/subscription/t/008_diff_schema.pl
+++ b/src/test/subscription/t/008_diff_schema.pl
@@ -32,7 +32,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('tap_sub');
diff --git a/src/test/subscription/t/009_matviews.pl b/src/test/subscription/t/009_matviews.pl
index 7afc7bd..21f50c7 100644
--- a/src/test/subscription/t/009_matviews.pl
+++ b/src/test/subscription/t/009_matviews.pl
@@ -18,7 +18,7 @@ my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION mypub FOR ALL TABLES;");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub;"
+	"CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr' PUBLICATION mypub WITH (streaming = on);"
 );
 
 $node_publisher->safe_psql('postgres',
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
index 2f01133..30561d8 100644
--- a/src/test/subscription/t/009_stream_simple.pl
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
index d2ae385..9a6bac6 100644
--- a/src/test/subscription/t/010_stream_subxact.pl
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index be2c0bd..ed56fbf 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -52,13 +52,13 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub3 FOR TABLE tab3, tab4");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2"
+	"CREATE SUBSCRIPTION sub2 CONNECTION '$publisher_connstr' PUBLICATION pub2 WITH (streaming = on)"
 );
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3"
+	"CREATE SUBSCRIPTION sub3 CONNECTION '$publisher_connstr' PUBLICATION pub3 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index f35d1cb..4df1dde 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -33,7 +33,7 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 # Wait for initial sync of all subscriptions
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
index 0da39a1..c3caff6 100644
--- a/src/test/subscription/t/011_stream_ddl.pl
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/012_collation.pl b/src/test/subscription/t/012_collation.pl
index 4bfcef7..c62eb52 100644
--- a/src/test/subscription/t/012_collation.pl
+++ b/src/test/subscription/t/012_collation.pl
@@ -80,7 +80,7 @@ $node_publisher->safe_psql('postgres',
 	q{CREATE PUBLICATION pub1 FOR ALL TABLES});
 
 $node_subscriber->safe_psql('postgres',
-	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false)}
+	qq{CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (copy_data = false, streaming = on)}
 );
 
 $node_publisher->wait_for_catchup('sub1');
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 402df30..2be7542 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
index becbdd0..2da9607 100644
--- a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -40,7 +40,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tes
 
 my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',
-"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
 
 wait_for_caught_up($node_publisher, $appname);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 366a7a9..96ffc09 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -53,7 +53,7 @@ $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub1 FOR ALL TABLES");
 
 $node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
+	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1 WITH (streaming = on)"
 );
 
 $node_publisher->wait_for_catchup('sub1');
-- 
1.8.3.1

Attachment: v53-0004-Add-TAP-test-for-streaming-vs.-DDL.patch
From 1bd3154b75f9db84bc350cda7ac29d24c14d953c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:15:35 +0200
Subject: [PATCH v53 4/5] Add TAP test for streaming vs. DDL

---
 src/test/subscription/t/014_stream_through_ddl.pl | 98 +++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 src/test/subscription/t/014_stream_through_ddl.pl

diff --git a/src/test/subscription/t/014_stream_through_ddl.pl b/src/test/subscription/t/014_stream_through_ddl.pl
new file mode 100644
index 0000000..b8d78b1
--- /dev/null
+++ b/src/test/subscription/t/014_stream_through_ddl.pl
@@ -0,0 +1,98 @@
+# Test streaming of large transaction with DDL, subtransactions and rollbacks.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d text, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 1000) s(i);
+SAVEPOINT s2;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(1001, 2000) s(i);
+SAVEPOINT s3;
+ALTER TABLE test_tab ADD COLUMN d text;
+SAVEPOINT s4;
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text) FROM generate_series(2001, 3000) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(3001, 4000) s(i);
+SAVEPOINT s10;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(4001, 5000) s(i);
+ALTER TABLE test_tab ADD COLUMN d text;
+ROLLBACK TO SAVEPOINT s10;
+RELEASE SAVEPOINT s10;
+SAVEPOINT s10;
+INSERT INTO test_tab SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(5001, 6000) s(i);
+SAVEPOINT s6;
+ALTER TABLE test_tab DROP d;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(6001, 7000) s(i);
+SAVEPOINT s7;
+ALTER TABLE test_tab ADD COLUMN d text;
+INSERT INTO test_tab (a, b, c, d, e) SELECT i, md5(i::text), i, md5(i::text), i FROM generate_series(7001, 8000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(a), count(b), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(7000|7000|7000|6000|4000|4000), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1
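
The DDL cases above exercise the per-transaction schema tracking added
to pgoutput. As a sketch, the decision in maybe_send_schema() boils
down to the following (restated from the earlier patch, with the xid
variables as computed there):

	bool	schema_sent;

	if (in_streaming)
		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
	else
		schema_sent = relentry->schema_sent;

	if (schema_sent)
		return;			/* already sent within this (streamed) transaction */

	send_relation_and_attrs(relation, xid, ctx);

	if (in_streaming)
		set_schema_sent_in_streamed_txn(relentry, topxid);
	else
		relentry->schema_sent = true;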

Attachment: v53-0005-Add-streaming-option-in-pg_dump.patch
From 50252b9c04cacb6344ca9af843373058d141aa9c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 11:09:18 +0530
Subject: [PATCH v53 5/5] Add streaming option in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 17 +++++++++++++++--
 src/bin/pg_dump/pg_dump.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 2cb3f9b..ca9d1fb 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4202,6 +4202,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4241,10 +4242,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4264,6 +4272,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4287,6 +4296,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char       *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
-- 
1.8.3.1

#496Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#495)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Aug 25, 2020 at 6:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Aug 25, 2020 at 10:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the existing design is superior as it allows the flexibility
to create transaction files in different temp_tablespaces, which is
quite important to consider as we know the files will be created only
for large transactions. Once we fix the sharedfileset for a worker,
all the files will be created in the temp_tablespaces chosen at the
time the apply worker first creates the fileset, even if the setting
is changed at some later point (the user can change its value and
then reload the config, which I think will affect the worker settings
as well). This all happens because we set the tablespaces at the time
of SharedFileSetInit.

Yeah, I agree with this point: if we use a single shared fileset then
it will always use the same tablespace for all the streaming
transactions. And we might get the benefit of concurrent I/O if we
use different tablespaces, as we are not immediately flushing the
files to the disk.

Okay, so let's retain the original approach then. I have made a few
cosmetic modifications in the first two patches, which include
updating docs and comments, slightly modifying the commit message,
and changing the code to match the nearby code. One change on which
you might have a different opinion is below:

+ case WAIT_EVENT_LOGICAL_CHANGES_READ:
+ event_name = "ReorderLogicalChangesRead";
+ break;
+ case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+ event_name = "ReorderLogicalChangesWrite";
+ break;
+ case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+ event_name = "ReorderLogicalSubxactRead";
+ break;
+ case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+ event_name = "ReorderLogicalSubxactWrite";
+ break;

Why do we want to name these events starting with Reorder*? I think
these are used on the subscriber side, so there is no need to use the
word Reorder; I have removed it in the attached patch. I am planning
to push the first patch (v53-0001-Extend-the-BufFile-interface) in
this series tomorrow unless you have any comments on the same.

Your changes in 0001 and 0002 look fine to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#497Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#495)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I am planning
to push the first patch (v53-0001-Extend-the-BufFile-interface) in
this series tomorrow unless you have any comments on the same.

I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c
line 288 needs to be:

bool found PG_USED_FOR_ASSERTS_ONLY = false;

Cheers,

Jeff

#498Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#497)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I am planning
to push the first patch (v53-0001-Extend-the-BufFile-interface) in
this series tomorrow unless you have any comments on the same.

I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be:

bool found PG_USED_FOR_ASSERTS_ONLY = false;

Thanks for the report. Tom Lane has already fixed this [1].

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd

--
With Regards,
Amit Kapila.

#499Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#498)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, Aug 27, 2020 at 11:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I am planning
to push the first patch (v53-0001-Extend-the-BufFile-interface) in
this series tomorrow unless you have any comments on the same.

I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be:

bool found PG_USED_FOR_ASSERTS_ONLY = false;

Thanks for the report. Tom Lane has already fixed this [1].

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd

As discussed, I have added another test case covering the out-of-order
subtransaction rollback scenario.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

tap_test_for_out_of_order_subxact_abort.patch (application/octet-stream)
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
index 2be7542..5c4aa93 100644
--- a/src/test/subscription/t/012_stream_subxact_abort.pl
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 2;
+use Test::More tests => 3;
 
 sub wait_for_caught_up
 {
@@ -78,5 +78,27 @@ $result =
   $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
 is($result, qq(1000|0), 'check extra columns contain local defaults');
 
+# large (streamed) transaction with out of order subtransaction ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3001,3500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(4001,4500) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(5001,5500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(6001,6500) s(i);
+RELEASE s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(7001,7500) s(i);
+ROLLBACK TO s1;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1500|0), 'check extra columns contain local defaults');
+
 $node_subscriber->stop;
 $node_publisher->stop;
#500Neha Sharma
neha.sharma@enterprisedb.com
In reply to: Amit Kapila (#498)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi,

I have done a code coverage analysis on the latest patches (v53) and
below is the report for the same. Files where coverage modifications
were observed are marked with an asterisk.

OS: Ubuntu 18.04
Patch applied on commit : 77c7267c37f7fa8e5e48abda4798afdbecb2b95a

                                                                Without logical     On v53 (2,3,4,5)    Without v53-0003
                                                                decoding patch      patch               patch
  File Name                                                     %Line    %Func      %Line    %Func      %Line    %Func
  src/backend/access/transam/xact.c                             86.2     92.9       86.2     92.9       86.2     92.9
  src/backend/access/transam/xloginsert.c                       90.2     94.1       90.2     94.1       90.2     94.1
* src/backend/access/transam/xlogreader.c                       73.3     93.3       73.8     93.3       73.8     93.3
  src/backend/replication/logical/decode.c                      93.4     100        93.4     100        93.4     100
  src/backend/access/rmgrdesc/xactdesc.c                        54.4     63.6       54.4     63.6       54.4     63.6
  src/backend/replication/logical/reorderbuffer.c               93.4     96.7       93.4     96.7       93.4     96.7
  src/backend/utils/cache/inval.c                               98.1     100        98.1     100        98.1     100
  contrib/test_decoding/test_decoding.c                         86.8     95.2       86.8     95.2       86.8     95.2
* src/backend/replication/logical/logical.c                     90.9     93.5       90.9     93.5       91.8     93.5
  src/backend/access/heap/heapam.c                              86.1     94.5       86.1     94.5       86.1     94.5
* src/backend/access/index/genam.c                              90.7     91.7       91.2     91.7       91.2     91.7
  src/backend/access/table/tableam.c                            90.6     100        90.6     100        90.6     100
* src/backend/utils/time/snapmgr.c                              81.1     98.1       80.2     98.1       81.1     98.1
  src/include/access/tableam.h                                  92.5     100        92.5     100        92.5     100
  src/backend/access/heap/heapam_visibility.c                   77.8     100        77.8     100        77.8     100
* src/backend/replication/walsender.c                           90.5     97.8       90.5     97.8       90.9     100
  src/backend/catalog/pg_subscription.c                         96       100        96       100        96       100
* src/backend/commands/subscriptioncmds.c                       93.2     90         92.7     90         92.7     90
* src/backend/postmaster/pgstat.c                               64.2     85.1       63.9     85.1       64.6     86.1
* src/backend/replication/libpqwalreceiver/libpqwalreceiver.c   82.4     95         82.5     95         83.6     95
* src/backend/replication/logical/proto.c                       93.5     91.3       93.7     93.3       93.7     93.3
* src/backend/replication/logical/worker.c                      91.6     96         91.5     97.4       91.9     97.4
* src/backend/replication/pgoutput/pgoutput.c                   81.9     100        85.5     100        86.2     100
  src/backend/replication/slotfuncs.c                           93       93.8       93       93.8       93       93.8
  src/include/pgstat.h                                          100      -          100      -          100      -
  src/backend/replication/logical/logicalfuncs.c                87.1     90         87.1     90         87.1     90
* src/backend/storage/file/buffile.c                            68.3     85         69.6     85         69.6     85
  src/backend/storage/file/fd.c                                 81.1     93         81.1     93         81.1     93
* src/backend/storage/file/sharedfileset.c                      77.7     90.9       93.2     100        93.2     100
  src/backend/utils/sort/logtape.c                              94.4     100        94.4     100        94.4     100
  src/backend/utils/sort/sharedtuplestore.c                     90.1     90.9       90.1     90.9       90.1     90.9
Thanks.
--
Regards,
Neha Sharma

On Thu, Aug 27, 2020 at 11:16 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:


On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com>

wrote:

I am planning
to push the first patch (v53-0001-Extend-the-BufFile-interface) in
this series tomorrow unless you have any comments on the same.

I'm getting compiler warnings now,

src/backend/storage/file/sharedfileset.c line 288 needs to be:

bool found PG_USED_FOR_ASSERTS_ONLY = false;

Thanks for the report. Tom Lane has already fixed this [1].

[1] -
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd

--
With Regards,
Amit Kapila.

#501Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#499)
2 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Aug 28, 2020 at 2:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

As discussed, I have added another test case covering the out-of-order
subtransaction rollback scenario.

+# large (streamed) transaction with out of order subtransaction ROLLBACKs
+$node_publisher->safe_psql('postgres', q{

How about writing a comment as: "large (streamed) transaction with
subscriber receiving out of order subtransaction ROLLBACKs"?

I have reviewed and modified a number of things in the attached patch:
1. In apply_handle_origin, improved the check for streamed xacts.
2. In apply_handle_stream_commit(), added CHECK_FOR_INTERRUPTS while
applying changes in the loop.
3. In DEBUG messages, print the path with double-quotes as we are
doing in all other places.
4.
+ /*
+ * Exit if streaming option is changed. The launcher will start new
+ * worker.
+ */
+ if (newsub->stream != MySubscription->stream)
+ {
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\" will "
+ "restart because subscription's streaming option were changed",
+ MySubscription->name)));
+
+ proc_exit(0);
+ }
+
We don't need a separate check like this. I have merged this into one
of the existing checks.
5.
subxact_info_write()
{
..
+ if (subxact_data.nsubxacts == 0)
+ {
+ if (ent->subxact_fileset)
+ {
+ cleanup_subxact_info();
+ BufFileDeleteShared(ent->subxact_fileset, path);
+ pfree(ent->subxact_fileset);
+ ent->subxact_fileset = NULL;
+ }

I don't think it is right to use the BufFileDeleteShared interface
here, because it won't perform SharedFileSetUnregister. That means
that if the server exits after the above code has executed, it will
crash in SharedFileSetDeleteOnProcExit, which will try to access the
already deleted fileset entry. Fixed this by calling
SharedFileSetDeleteAll() instead.
Another related problem is that in the function
SharedFileSetDeleteOnProcExit, it tries to delete the list element
while traversing the list with the 'foreach' construct, which makes
the behavior of the list traversal unpredictable (see the sketch
after this list). I have fixed this in a separate patch,
v54-0001-Fix-the-SharedFileSetUnregister-API; if you are fine with
it, I would like to commit it, as it fixes a problem in the existing
commit 808e13b282.
6. Function stream_cleanup_files() contains a missing_ok argument
which is not used, so I removed it.
7. In pgoutput.c, changed the ordering of functions to make them
consistent with their declarations.
8.
typedef struct RelationSyncEntry
{
Oid relid; /* relation oid */
+ TransactionId xid; /* transaction that created the record */

Removed the above parameter as it doesn't seem to be required per the
new design in the patch.

Apart from the above, I have added/changed quite a few comments and
made a few other cosmetic changes. Kindly review and let me know what
you think about the changes.
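
To illustrate the list-traversal hazard mentioned in item 5, here is a
minimal self-contained C sketch -- generic pointers, not the PostgreSQL
List API, and remove_and_free is a hypothetical helper -- contrasting
the unsafe pattern with the take-the-head pattern used by the attached
v54-0001 fix:

struct node
{
    struct node *next;
};

/* hypothetical helper: unlinks 'n' from the list headed by *head and frees it */
extern void remove_and_free(struct node **head, struct node *n);

void
unsafe_sweep(struct node **head)
{
    struct node *n;

    /* BUG: the body frees n, then the loop header reads n->next */
    for (n = *head; n != NULL; n = n->next)
        remove_and_free(head, n);
}

void
safe_sweep(struct node **head)
{
    /* always remove the current head, mirroring the while/linitial fix */
    while (*head != NULL)
        remove_and_free(head, *head);
}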

One more comment for which I haven't done anything yet.
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ entry->streamed_txns = lappend_int(entry->streamed_txns, xid);

Is it a good idea to append the xid with lappend_int? Won't we need
something equivalent for uint32? If so, I think we have a couple of
options: (a) use the lcons method and append a pointer to the xid (I
think we need to allocate memory for the xid if we want to use this
idea), or (b) use an array instead. What do you think?
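
For reference, a rough sketch of option (a), assuming the xid is
palloc'd so that a plain List can hold a pointer to it (lappend is
shown for illustration where the text above says lcons; this is a
sketch of the suggestion, not the final patch):

    MemoryContext oldctx;
    TransactionId *xid_ptr;

    oldctx = MemoryContextSwitchTo(CacheMemoryContext);

    /* allocate the xid so a pointer to it can live in the List; int
     * cells can't represent the full uint32 TransactionId range */
    xid_ptr = (TransactionId *) palloc(sizeof(TransactionId));
    *xid_ptr = xid;
    entry->streamed_txns = lappend(entry->streamed_txns, xid_ptr);

    MemoryContextSwitchTo(oldctx);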

--
With Regards,
Amit Kapila.

Attachments:

v54-0001-Fix-the-SharedFileSetUnregister-API.patch (application/octet-stream)
From 58eac081e5099329ebda3b983fe9adfba0d569e9 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 28 Aug 2020 12:42:09 +0530
Subject: [PATCH v54 1/2] Fix the SharedFileSetUnregister API.

Commit 808e13b282 introduced a few APIs to extend the existing BufFile
interface. In SharedFileSetDeleteOnProcExit, it tries to delete the list
element while traversing the list with the 'foreach' construct, which makes
the behavior of list traversal unpredictable.
---
 src/backend/storage/file/sharedfileset.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 8b96e81fff..859c22e79b 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -266,12 +266,16 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 static void
 SharedFileSetDeleteOnProcExit(int status, Datum arg)
 {
-	ListCell   *l;
-
-	/* Loop over all the pending shared fileset entry */
-	foreach(l, filesetlist)
+	/*
+	 * Remove all the pending shared fileset entries. We don't use foreach() here
+	 * because SharedFileSetDeleteAll will remove the current element in
+	 * filesetlist. Even if we had used foreach_delete_current() to remove the
+	 * element from filesetlist, it could only fix up the state of one of the
+	 * loops; see SharedFileSetUnregister.
+	 */
+	while (list_length(filesetlist) > 0)
 	{
-		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSet *fileset = (SharedFileSet *) linitial(filesetlist);
 
 		SharedFileSetDeleteAll(fileset);
 	}
@@ -301,7 +305,7 @@ SharedFileSetUnregister(SharedFileSet *input_fileset)
 		/* Remove the entry from the list */
 		if (input_fileset == fileset)
 		{
-			filesetlist = list_delete_cell(filesetlist, l);
+			filesetlist = foreach_delete_current(filesetlist, l);
 			return;
 		}
 	}
-- 
2.28.0.windows.1

v54-0002-Add-support-for-streaming-to-built-in-logical-re.patch (application/octet-stream)
From 56d054f13de3e1caa901b2c19fabd2935ebf7ccc Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v54 2/2] Add support for streaming to built-in logical
 replication.

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so we
have nowhere to send the data anyway.

Author: Tomas Vondra, Dilip Kumar and Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/monitoring.sgml                  |  16 +
 doc/src/sgml/ref/alter_subscription.sgml      |   5 +-
 doc/src/sgml/ref/create_subscription.sgml     |  11 +
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/commands/subscriptioncmds.c       |  46 +-
 src/backend/postmaster/pgstat.c               |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |   4 +
 src/backend/replication/logical/proto.c       | 162 ++-
 src/backend/replication/logical/worker.c      | 951 +++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c   | 348 ++++++-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/pgstat.h                          |   6 +-
 src/include/replication/logicalproto.h        |  46 +-
 src/include/replication/walreceiver.h         |   1 +
 src/test/subscription/t/009_stream_simple.pl  |  86 ++
 src/test/subscription/t/010_stream_subxact.pl | 102 ++
 src/test/subscription/t/011_stream_ddl.pl     |  95 ++
 .../t/012_stream_subxact_abort.pl             |  82 ++
 .../t/013_stream_subxact_ddl_abort.pl         |  84 ++
 src/test/subscription/t/015_stream_binary.pl  |  86 ++
 src/tools/pgindent/typedefs.list              |   3 +
 21 files changed, 2104 insertions(+), 46 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 17a0df6978..7fa1d79cc0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1509,6 +1509,22 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>WALWrite</literal></entry>
       <entry>Waiting for a write to a WAL file.</entry>
      </row>
+     <row>
+      <entry><literal>LogicalChangesRead</literal></entry>
+      <entry>Waiting for a read from a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalChangesWrite</literal></entry>
+      <entry>Waiting for a write to a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactRead</literal></entry>
+      <entry>Waiting for a read from a logical subxact file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactWrite</literal></entry>
+      <entry>Waiting for a write to a logical subxact file.</entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70cdf..a1666b370b 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c54fe..b7d7457d00 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf0c6..311d46225a 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377a85..9426e1d84b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,11 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+	{
+		*streaming_given = false;
+		*streaming = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +200,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +353,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +378,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -427,6 +446,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
+	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -698,6 +718,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +729,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +762,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +786,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +831,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23614..5f4b168fd1 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "LogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "LogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "LogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "LogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 8afa5a29b4..ad574099ff 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -425,6 +425,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097bf5..f82236ed93 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,126 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+/*
+ * Write the information for the start stream message to the output stream.
+ */
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+/*
+ * Read the information about the start stream message from output stream.
+ */
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+/*
+ * Write the stop stream message to the output stream.
+ */
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+/*
+ * Write STREAM COMMIT to the output stream.
+ */
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read STREAM COMMIT from the output stream.
+ */
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+/*
+ * Write STREAM ABORT to the output stream. Note that xid and subxid will be
+ * the same for a top-level transaction abort.
+ */
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+/*
+ * Read STREAM ABORT from the output stream.
+ */
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b576e342cb..f022b81e76 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,45 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead, the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions has
+ * to deal with aborts of both the toplevel transaction and subtransactions.
+ * This is achieved by tracking offsets for subtransactions, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of normal temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size limit,
+ * (b) it provides a way for automatic cleanup on error, and (c) it allows
+ * these files to survive across local transactions and to be opened and
+ * closed at stream start and stop. We decided to use the SharedFileSet
+ * infrastructure because without it the files are deleted when the file is
+ * closed, and if we kept the stream files open across start/stop streams it
+ * would consume a lot of memory (more than 8K for each BufFile, and there
+ * could be multiple such BufFiles, as the subscriber could receive multiple
+ * start/stop streams for different transactions before getting the commit).
+ * Moreover, if we didn't use SharedFileSet we would also need to invent a new
+ * way to pass filenames to the BufFile APIs so that we are allowed to open
+ * the desired file across multiple stream-open calls for the same
+ * transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +67,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +99,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +109,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +138,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry. Whenever we see a new xid we create this entry in the
+ * xidhash and along with it create the streaming file and store the fileset handle.
+ * The subxact file is created iff there is any subxact info under this xid. This
+ * entry is used on the subsequent streams for the xid to get the corresponding
+ * fileset handles, so storing them in hash makes the search faster.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +166,65 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with shared file
+ * set for streaming and subxact files.
+ */
+static HTAB *xidhash = NULL;
+/* BufFile handle of the current streaming file */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+/* Sub-transaction data for the current streaming transaction */
+typedef struct ApplySubXactData
+{
+	uint32		nsubxacts;		/* number of sub-transactions */
+	uint32		nsubxacts_max;	/* current capacity of subxacts */
+	TransactionId subxact_last; /* xid of the last sub-transaction */
+	SubXactInfo *subxacts;		/* sub-xact offset in changes file */
+} ApplySubXactData;
+
+static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +296,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,16 +757,335 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside streaming transaction or inside
+	 * remote transaction and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if (!in_streamed_transaction &&
+		(!in_remote_transaction ||
+		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop. We need the transaction for handling the buffile, used
+	 * for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/*
+	 * Initialize the xidhash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing subxact file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
+	 * just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * We can't use the binary search here as subxact XIDs won't
+		 * necessarily arrive in sorted order, consider the case where we have
+		 * released the savepoint for multiple subtransactions and then
+		 * performed rollback to savepoint for one of the earlier
+		 * sub-transaction.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		for (i = subxact_data.nsubxacts; i > 0; i--)
+		{
+			if (subxact_data.subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxact_data.subxacts[subidx].fileno,
+							  subxact_data.subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		subxact_data.nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	/*
+	 * Allocate file handle and memory required to process all the messages in
+	 * TopTransactionContext to avoid them getting reset after each message is
+	 * processed.
+	 */
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file \"%s\"", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the handle apply_dispatch methods are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -635,6 +1099,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1117,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1156,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1274,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1426,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1799,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1940,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2068,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when the streaming mode
+	 * is enabled. This context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2180,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1938,6 +2444,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->name, MySubscription->name) != 0 ||
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
+		newsub->stream != MySubscription->stream ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -1979,6 +2486,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there is no subtransaction then there is nothing to do, but if we
+	 * already have a subxact file then delete it.
+	 */
+	if (subxact_data.nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			SharedFileSetDeleteAll(ent->subxact_fileset);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+		return;
+	}
+
+	subxact_filename(path, subid, xid);
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &subxact_data.nsubxacts, sizeof(subxact_data.nsubxacts));
+	BufFileWrite(fd, subxact_data.subxacts, len);
+
+	BufFileClose(fd);
+
+	/* free the memory allocated for subxact info */
+	cleanup_subxact_info();
+}
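(For reference, the file written above is just the subxact count followed by
the array of per-subxact records. A sketch of the conceptual layout; the
field types here are assumptions, inferred from how subxact_info_add fills
the array below, where BufFileTell supplies the fileno/offset pair:

	typedef struct SubXactInfo
	{
		TransactionId	xid;		/* XID of the subxact */
		int				fileno;		/* file segment of its first change */
		off_t			offset;		/* offset of its first change */
	} SubXactInfo;

	/*
	 * <subid>-<xid>.subxacts file layout:
	 *    nsubxacts                        -- number of entries
	 *    SubXactInfo subxacts[nsubxacts]  -- one record per subxact
	 */
)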
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the structure subxact_data that can be
+ * used later.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxact_data.subxacts);
+	Assert(subxact_data.nsubxacts == 0);
+	Assert(subxact_data.nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &subxact_data.nsubxacts,
+					sizeof(subxact_data.nsubxacts)) !=
+		sizeof(subxact_data.nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	subxact_data.nsubxacts_max = 1 << my_log2(subxact_data.nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context. We need
+	 * this information for the whole stream so that we can keep adding
+	 * subtransaction info to it. On stream stop we flush this information to
+	 * the subxact file and reset the logical streaming context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxact_data.subxacts = palloc(subxact_data.nsubxacts_max *
+								   sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxact_data.subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	SubXactInfo *subxacts = subxact_data.subxacts;
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so ignore it (its first change was recorded earlier).
+	 */
+	if (subxact_data.subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_data.subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = subxact_data.nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (subxact_data.nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		subxact_data.nsubxacts_max = 128;
+
+		/*
+		 * Allocate this memory for subxacts in per-stream context, see
+		 * subxact_info_read.
+		 */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (subxact_data.nsubxacts == subxact_data.nsubxacts_max)
+	{
+		subxact_data.nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts,
+							subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[subxact_data.nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd,
+				&subxacts[subxact_data.nsubxacts].fileno,
+				&subxacts[subxact_data.nsubxacts].offset);
+
+	subxact_data.nsubxacts++;
+	subxact_data.subxacts = subxacts;
+}
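(If the XXX above holds and subxact XIDs do arrive in increasing order, the
tail scan could become a binary search. A hedged sketch of that lookup,
illustrative only; it also ignores XID wraparound, which a real version would
handle with TransactionIdPrecedes:

	int64		lo = 0;
	int64		hi = subxact_data.nsubxacts;

	/* find the first entry with subxacts[i].xid >= xid */
	while (lo < hi)
	{
		int64		mid = lo + (hi - lo) / 2;

		if (subxacts[mid].xid < xid)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* if the entry exists, the subxact is already tracked */
	if (lo < subxact_data.nsubxacts && subxacts[lo].xid == xid)
		return;
)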
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	SharedFileSetDeleteAll(ent->stream_fileset);
+	pfree(ent->stream_fileset);
+	ent->stream_fileset = NULL;
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		SharedFileSetDeleteAll(ent->subxact_fileset);
+		pfree(ent->subxact_fileset);
+		ent->subxact_fileset = NULL;
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open a file that we'll use to serialize changes for a toplevel
+ * transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buffile, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * Create/open the buffile under the logical streaming context so that it
+	 * remains available until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: the total length (which does
+ * not count the length field itself), the action code (identifying the
+ * message type), and the message contents (without the subxact
+ * TransactionId value).
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
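(The matching read side lives in apply_handle_stream_commit, not shown in
this excerpt; roughly, it walks the records back like this. A hedged sketch,
not the patch's exact code:

	int			len;

	/* each record starts with its total length (action byte included) */
	while (BufFileRead(fd, &len, sizeof(len)) == sizeof(len))
	{
		char	   *buf = palloc(len);

		if (BufFileRead(fd, buf, len) != len)
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not read from streaming transaction's changes file")));

		/*
		 * buf[0] is the action; the rest is the original message payload,
		 * which gets handed to the per-action apply handler.
		 */
		pfree(buf);
	}
)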
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxact_data.subxacts)
+		pfree(subxact_data.subxacts);
+
+	subxact_data.subxacts = NULL;
+	subxact_data.subxact_last = InvalidTransactionId;
+	subxact_data.nsubxacts = 0;
+	subxact_data.nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2151,6 +3091,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc4c1..129395c072 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,17 +47,40 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent. Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort. To handle this, we
+ * maintain a list of xids (streamed_txns) for which we have already sent the
+ * schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
@@ -70,6 +93,8 @@ typedef struct RelationSyncEntry
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +120,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +146,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +223,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
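(For context, these options arrive via the downstream's START_REPLICATION
command. A hedged example of what a walsender might receive when the
subscription has streaming enabled; slot and publication names are made up,
and the option spellings follow the parsing above:

	START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
		(proto_version '2', publication_names '"tap_pub"', streaming 'on')
)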
@@ -194,6 +252,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +276,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +297,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +328,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +391,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +441,24 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +482,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +503,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +535,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +555,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +580,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +601,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +627,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +659,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -605,6 +739,118 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * STREAM START callback
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char	   *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * STREAM STOP callback
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
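(Taken together with the apply_dispatch cases earlier in this patch, a
streamed transaction reaches the downstream as a sequence along these lines;
message bytes per the dispatch switch, payload details simplified for
illustration:

	'S'  stream start (xid, first_segment)   <- pgoutput_stream_start
	     ... changes, each tagged with its xid ...
	'E'  stream stop                         <- pgoutput_stream_stop
	'S'  stream start (xid)                  <- next chunk of the same xact
	     ... more changes ...
	'E'  stream stop
	'c'  stream commit (xid, commit_lsn)     <- or 'A' stream abort (xid, subxid)
)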
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -641,6 +887,39 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record in the rel sync entry that we have already sent the schema of the
+ * relation for this xid.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -771,11 +1050,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -811,7 +1123,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35000..1d091546bf 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming of in-progress transactions */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1edf..0dfbac46b4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc85c..655144d03a 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbee54..6c0a4e30a8 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000000..d2ae38592b
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check columns added by in-transaction DDL are replicated correctly');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000000..402df30f59
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,82 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..becbdd0578
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of large transaction with subtransactions, DDL, and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rolled-back DDL and subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000000..fa2362e32b
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d990463ce..500623e230 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -111,6 +111,7 @@ Append
 AppendPath
 AppendRelInfo
 AppendState
+ApplySubXactData
 Archive
 ArchiveEntryPtrType
 ArchiveFormat
@@ -2370,6 +2371,7 @@ StopList
 StopWorkersData
 StrategyNumber
 StreamCtl
+StreamXidHash
 StringInfo
 StringInfoData
 StripnullState
@@ -2380,6 +2382,7 @@ SubPlanState
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
 SubXactEvent
+SubXactInfo
 SubplanResultRelHashElem
 SubqueryScan
-- 
2.28.0.windows.1

#502Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#501)
2 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Aug 29, 2020 at 5:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 28, 2020 at 2:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

As discussed, I have added another test case covering the out-of-order
subtransaction rollback scenario.

+# large (streamed) transaction with out of order subtransaction ROLLBACKs
+$node_publisher->safe_psql('postgres', q{

How about writing a comment as: "large (streamed) transaction with
subscriber receiving out of order subtransaction ROLLBACKs"?

I have fixed this and merged it with 0002.

I have reviewed and modified a number of things in the attached patch:
1. In apply_handle_origin, improved the check for streamed xacts.
2. In apply_handle_stream_commit() while applying changes in the loop,
added CHECK_FOR_INTERRUPTS.
3. In DEBUG messages, print the path in double quotes, as we do in all
other places.
4.
+ /*
+ * Exit if streaming option is changed. The launcher will start new
+ * worker.
+ */
+ if (newsub->stream != MySubscription->stream)
+ {
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\" will "
+ "restart because subscription's streaming option were changed",
+ MySubscription->name)));
+
+ proc_exit(0);
+ }
+
We don't need a separate check like this. I have merged this into one
of the existing checks.
5.
subxact_info_write()
{
..
+ if (subxact_data.nsubxacts == 0)
+ {
+ if (ent->subxact_fileset)
+ {
+ cleanup_subxact_info();
+ BufFileDeleteShared(ent->subxact_fileset, path);
+ pfree(ent->subxact_fileset);
+ ent->subxact_fileset = NULL;
+ }

I don't think it is right to use the BufFileDeleteShared interface here
because it won't perform SharedFileSetUnregister, which means that if
the server exits after the above code executes, it will crash in
SharedFileSetDeleteOnProcExit, which will try to access the already
deleted fileset entry. Fixed this by calling SharedFileSetDeleteAll()
instead.
Another related problem is that in the function
SharedFileSetDeleteOnProcExit, it tries to delete the list element
while traversing the list with the 'foreach' construct, which makes
the behavior of the list traversal unpredictable (a minimal sketch of
this hazard appears below). I have fixed this in a separate patch,
v54-0001-Fix-the-SharedFileSetUnregister-API; if you are fine with
this, I would like to commit it, as it fixes a problem in the existing
commit 808e13b282.
6. Function stream_cleanup_files() contains a missing_ok argument
which is not used, so I removed it.
7. In pgoutput.c, changed the ordering of functions to make them
consistent with their declarations.
8.
typedef struct RelationSyncEntry
{
Oid relid; /* relation oid */
+ TransactionId xid; /* transaction that created the record */

Removed the above parameter, as it doesn't seem to be required per the
new design in the patch.

Apart from the above, I have added/changed quite a few comments and
made a few other cosmetic changes. Kindly review and let me know what
you think about the changes.

I have reviewed your changes and they look fine to me. The bug fix in
0001 also looks fine.
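
To make the list-traversal hazard concrete, here is a minimal sketch
(hypothetical code, not taken from either patch; destroy_entry() is an
assumed stand-in for SharedFileSetDeleteAll, which ends up removing its
entry from the file-level list):

/* file-level list, as in sharedfileset.c */
static List *filesetlist = NIL;

/* assumed cleanup callback: deletes the fileset's entry from filesetlist */
static void destroy_entry(SharedFileSet *fileset);

/* UNSAFE: destroy_entry() removes the current cell behind foreach's
 * back, so the traversal state becomes unpredictable. */
static void
cleanup_unsafe(void)
{
	ListCell   *lc;

	foreach(lc, filesetlist)
		destroy_entry((SharedFileSet *) lfirst(lc));
}

/* SAFE: always consume the list head; there is no traversal state to
 * corrupt. This is the shape the fix uses. */
static void
cleanup_safe(void)
{
	while (list_length(filesetlist) > 0)
		destroy_entry((SharedFileSet *) linitial(filesetlist));
}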

One more comment for which I haven't done anything yet.
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ entry->streamed_txns = lappend_int(entry->streamed_txns, xid);

Is it a good idea to append the xid with lappend_int? Won't we need
something equivalent for uint32? If so, I think we have a couple of
options: (a) use the lcons method and accordingly store a pointer to
the xid (I think we would need to allocate memory for the xid to use
this idea), or (b) use an array instead. What do you think?

BTW, Oid is internally mapped to uint32, but using lappend_oid might
not look good. So maybe we can provide a lappend_uint32 option? Using
an array is also not a bad idea, but providing a lappend_uint32 option
looks more appealing to me.
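
For illustration, here is a sketch of option (a) using only the
existing List APIs (hypothetical code, not part of the attached
patches; the lookup helper's name and shape are assumptions based on
the snippet quoted above):

/* Option (a): store each streamed xid behind a separately palloc'd pointer. */
static void
set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
	MemoryContext oldctx = MemoryContextSwitchTo(CacheMemoryContext);
	TransactionId *xidptr = (TransactionId *) palloc(sizeof(TransactionId));

	*xidptr = xid;
	entry->streamed_txns = lappend(entry->streamed_txns, xidptr);

	MemoryContextSwitchTo(oldctx);
}

/* The lookup then pays one pointer dereference per list cell. */
static bool
schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
	ListCell   *lc;

	foreach(lc, entry->streamed_txns)
	{
		if (*(TransactionId *) lfirst(lc) == xid)
			return true;
	}

	return false;
}

A hypothetical lappend_uint32() (or lappend_xid()) would collapse the
first function back to a single lappend call and avoid the per-xid
allocation, which is why that option looks more appealing.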

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v55-0001-Fix-the-SharedFileSetUnregister-API.patch (application/octet-stream)
From 8d0802b308d65c83bf285f38992f97f748dd9456 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 28 Aug 2020 12:42:09 +0530
Subject: [PATCH v55 1/5] Fix the SharedFileSetUnregister API.

Commit 808e13b282 introduced a few APIs to extend the existing BufFile
interface. In SharedFileSetDeleteOnProcExit, it tries to delete the list
element while traversing the list with the 'foreach' construct, which makes
the behavior of the list traversal unpredictable.
---
 src/backend/storage/file/sharedfileset.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 8b96e81..859c22e 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -266,12 +266,16 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 static void
 SharedFileSetDeleteOnProcExit(int status, Datum arg)
 {
-	ListCell   *l;
-
-	/* Loop over all the pending shared fileset entry */
-	foreach(l, filesetlist)
+	/*
+	 * Remove all the pending shared fileset entries. We don't use foreach()
+	 * here because SharedFileSetDeleteAll will remove the current element in
+	 * filesetlist. Though foreach_delete_current() is used to remove an
+	 * element from filesetlist (see SharedFileSetUnregister), it can only fix
+	 * up the state of one of the loops.
+	 */
+	while (list_length(filesetlist) > 0)
 	{
-		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSet *fileset = (SharedFileSet *) linitial(filesetlist);
 
 		SharedFileSetDeleteAll(fileset);
 	}
@@ -301,7 +305,7 @@ SharedFileSetUnregister(SharedFileSet *input_fileset)
 		/* Remove the entry from the list */
 		if (input_fileset == fileset)
 		{
-			filesetlist = list_delete_cell(filesetlist, l);
+			filesetlist = foreach_delete_current(filesetlist, l);
 			return;
 		}
 	}
-- 
1.8.3.1

v55-0002-Add-support-for-streaming-to-built-in-logical-re.patch (application/octet-stream)
From 014b492d3d9baaf4aa6dcb40e836100f4ec4b6b6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v55 2/5] Add support for streaming to built-in logical
 replication.

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transaction by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so we
have nowhere to send the data anyway.

Author: Tomas Vondra, Dilip Kumar and Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/monitoring.sgml                       |  16 +
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  46 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 162 +++-
 src/backend/replication/logical/worker.c           | 951 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 348 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  46 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/009_stream_simple.pl       |  86 ++
 src/test/subscription/t/010_stream_subxact.pl      | 102 +++
 src/test/subscription/t/011_stream_ddl.pl          |  95 ++
 .../subscription/t/012_stream_subxact_abort.pl     | 105 +++
 .../subscription/t/013_stream_subxact_ddl_abort.pl |  84 ++
 src/test/subscription/t/015_stream_binary.pl       |  86 ++
 src/tools/pgindent/typedefs.list                   |   3 +
 21 files changed, 2127 insertions(+), 46 deletions(-)
 create mode 100644 src/test/subscription/t/009_stream_simple.pl
 create mode 100644 src/test/subscription/t/010_stream_subxact.pl
 create mode 100644 src/test/subscription/t/011_stream_ddl.pl
 create mode 100644 src/test/subscription/t/012_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/013_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/015_stream_binary.pl

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 17a0df6..7fa1d79 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1509,6 +1509,22 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>WALWrite</literal></entry>
       <entry>Waiting for a write to a WAL file.</entry>
      </row>
+     <row>
+      <entry><literal>LogicalChangesRead</literal></entry>
+      <entry>Waiting for a read from a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalChangesWrite</literal></entry>
+      <entry>Waiting for a write to a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactRead</literal></entry>
+      <entry>Waiting for a read from a logical subxact file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactWrite</literal></entry>
+      <entry>Waiting for a write to a logical subxact file.</entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a1666b3 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..9426e1d 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,11 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+	{
+		*streaming_given = false;
+		*streaming = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +200,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +353,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +378,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -427,6 +446,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
+	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -698,6 +718,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +729,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +762,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +786,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +831,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23..5f4b168 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "LogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "LogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "LogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "LogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 8afa5a2..ad57409 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -425,6 +425,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..f82236e 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,126 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+/*
+ * Write the information for the start stream message to the output stream.
+ */
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+/*
+ * Read the information about the start stream message from output stream.
+ */
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+/*
+ * Write the stop stream message to the output stream.
+ */
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+/*
+ * Write STREAM COMMIT to the output stream.
+ */
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read STREAM COMMIT from the output stream.
+ */
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+/*
+ * Write STREAM ABORT to the output stream. Note that xid and subxid will be
+ * same for the top-level transaction abort.
+ */
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+/*
+ * Read STREAM ABORT from the output stream.
+ */
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b576e34..f022b81 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,45 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead, the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions has
+ * to deal with aborts of both the toplevel transaction and subtransactions.
+ * This is achieved by tracking offsets for subtransactions, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of normal temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size limit,
+ * (b) it provides a way for automatic cleanup on error, and (c) it allows
+ * these files to survive across local transactions, so they can be opened and
+ * closed at each stream start/stop. We decided to use the SharedFileSet
+ * infrastructure because without it the files would be deleted as soon as the
+ * file is closed, and keeping the stream files open across stream start/stop
+ * would instead consume a lot of memory (more than 8K for each BufFile, and
+ * there could be multiple such BufFiles, as the subscriber could receive
+ * multiple start/stop streams for different transactions before getting the
+ * commit). Moreover, if we didn't use SharedFileSet we would also need to
+ * invent a new way to pass filenames to the BufFile APIs so that we are
+ * allowed to open the desired file across multiple stream-open calls for the
+ * same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +67,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +99,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +109,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +138,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry. Whenever we see a new xid we create this entry in
+ * the xidhash, and along with it create the streaming file and store the
+ * fileset handle. The subxact file is created iff there is any subxact info
+ * under this xid. This entry is used on subsequent streams for the xid to
+ * get the corresponding fileset handles, so storing them in a hash makes the
+ * search faster.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +166,65 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with shared file
+ * set for streaming and subxact files.
+ */
+static HTAB *xidhash = NULL;
+/* BufFile handle of the current streaming file */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+/* Sub-transaction data for the current streaming transaction */
+typedef struct ApplySubXactData
+{
+	uint32		nsubxacts;		/* number of sub-transactions */
+	uint32		nsubxacts_max;	/* current capacity of subxacts */
+	TransactionId subxact_last; /* xid of the last sub-transaction */
+	SubXactInfo *subxacts;		/* sub-xact offset in changes file */
+} ApplySubXactData;
+
+static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +296,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +757,336 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside a streamed transaction or inside a
+	 * remote transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if (!in_streamed_transaction &&
+		(!in_remote_transaction ||
+		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop. We need the transaction for handling the buffile, used
+	 * for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/*
+	 * Initialize the xidhash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing subxact file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
+	 * just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * We can't use a binary search here as subxact XIDs won't
+		 * necessarily arrive in sorted order; consider the case where we
+		 * have released the savepoints for multiple subtransactions and then
+		 * performed a rollback to the savepoint of one of the earlier
+		 * subtransactions.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		for (i = subxact_data.nsubxacts; i > 0; i--)
+		{
+			if (subxact_data.subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty subtransaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxact_data.subxacts[subidx].fileno,
+							  subxact_data.subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		subxact_data.nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	/*
+	 * Allocate file handle and memory required to process all the messages in
+	 * TopTransactionContext to avoid them getting reset after each message is
+	 * processed.
+	 */
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file \"%s\"", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1099,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1117,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1156,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1274,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1426,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1799,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1940,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2068,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when the streaming mode
+	 * is enabled. This context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2180,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1938,6 +2444,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->name, MySubscription->name) != 0 ||
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
+		newsub->stream != MySubscription->stream ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -1979,6 +2486,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there is no subtransaction then there is nothing to do, but if we
+	 * already have a subxact file then delete it.
+	 */
+	if (subxact_data.nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			SharedFileSetDeleteAll(ent->subxact_fileset);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+		return;
+	}
+
+	subxact_filename(path, subid, xid);
+
+	/*
+	 * Create the subxact file if it is not already created; otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so we need to allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &subxact_data.nsubxacts, sizeof(subxact_data.nsubxacts));
+	BufFileWrite(fd, subxact_data.subxacts, len);
+
+	BufFileClose(fd);
+
+	/* free the memory allocated for subxact info */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the structure subxact_data that can be
+ * used later.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxact_data.subxacts);
+	Assert(subxact_data.nsubxacts == 0);
+	Assert(subxact_data.nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &subxact_data.nsubxacts,
+					sizeof(subxact_data.nsubxacts)) !=
+		sizeof(subxact_data.nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	subxact_data.nsubxacts_max = 1 << my_log2(subxact_data.nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context. We need
+	 * this information for the duration of the stream so that we can add
+	 * subtransaction info to it. On stream stop we will flush this
+	 * information to the subxact file and reset the logical streaming context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxact_data.subxacts = palloc(subxact_data.nsubxacts_max *
+								   sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxact_data.subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	SubXactInfo *subxacts = subxact_data.subxacts;
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_data.subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_data.subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxact. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = subxact_data.nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (subxact_data.nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		subxact_data.nsubxacts_max = 128;
+
+		/*
+		 * Allocate this memory for subxacts in per-stream context, see
+		 * subxact_info_read.
+		 */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (subxact_data.nsubxacts == subxact_data.nsubxacts_max)
+	{
+		subxact_data.nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts,
+							subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[subxact_data.nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd,
+				&subxacts[subxact_data.nsubxacts].fileno,
+				&subxacts[subxact_data.nsubxacts].offset);
+
+	subxact_data.nsubxacts++;
+	subxact_data.subxacts = subxacts;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	SharedFileSetDeleteAll(ent->stream_fileset);
+	pfree(ent->stream_fileset);
+	ent->stream_fileset = NULL;
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		SharedFileSetDeleteAll(ent->subxact_fileset);
+		pfree(ent->subxact_fileset);
+		ent->subxact_fileset = NULL;
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open a file that we'll use to serialize changes for a toplevel
+ * transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buffile, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * Create/open the buffiles under the logical streaming context, so that
+	 * they stay open until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with a length (not including
+ * the length field itself), an action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxact_data.subxacts)
+		pfree(subxact_data.subxacts);
+
+	subxact_data.subxacts = NULL;
+	subxact_data.subxact_last = InvalidTransactionId;
+	subxact_data.nsubxacts = 0;
+	subxact_data.nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2151,6 +3091,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..129395c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,17 +47,40 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream side is, however, updated only at
+ * commit time, and with streamed transactions the commit order may differ
+ * from the order in which the transactions are sent. Also, a (sub)
+ * transaction might get aborted, so we need to send the schema for each
+ * (sub)transaction so that we don't lose the schema information on abort.
+ * To handle this, we maintain a list of XIDs (streamed_txns) for which we
+ * have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
@@ -70,6 +93,8 @@ typedef struct RelationSyncEntry
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,11 +120,17 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
 
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+
 /*
  * Specify output plugin callbacks
  */
@@ -115,16 +146,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +223,23 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			/* the value must be on/off */
+			if (strcmp(strVal(defel->arg), "on") != 0 && strcmp(strVal(defel->arg), "off") != 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid streaming value")));
+
+			/* enable streaming if it's 'on' */
+			*enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +252,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +276,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +297,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficiently new protocol
+		 * version, and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +328,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +391,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for this change. We don't
+	 * care whether it's a top-level transaction or not (we have already
+	 * sent that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +441,24 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +482,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * XXX May be called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,6 +503,10 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -406,7 +535,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +555,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +580,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +601,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +627,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +659,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +740,118 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * START STREAM callback
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char	   *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're now streaming a chunk of the transaction */
+	in_streaming = true;
+}
+
+/*
+ * STOP STREAM callback
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +888,39 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid to the rel sync entry's list of streamed transactions for
+ * which we have already sent the schema of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1050,45 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1123,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..655144d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,49 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
 
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/009_stream_simple.pl b/src/test/subscription/t/009_stream_simple.pl
new file mode 100644
index 0000000..2f01133
--- /dev/null
+++ b/src/test/subscription/t/009_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/010_stream_subxact.pl b/src/test/subscription/t/010_stream_subxact.pl
new file mode 100644
index 0000000..d2ae385
--- /dev/null
+++ b/src/test/subscription/t/010_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of a large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501, 1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001, 1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501, 2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/011_stream_ddl.pl b/src/test/subscription/t/011_stream_ddl.pl
new file mode 100644
index 0000000..0da39a1
--- /dev/null
+++ b/src/test/subscription/t/011_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of a large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check data including new columns was replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/012_stream_subxact_abort.pl b/src/test/subscription/t/012_stream_subxact_abort.pl
new file mode 100644
index 0000000..0fba368
--- /dev/null
+++ b/src/test/subscription/t/012_stream_subxact_abort.pl
@@ -0,0 +1,105 @@
+# Test streaming of a large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and subtransaction ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check rolled-back subtransaction changes are not replicated');
+
+# large (streamed) transaction with subscriber receiving out of order
+# subtransaction ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3001,3500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(4001,4500) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(5001,5500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(6001,6500) s(i);
+RELEASE s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(7001,7500) s(i);
+ROLLBACK TO s1;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1500|0), 'check rolled-back subtransaction changes are not replicated');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/013_stream_subxact_ddl_abort.pl b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..becbdd0
--- /dev/null
+++ b/src/test/subscription/t/013_stream_subxact_ddl_abort.pl
@@ -0,0 +1,84 @@
+# Test streaming of a large transaction with DDL and subtransaction rollbacks, exceeding logical_decoding_work_mem
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check streamed transaction with DDL and subxact rollback applied correctly');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/015_stream_binary.pl b/src/test/subscription/t/015_stream_binary.pl
new file mode 100644
index 0000000..fa2362e
--- /dev/null
+++ b/src/test/subscription/t/015_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of a simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d99046..500623e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -111,6 +111,7 @@ Append
 AppendPath
 AppendRelInfo
 AppendState
+ApplySubXactData
 Archive
 ArchiveEntryPtrType
 ArchiveFormat
@@ -2370,6 +2371,7 @@ StopList
 StopWorkersData
 StrategyNumber
 StreamCtl
+StreamXidHash
 StringInfo
 StringInfoData
 StripnullState
@@ -2380,6 +2382,7 @@ SubPlanState
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
+SubXactInfo
 SubXactEvent
 SubplanResultRelHashElem
 SubqueryScan
-- 
1.8.3.1

#503Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#502)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Aug 29, 2020 at 5:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

One more comment for which I haven't done anything yet.
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ entry->streamed_txns = lappend_int(entry->streamed_txns, xid);

Is it a good idea to append the xid with lappend_int? Won't we need
something equivalent for uint32? If so, I think we have a couple of
options: (a) use the lcons method and append a pointer to the xid (I
think we would need to allocate memory for the xid to use this idea),
or (b) use an array instead. What do you think?

BTW, OID is internally mapped to uint32, but using lappend_oid might
not look good. So maybe we can provide a lappend_uint32 option?
Using an array is also not a bad idea, but providing a lappend_uint32
option looks more appealing to me.

I thought about this again and I feel it might be okay to use it for
our case: after storing the value in a T_IntList, we primarily fetch
it for comparison with a TransactionId (uint32), so this shouldn't
create any problem. I feel we can discuss this in a separate thread
and check the opinion of others; what do you think?
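
To make the concern concrete, here is a minimal sketch (not from the
patch; it assumes only the standard pg_list.h API and the usual
includes) of the int <-> uint32 round-trip involved when an XID is
kept in an integer list. lfirst_int returns int, so the comparison
relies on the cast back to TransactionId being value-preserving:

/* minimal sketch: look up a TransactionId stored via lappend_int(xids, xid) */
static bool
xid_in_list(List *xids, TransactionId xid)
{
	ListCell   *lc;

	foreach(lc, xids)
	{
		/* the value was stored as int, so cast it back to uint32 */
		if ((TransactionId) lfirst_int(lc) == xid)
			return true;
	}

	return false;
}

If we ever provide lappend_uint32, the cast would simply move behind
that API.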

Another comment:

+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+ HASH_SEQ_STATUS hash_seq;
+ RelationSyncEntry *entry;
+
+ Assert(RelationSyncCache != NULL);
+
+ hash_seq_init(&hash_seq, RelationSyncCache);
+ while ((entry = hash_seq_search(&hash_seq)) != NULL)
+ {
+ if (is_commit)
+ entry->schema_sent = true;

How is it correct to set 'entry->schema_sent' for all the entries in
RelationSyncCache? Consider a case where, due to an invalidation from
an unrelated transaction, we have set schema_sent to 'false' for a
particular relation 'r1', and that transaction executes before the
current streamed transaction for which we are performing the commit
and calling this function. We would then set the flag for the
unrelated entry 'r1', which doesn't seem correct to me. Or, if this is
correct, it would be a good idea to write some comments about it.
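
One way to make this safer, as a rough sketch only (not from the
patch; it assumes list_member_int from pg_list.h), would be to flip
schema_sent only for entries whose schema was actually sent as part of
the now-committed streamed transaction:

/* hypothetical variant of the loop in cleanup_rel_sync_cache */
while ((entry = hash_seq_search(&hash_seq)) != NULL)
{
	/* only mark entries that saw this transaction's schema */
	if (is_commit && list_member_int(entry->streamed_txns, xid))
		entry->schema_sent = true;

	/* Remove the xid from the schema sent list. */
	entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
}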

--
With Regards,
Amit Kapila.

#504Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#503)
2 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Another comment:

+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+ HASH_SEQ_STATUS hash_seq;
+ RelationSyncEntry *entry;
+
+ Assert(RelationSyncCache != NULL);
+
+ hash_seq_init(&hash_seq, RelationSyncCache);
+ while ((entry = hash_seq_search(&hash_seq)) != NULL)
+ {
+ if (is_commit)
+ entry->schema_sent = true;

How is it correct to set 'entry->schema_sent' for all the entries in
RelationSyncCache? Consider a case where, due to an invalidation from
an unrelated transaction, we have set schema_sent to 'false' for a
particular relation 'r1', and that transaction executes before the
current streamed transaction for which we are performing the commit
and calling this function. We would then set the flag for the
unrelated entry 'r1', which doesn't seem correct to me. Or, if this is
correct, it would be a good idea to write some comments about it.

Few more comments:
1.
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
application_name=$appname' PUBLICATION tap_pub"
+);

In most of the tests, we are using the above statement to create a
subscription. Don't we need the (streaming = 'on') parameter while
creating a subscription? Is there a reason for not doing so in this
patch itself?

2.
009_stream_simple.pl
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});

How far above the 64kB limit is this data? I just want to make sure it
is not borderline, such that alignment differences could prevent
streaming on some machines (see the rough arithmetic below). Also, how
does such a test ensure that streaming has happened? The way we are
checking results, won't they be the same for the non-streaming case as
well?

3.
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how is this test relevant to streaming mode?

4. I have checked that in one of the previous patches, we have a test
v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
quite similar to what we have in
v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
If there are any differences that cover more scenarios, then can we
consider merging them into one test?
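
(Rough arithmetic for point 2: md5() returns a 32-character hex string,
so the INSERT alone adds 4998 rows, each carrying a 4-byte int key plus
a 32-byte text value, which comes to roughly 200kB of decoded changes
before the UPDATE and DELETE are even counted. That looks comfortably
above 64kB rather than borderline, but it would still be good to
confirm, and to have the test verify that streaming actually occurred.)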

Apart from the above, I have made a few changes in the attached patch,
mainly to simplify the code in one place, add/edit a few comments,
make some other cosmetic changes, and rename the test case files,
because the initial digits of their names clashed with other tests in
the same directory.

--
With Regards,
Amit Kapila.

Attachments:

v56-0001-Fix-the-SharedFileSetUnregister-API.patch (application/octet-stream)
From 6e762fe0ac13c8d7e560b712f4194381a2e9fcc7 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 28 Aug 2020 12:42:09 +0530
Subject: [PATCH v56 1/2] Fix the SharedFileSetUnregister API.

Commit 808e13b282 introduced a few APIs to extend the existing BufFile
interface. In SharedFileSetDeleteOnProcExit, it tries to delete the list
element while traversing the list with the 'foreach' construct, which makes
the behavior of list traversal unpredictable.
---
 src/backend/storage/file/sharedfileset.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 8b96e81fff..859c22e79b 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -266,12 +266,16 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 static void
 SharedFileSetDeleteOnProcExit(int status, Datum arg)
 {
-	ListCell   *l;
-
-	/* Loop over all the pending shared fileset entry */
-	foreach(l, filesetlist)
+	/*
+	 * Remove all the pending shared fileset entries. We don't use foreach()
+	 * here because SharedFileSetDeleteAll will remove the current element in
+	 * filesetlist. Even though SharedFileSetUnregister removes the element
+	 * from filesetlist with foreach_delete_current(), that can only fix up
+	 * the state of its own loop, not of this one.
+	 */
+	while (list_length(filesetlist) > 0)
 	{
-		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSet *fileset = (SharedFileSet *) linitial(filesetlist);
 
 		SharedFileSetDeleteAll(fileset);
 	}
@@ -301,7 +305,7 @@ SharedFileSetUnregister(SharedFileSet *input_fileset)
 		/* Remove the entry from the list */
 		if (input_fileset == fileset)
 		{
-			filesetlist = list_delete_cell(filesetlist, l);
+			filesetlist = foreach_delete_current(filesetlist, l);
 			return;
 		}
 	}
-- 
2.28.0.windows.1

v56-0002-Add-support-for-streaming-to-built-in-logical-re.patch (application/octet-stream)
From f595a1d1d19390e5878921a97026379e1a7433ab Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v56 2/2] Add support for streaming to built-in logical
 replication.

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions and to allow sending additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying it on commit.

We must, however, explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so we
have nowhere to send the data anyway.

Author: Tomas Vondra, Dilip Kumar and Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/monitoring.sgml                  |  16 +
 doc/src/sgml/ref/alter_subscription.sgml      |   5 +-
 doc/src/sgml/ref/create_subscription.sgml     |  11 +
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/commands/subscriptioncmds.c       |  46 +-
 src/backend/postmaster/pgstat.c               |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c       |   4 +
 src/backend/replication/logical/proto.c       | 162 ++-
 src/backend/replication/logical/worker.c      | 951 +++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c   | 346 ++++++-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/pgstat.h                          |   6 +-
 src/include/replication/logicalproto.h        |  42 +-
 src/include/replication/walreceiver.h         |   1 +
 src/test/subscription/t/015_stream_simple.pl  |  86 ++
 src/test/subscription/t/016_stream_subxact.pl | 102 ++
 src/test/subscription/t/017_stream_ddl.pl     |  95 ++
 .../t/018_stream_subxact_abort.pl             | 105 ++
 .../t/019_stream_subxact_ddl_abort.pl         |  85 ++
 src/test/subscription/t/020_stream_binary.pl  |  86 ++
 src/tools/pgindent/typedefs.list              |   3 +
 21 files changed, 2122 insertions(+), 46 deletions(-)
 create mode 100644 src/test/subscription/t/015_stream_simple.pl
 create mode 100644 src/test/subscription/t/016_stream_subxact.pl
 create mode 100644 src/test/subscription/t/017_stream_ddl.pl
 create mode 100644 src/test/subscription/t/018_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/019_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/020_stream_binary.pl

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 17a0df6978..7fa1d79cc0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1509,6 +1509,22 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>WALWrite</literal></entry>
       <entry>Waiting for a write to a WAL file.</entry>
      </row>
+     <row>
+      <entry><literal>LogicalChangesRead</literal></entry>
+      <entry>Waiting for a read from a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalChangesWrite</literal></entry>
+      <entry>Waiting for a write to a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactRead</literal></entry>
+      <entry>Waiting for a read from a logical subxact file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactWrite</literal></entry>
+      <entry>Waiting for a write to a logical subxact file.</entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70cdf..a1666b370b 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c54fe..b7d7457d00 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf0c6..311d46225a 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377a85..9426e1d84b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,11 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+	{
+		*streaming_given = false;
+		*streaming = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +200,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +353,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +378,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -427,6 +446,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
+	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -698,6 +718,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +729,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +762,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +786,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +831,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23614..5f4b168fd1 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "LogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "LogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "LogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "LogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 8afa5a29b4..ad574099ff 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -425,6 +425,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097bf5..f82236ed93 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,126 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+/*
+ * Write the information for the start stream message to the output stream.
+ */
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+/*
+ * Read the information about the start stream message from output stream.
+ */
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+/*
+ * Write the stop stream message to the output stream.
+ */
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+/*
+ * Write STREAM COMMIT to the output stream.
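+ *
+ * On the wire (matching the code below): byte 'c', int32 xid, uint8 flags
+ * (currently always zero), then three int64 fields: commit_lsn, end_lsn
+ * and commit_time.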
+ */
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read STREAM COMMIT from the output stream.
+ */
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+/*
+ * Write STREAM ABORT to the output stream. Note that xid and subxid will be
+ * the same for a toplevel transaction abort.
+ */
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+/*
+ * Read STREAM ABORT from the output stream.
+ */
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b576e342cb..f022b81e76 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,45 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately; instead, the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions has
+ * to deal with aborts of both the toplevel transaction and subtransactions.
+ * This is achieved by tracking offsets for subtransactions, which are then
+ * used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-files directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere with each other.
+ *
+ * We use BufFiles instead of normal temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size limit,
+ * (b) it provides a way for automatic cleanup on error, and (c) it allows the
+ * files to survive across local transactions, so they can be opened and
+ * closed at stream start and stop. We decided to use the SharedFileSet
+ * infrastructure because without it a file is deleted as soon as it is
+ * closed, and keeping the stream files open across start/stop of streams
+ * would consume a lot of memory (more than 8kB for each BufFile, and there
+ * could be many such BufFiles, as the subscriber may receive multiple
+ * start/stop streams for different transactions before getting a commit).
+ * Moreover, without SharedFileSet we would also need to invent a new way to
+ * pass filenames to the BufFile APIs so that we could reopen the desired
+ * file across multiple stream-open calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +67,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +99,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +109,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +138,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry. Whenever we see a new xid we create this entry in
+ * the xidhash, and along with it create the streaming file and store the
+ * fileset handle. The subxact file is created iff there is any subxact info
+ * under this xid. This entry is used on subsequent streams for the xid to
+ * get the corresponding fileset handles, so storing them in a hash makes
+ * the search faster.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +166,65 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with shared file
+ * set for streaming and subxact files.
+ */
+static HTAB *xidhash = NULL;
+/* BufFile handle of the current streaming file */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+/* Sub-transaction data for the current streaming transaction */
+typedef struct ApplySubXactData
+{
+	uint32		nsubxacts;		/* number of sub-transactions */
+	uint32		nsubxacts_max;	/* current capacity of subxacts */
+	TransactionId subxact_last; /* xid of the last sub-transaction */
+	SubXactInfo *subxacts;		/* sub-xact offset in changes file */
+} ApplySubXactData;
+
+static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +296,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,16 +757,335 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside streaming transaction or inside
+	 * remote transaction and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if (!in_streamed_transaction &&
+		(!in_remote_transaction ||
+		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
+/*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be
+	 * committed on stream stop. We need the transaction for handling the
+	 * BufFile, used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/*
+	 * Initialize the xidhash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing subxact file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * We can't use binary search here, as subxact XIDs won't
+		 * necessarily arrive in sorted order; consider the case where we
+		 * have released the savepoints of multiple subtransactions and then
+		 * performed a rollback to the savepoint of one of the earlier
+		 * sub-transactions.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		for (i = subxact_data.nsubxacts; i > 0; i--)
+		{
+			if (subxact_data.subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxact_data.subxacts[subidx].fileno,
+							  subxact_data.subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		subxact_data.nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	/*
+	 * Allocate file handle and memory required to process all the messages in
+	 * TopTransactionContext to avoid them getting reset after each message is
+	 * processed.
+	 */
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file \"%s\"", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the handler functions called via apply_dispatch are aware
+	 * we're in a remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
 /*
  * Handle RELATION message.
  *
@@ -635,6 +1099,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1117,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1156,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1274,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1426,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1799,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1940,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2068,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when the streaming mode
+	 * is enabled. This context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2180,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1938,6 +2444,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->name, MySubscription->name) != 0 ||
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
+		newsub->stream != MySubscription->stream ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -1979,6 +2486,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
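+ *
+ * For illustration (this is just what the BufFileWrite calls below emit),
+ * the file layout is:
+ *
+ *   uint32      nsubxacts                number of entries
+ *   SubXactInfo subxacts[nsubxacts]      {xid, fileno, offset} triples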
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for the top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (subxact_data.nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			SharedFileSetDeleteAll(ent->subxact_fileset);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+		return;
+	}
+
+	subxact_filename(path, subid, xid);
+
+	/*
+	 * Create the subxact file if it is not already created; otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so we allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &subxact_data.nsubxacts, sizeof(subxact_data.nsubxacts));
+	BufFileWrite(fd, subxact_data.subxacts, len);
+
+	BufFileClose(fd);
+
+	/* free the memory allocated for subxact info */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the structure subxact_data that can be
+ * used later.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxact_data.subxacts);
+	Assert(subxact_data.nsubxacts == 0);
+	Assert(subxact_data.nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &subxact_data.nsubxacts,
+					sizeof(subxact_data.nsubxacts)) !=
+		sizeof(subxact_data.nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	subxact_data.nsubxacts_max = 1 << my_log2(subxact_data.nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context. We need
+	 * this information for the whole duration of the stream so that we can
+	 * add new subtransaction info to it. On stream stop we flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxact_data.subxacts = palloc(subxact_data.nsubxacts_max *
+								   sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxact_data.subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	SubXactInfo *subxacts = subxact_data.subxacts;
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_data.subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_data.subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the subxacts array. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = subxact_data.nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (subxact_data.nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		subxact_data.nsubxacts_max = 128;
+
+		/*
+		 * Allocate this memory for subxacts in per-stream context, see
+		 * subxact_info_read.
+		 */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (subxact_data.nsubxacts == subxact_data.nsubxacts_max)
+	{
+		subxact_data.nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts,
+							subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[subxact_data.nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd,
+				&subxacts[subxact_data.nsubxacts].fileno,
+				&subxacts[subxact_data.nsubxacts].offset);
+
+	subxact_data.nsubxacts++;
+	subxact_data.subxacts = subxacts;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
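+	/* e.g., subid 16389 and xid 533 give "16389-533.changes" (made-up values) */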
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	SharedFileSetDeleteAll(ent->stream_fileset);
+	pfree(ent->stream_fileset);
+	ent->stream_fileset = NULL;
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		SharedFileSetDeleteAll(ent->subxact_fileset);
+		pfree(ent->subxact_fileset);
+		ent->subxact_fileset = NULL;
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open a file that we'll use to serialize changes for a toplevel
+ * transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buffile, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER | HASH_FIND,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * Create/open the buffiles under the logical streaming context so that we
+	 * have those files until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so we allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with a length (not including
+ * the length field itself), an action code (identifying the message type),
+ * and the message contents (without the subxact TransactionId value).
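+ *
+ * For illustration, a change whose remaining payload (everything after the
+ * subxact XID) is N bytes is written as:
+ *
+ *   int   len      N + 1 (the action byte plus payload; len itself excluded)
+ *   char  action   message type, e.g. 'I', 'U' or 'D'
+ *   char  data[N]  the message contents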
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxact_data.subxacts)
+		pfree(subxact_data.subxacts);
+
+	subxact_data.subxacts = NULL;
+	subxact_data.subxact_last = InvalidTransactionId;
+	subxact_data.nsubxacts = 0;
+	subxact_data.nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2151,6 +3091,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc4c1..bf4c277ebf 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,17 +47,40 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from
+ * the order the transactions are sent in. Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort. To handle this,
+ * we maintain a list of xids (streamed_txns) for which we have already sent
+ * the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
@@ -70,6 +93,8 @@ typedef struct RelationSyncEntry
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,10 +120,15 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
 
 /*
  * Specify output plugin callbacks
@@ -115,16 +145,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +222,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			*enable_streaming = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +244,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +268,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +289,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version,
+		 * and when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +320,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +383,41 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +433,24 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +474,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * This is called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,10 +495,19 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
 
 	if (!is_publishable_relation(relation))
 		return;
 
+	/*
+	 * Remember the xid for the change in streaming mode. We need to send
+	 * the xid with each change so that the subscriber can associate the
+	 * changes with the transaction and, on abort, discard them.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
 	relentry = get_rel_sync_entry(data, RelationGetRelid(relation));
 
 	/* First check the table filter */
@@ -406,7 +532,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +552,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +577,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +598,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +624,11 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +657,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -605,6 +737,118 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 	rel_sync_cache_publication_cb(arg, cacheid, hashvalue);
 }
 
+/*
+ * START STREAM callback
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char	   *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * STOP STREAM callback
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -641,6 +885,39 @@ init_rel_sync_cache(MemoryContext cachectx)
 								  (Datum) 0);
 }
 
+/*
+ * We expect a relatively small number of streamed transactions.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+
+	foreach (lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record in the relation sync entry that we have already sent the schema
+ * of the relation in the transaction with the given xid.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * Find or create entry in the relation schema cache.
  *
@@ -771,11 +1048,44 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
+/*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		if (is_commit)
+			entry->schema_sent = true;
+
+		/* Remove the xid from the schema sent list. */
+		entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
+	}
+}
+
 /*
  * Relcache invalidation callback
  */
@@ -811,7 +1121,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35000..1d091546bf 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1edf..0dfbac46b4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc85c..53905ee608 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
 
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbee54..6c0a4e30a8 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/015_stream_simple.pl b/src/test/subscription/t/015_stream_simple.pl
new file mode 100644
index 0000000000..2f01133f69
--- /dev/null
+++ b/src/test/subscription/t/015_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/016_stream_subxact.pl b/src/test/subscription/t/016_stream_subxact.pl
new file mode 100644
index 0000000000..4d1abc230d
--- /dev/null
+++ b/src/test/subscription/t/016_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/017_stream_ddl.pl b/src/test/subscription/t/017_stream_ddl.pl
new file mode 100644
index 0000000000..0da39a1a8a
--- /dev/null
+++ b/src/test/subscription/t/017_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/018_stream_subxact_abort.pl b/src/test/subscription/t/018_stream_subxact_abort.pl
new file mode 100644
index 0000000000..0fba36880e
--- /dev/null
+++ b/src/test/subscription/t/018_stream_subxact_abort.pl
@@ -0,0 +1,105 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+# large (streamed) transaction with subscriber receiving out of order
+# subtransaction ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3001,3500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(4001,4500) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(5001,5500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(6001,6500) s(i);
+RELEASE s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(7001,7500) s(i);
+ROLLBACK TO s1;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1500|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/019_stream_subxact_ddl_abort.pl b/src/test/subscription/t/019_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..c91abf7bf6
--- /dev/null
+++ b/src/test/subscription/t/019_stream_subxact_ddl_abort.pl
@@ -0,0 +1,85 @@
+# Test streaming of large transaction with subtransactions, DDLs, DMLs, and
+# rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/020_stream_binary.pl b/src/test/subscription/t/020_stream_binary.pl
new file mode 100644
index 0000000000..51ae6b02a6
--- /dev/null
+++ b/src/test/subscription/t/020_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d990463ce..500623e230 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -111,6 +111,7 @@ Append
 AppendPath
 AppendRelInfo
 AppendState
+ApplySubXactData
 Archive
 ArchiveEntryPtrType
 ArchiveFormat
@@ -2370,6 +2371,7 @@ StopList
 StopWorkersData
 StrategyNumber
 StreamCtl
+StreamXidHash
 StringInfo
 StringInfoData
 StripnullState
@@ -2380,6 +2382,7 @@ SubPlanState
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
 SubXactEvent
+SubXactInfo
 SubplanResultRelHashElem
 SubqueryScan
-- 
2.28.0.windows.1

#505Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#504)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

2.
009_stream_simple.pl
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});

How far above the 64kB limit is this data? I just want to make sure it
is not borderline, such that due to some alignment issues the streaming
doesn't happen on some machines.

I think we should find similar information for other tests added by
the patch as well.

Few other comments:
===================
1.
+sub wait_for_caught_up
+{
+ my ($node, $appname) = @_;
+
+ $node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication
WHERE application_name = '$appname';"
+ ) or die "Timed ou

The patch has added this in all the test files. If it is used in so
many tests then we need to add it in some generic place
(PostgresNode.pm), but actually, I am not sure we need this at all. Why
can't the existing wait_for_catchup in PostgresNode.pm serve the same
purpose?
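
For instance, each test could presumably be reduced to something like
this (untested sketch, assuming the helper's default of waiting for
replay of the current WAL location fits these tests):

# hypothetical replacement for the per-test helper
$node_publisher->wait_for_catchup($appname);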

2.
In system_views.sql,

-- All columns of pg_subscription except subconninfo are readable.
REVOKE ALL ON pg_subscription FROM public;
GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary,
subslotname, subpublications)
ON pg_subscription TO public;

Here, we need to update this for the substream column as well.
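
Presumably something like this (sketch only; the exact column list must
match the rebased catalog definition):

GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary,
              substream, subslotname, subpublications)
ON pg_subscription TO public;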

3. Update describeSubscriptions() to show the 'substream' value in \dRs.
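
Perhaps mirroring how the 'subbinary' column is shown there, e.g.
(untested sketch; the version check is an assumption):

/* in describeSubscriptions(), next to the subbinary column */
if (pset.sversion >= 140000)
    appendPQExpBuffer(&buf,
                      ", substream AS \"%s\"\n",
                      gettext_noop("Streaming"));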

4. Also, let's add a few tests in subscription.sql, as we did for the
'binary' option in commit 9de77b5453.
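
For example, along these lines (hypothetical object names, mirroring
the binary-option tests):

-- fail - streaming must be a boolean
CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist'
    PUBLICATION testpub WITH (connect = false, streaming = foo);

-- ok, then toggle it back off
CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist'
    PUBLICATION testpub WITH (connect = false, streaming = true);
ALTER SUBSCRIPTION regress_testsub SET (streaming = false);

-- cleanup (the slot was never created, so detach it before dropping)
ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
DROP SUBSCRIPTION regress_testsub;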

5. I think we can merge the pg_dump related changes (the last version
posted in the mail thread is v53-0005-Add-streaming-option-in-pg_dump)
into the main patch. One minor comment on the pg_dump related changes:
@@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
if (strcmp(subinfo->subbinary, "t") == 0)
appendPQExpBuffer(query, ", binary = true");

+ if (strcmp(subinfo->substream, "f") != 0)
+ appendPQExpBuffer(query, ", streaming = on");
  if (strcmp(subinfo->subsynccommit, "off") != 0)
  appendPQExpBuffer(query, ", synchronous_commit = %s",
fmtId(subinfo->subsynccommit));

Keep one blank line between the substream and subsynccommit option code
to keep it consistent with the nearby code.

--
With Regards,
Amit Kapila.

#506Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#504)
3 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Another comment:

+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+ HASH_SEQ_STATUS hash_seq;
+ RelationSyncEntry *entry;
+
+ Assert(RelationSyncCache != NULL);
+
+ hash_seq_init(&hash_seq, RelationSyncCache);
+ while ((entry = hash_seq_search(&hash_seq)) != NULL)
+ {
+ if (is_commit)
+ entry->schema_sent = true;

How is it correct to set 'entry->schema_sent' for all the entries in
RelationSyncCache? Consider a case where due to invalidation in an
unrelated transaction we have set the flag schema_sent for a
particular relation 'r1' as 'false' and that transaction is executed
before the current streamed transaction for which we are performing
commit and called this function. It will set the flag for unrelated
entry in this case 'r1' which doesn't seem correct to me. Or, if this
is correct, it would be a good idea to write some comments about it.

Yeah, this is wrong, I have fixed this issue in the attached patch
and also added a new test for the same.

Few more comments:
1.
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
application_name=$appname' PUBLICATION tap_pub"
+);

In most of the tests, we are using the above statement to create a
subscription. Don't we need the (streaming = 'on') parameter while
creating a subscription? Is there a reason for not doing so in this
patch itself?

I have changed this.

2.
009_stream_simple.pl
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});

How far above the 64kB limit is this data? I just want to make sure it
is not borderline, such that due to some alignment issues the streaming
doesn't happen on some machines. Also, how does such a test ensure
that streaming has happened? The way we are checking results, won't it
be the same for the non-streaming case as well?

Only for this case, or you mean for all the tests?

3.
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how is this test relevant to streaming mode?

I agree, it is not specific to streaming.

4. I have checked that in one of the previous patches, we have a test
v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
quite similar to what we have in
v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
If there is any difference that covers more scenarios, then can we
consider merging them into one test?

I will have a look.

Apart from the above, I have made a few changes in the attached patch,
mainly to simplify the code in one place, add/edit a few comments, and
make some other cosmetic changes; I have also renamed the test case
files, as the initial numbers of their names clashed with other tests
in the same directory.

Changes look fine to me except this

+

+ /* the value must be on/off */
+ if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid streaming value")));
+
+ /* enable streaming if it's 'on' */
+ *enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);

I mean, for streaming, why do we need to handle it differently from the
other surrounding code, for example the "binary" option?
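
Something like the following should suffice, i.e. just rely on
defGetBoolean() the way the binary option does (sketch only; the
variable names here are assumptions):

else if (strcmp(defel->defname, "streaming") == 0)
{
    if (streaming_given)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("conflicting or redundant options")));
    streaming_given = true;

    /* defGetBoolean() already accepts on/off, true/false, 0/1 */
    *streaming = defGetBoolean(defel);
}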

Apart from that, for testing 0001, I have added a new test in the
attached contrib patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v57-0001-Fix-the-SharedFileSetUnregister-API.patch
From 0a2419b4c1f3ef701e7911881dfa78340bc85b37 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 28 Aug 2020 12:42:09 +0530
Subject: [PATCH v57 1/2] Fix the SharedFileSetUnregister API.

Commit 808e13b282 introduced a few APIs to extend the existing BufFile
interface. In SharedFileSetDeleteOnProcExit, it tries to delete the list
element while traversing the list with the 'foreach' construct, which
makes the behavior of the list traversal unpredictable.
---
 src/backend/storage/file/sharedfileset.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 8b96e81..859c22e 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -266,12 +266,16 @@ SharedFileSetOnDetach(dsm_segment *segment, Datum datum)
 static void
 SharedFileSetDeleteOnProcExit(int status, Datum arg)
 {
-	ListCell   *l;
-
-	/* Loop over all the pending shared fileset entry */
-	foreach(l, filesetlist)
+	/*
+	 * Remove all the pending shared fileset entries. We don't use foreach()
+	 * here because SharedFileSetDeleteAll will remove the current element in
+	 * filesetlist. Though foreach_delete_current() is used to remove the
+	 * element from filesetlist (see SharedFileSetUnregister), it can only
+	 * fix up the state of that loop, not of this one.
+	 */
+	while (list_length(filesetlist) > 0)
 	{
-		SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+		SharedFileSet *fileset = (SharedFileSet *) linitial(filesetlist);
 
 		SharedFileSetDeleteAll(fileset);
 	}
@@ -301,7 +305,7 @@ SharedFileSetUnregister(SharedFileSet *input_fileset)
 		/* Remove the entry from the list */
 		if (input_fileset == fileset)
 		{
-			filesetlist = list_delete_cell(filesetlist, l);
+			filesetlist = foreach_delete_current(filesetlist, l);
 			return;
 		}
 	}
-- 
1.8.3.1

v57-0002-Add-support-for-streaming-to-built-in-logical-re.patch
From a5907c56db7f6415be33be8bc371380871e524ce Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 20 Jul 2020 10:37:49 +0530
Subject: [PATCH v57 2/2] Add support for streaming to built-in logical
 replication.

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying it on commit.

We must, however, explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover we
don't have a replication connection open, so we have nowhere to send
the data anyway.

Author: Tomas Vondra, Dilip Kumar and Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/monitoring.sgml                       |  16 +
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/commands/subscriptioncmds.c            |  46 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 162 +++-
 src/backend/replication/logical/worker.c           | 951 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 366 +++++++-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  42 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/subscription/t/015_stream_simple.pl       |  86 ++
 src/test/subscription/t/016_stream_subxact.pl      | 102 +++
 src/test/subscription/t/017_stream_ddl.pl          |  95 ++
 .../subscription/t/018_stream_subxact_abort.pl     | 105 +++
 .../subscription/t/019_stream_subxact_ddl_abort.pl |  85 ++
 src/test/subscription/t/020_stream_binary.pl       |  86 ++
 src/test/subscription/t/021_stream_schema.pl       |  80 ++
 src/tools/pgindent/typedefs.list                   |   3 +
 22 files changed, 2222 insertions(+), 46 deletions(-)
 create mode 100644 src/test/subscription/t/015_stream_simple.pl
 create mode 100644 src/test/subscription/t/016_stream_subxact.pl
 create mode 100644 src/test/subscription/t/017_stream_ddl.pl
 create mode 100644 src/test/subscription/t/018_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/019_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/020_stream_binary.pl
 create mode 100644 src/test/subscription/t/021_stream_schema.pl

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 17a0df6..7fa1d79 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1509,6 +1509,22 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>WALWrite</literal></entry>
       <entry>Waiting for a write to a WAL file.</entry>
      </row>
+     <row>
+      <entry><literal>LogicalChangesRead</literal></entry>
+      <entry>Waiting for a read from a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalChangesWrite</literal></entry>
+      <entry>Waiting for a write to a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactRead</literal></entry>
+      <entry>Waiting for a read from a logical subxact file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactWrite</literal></entry>
+      <entry>Waiting for a write to a logical subxact file.</entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a1666b3 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..9426e1d 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,11 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+	{
+		*streaming_given = false;
+		*streaming = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +200,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +353,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +378,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -427,6 +446,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
+	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -698,6 +718,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +729,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +762,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +786,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +831,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23..5f4b168 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "LogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "LogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "LogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "LogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 8afa5a2..ad57409 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -425,6 +425,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..f82236e 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,126 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+/*
+ * Write the information for the start stream message to the output stream.
+ */
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+/*
+ * Read the information about the start stream message from the output stream.
+ */
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+/*
+ * Write the stop stream message to the output stream.
+ */
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+/*
+ * Write STREAM COMMIT to the output stream.
+ */
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read STREAM COMMIT from the output stream.
+ */
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+/*
+ * Write STREAM ABORT to the output stream. Note that xid and subxid will be
+ * the same for a top-level transaction abort.
+ */
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+/*
+ * Read STREAM ABORT from the output stream.
+ */
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b576e34..f022b81 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,45 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately; instead, their data is written to
+ * temporary files and applied all at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * also has to deal with aborts of both the toplevel transaction and of
+ * individual subtransactions. This is achieved by tracking offsets for
+ * subtransactions, which are then used to truncate the file with the
+ * serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing
+ * a remote transaction with the same XID don't interfere with each other.
+ *
+ * We use BufFiles instead of plain temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides automatic cleanup on error, and (c) it allows the
+ * files to survive across local transactions so they can be opened and
+ * closed at stream start and stop. We decided to use the SharedFileSet
+ * infrastructure because without it the files would be deleted as soon as
+ * they are closed, and keeping the stream files open across start/stop
+ * would consume a lot of memory (more than 8kB for each BufFile, and there
+ * could be multiple such BufFiles, as the subscriber may receive multiple
+ * start/stop streams for different transactions before getting the commit).
+ * Moreover, without SharedFileSet we would also need to invent a new way to
+ * pass filenames to the BufFile APIs, so that we could reopen the desired
+ * file across multiple stream-open calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +67,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +99,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +109,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +138,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry. Whenever we see a new xid, we create this entry in
+ * the xidhash, create the streaming file, and store the fileset handle. The
+ * subxact file is created only if there is subxact info under this xid. The
+ * entry is looked up on subsequent streams for the xid to get the fileset
+ * handles, so storing them in a hash makes the search faster.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +166,65 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with shared file
+ * set for streaming and subxact files.
+ */
+static HTAB *xidhash = NULL;
+/* BufFile handle of the current streaming file */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+/* Sub-transaction data for the current streaming transaction */
+typedef struct ApplySubXactData
+{
+	uint32		nsubxacts;		/* number of sub-transactions */
+	uint32		nsubxacts_max;	/* current capacity of subxacts */
+	TransactionId subxact_last; /* xid of the last sub-transaction */
+	SubXactInfo *subxacts;		/* sub-xact offset in changes file */
+} ApplySubXactData;
+
+static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +296,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the message to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +757,336 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside streaming transaction or inside
+	 * remote transaction and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if (!in_streamed_transaction &&
+		(!in_remote_transaction ||
+		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * at stream stop. We need the transaction for handling the BufFile, used
+	 * for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/*
+	 * Initialize the xidhash table if we haven't yet. This will be used for
+	 * Initialize the xidhash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker, so create it in a permanent
+	 * context.
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing subxact file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM ABORT message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * We can't use binary search here, as the subxact XIDs won't
+		 * necessarily arrive in sorted order. Consider the case where we
+		 * have released the savepoint for multiple subtransactions and then
+		 * performed a rollback to savepoint for one of the earlier
+		 * sub-transactions.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		for (i = subxact_data.nsubxacts; i > 0; i--)
+		{
+			if (subxact_data.subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxact_data.subxacts[subidx].fileno,
+							  subxact_data.subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		subxact_data.nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	/*
+	 * Allocate the file handle and memory required to process all the
+	 * messages in TopTransactionContext, to avoid them being freed when
+	 * ApplyMessageContext is reset after each message is processed.
+	 */
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file \"%s\"", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the handler methods called via apply_dispatch know we're in
+	 * a remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update the origin state so we can restart streaming from the correct
+	 * position in case of a crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1099,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1117,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1156,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1274,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1426,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1799,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1940,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2068,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled. The context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2180,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1938,6 +2444,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->name, MySubscription->name) != 0 ||
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
+		newsub->stream != MySubscription->stream ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -1979,6 +2486,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there is no subtransaction then there is nothing to do, but if we
+	 * already have a subxact file then delete it.
+	 */
+	if (subxact_data.nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			SharedFileSetDeleteAll(ent->subxact_fileset);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+		return;
+	}
+
+	subxact_filename(path, subid, xid);
+
+	/*
+	 * Create the subxact file if it does not exist yet, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &subxact_data.nsubxacts, sizeof(subxact_data.nsubxacts));
+	BufFileWrite(fd, subxact_data.subxacts, len);
+
+	BufFileClose(fd);
+
+	/* free the memory allocated for subxact info */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the structure subxact_data that can be
+ * used later.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxact_data.subxacts);
+	Assert(subxact_data.nsubxacts == 0);
+	Assert(subxact_data.nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &subxact_data.nsubxacts,
+					sizeof(subxact_data.nsubxacts)) !=
+		sizeof(subxact_data.nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	subxact_data.nsubxacts_max = 1 << my_log2(subxact_data.nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context. We need
+	 * this information for the whole duration of the stream so that we can
+	 * add subtransaction info to it. On stream stop we flush this information
+	 * to the subxact file and reset the logical streaming context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxact_data.subxacts = palloc(subxact_data.nsubxacts_max *
+								   sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxact_data.subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	SubXactInfo *subxacts = subxact_data.subxacts;
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_data.subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_data.subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts.
+	 * We intentionally scan the array from the tail, as we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = subxact_data.nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (subxact_data.nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		subxact_data.nsubxacts_max = 128;
+
+		/*
+		 * Allocate this memory for subxacts in per-stream context, see
+		 * subxact_info_read.
+		 */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (subxact_data.nsubxacts == subxact_data.nsubxacts_max)
+	{
+		subxact_data.nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts,
+							subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[subxact_data.nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd,
+				&subxacts[subxact_data.nsubxacts].fileno,
+				&subxacts[subxact_data.nsubxacts].offset);
+
+	subxact_data.nsubxacts++;
+	subxact_data.subxacts = subxacts;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	SharedFileSetDeleteAll(ent->stream_fileset);
+	pfree(ent->stream_fileset);
+	ent->stream_fileset = NULL;
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		SharedFileSetDeleteAll(ent->subxact_fileset);
+		pfree(ent->subxact_fileset);
+		ent->subxact_fileset = NULL;
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open a file that we'll use to serialize changes for a toplevel
+ * transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buffile, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER | HASH_FIND,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * Create/open the buffiles under the logical streaming context, so that
+	 * we keep those files until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length (not counting the
+ * length field itself), an action code (identifying the message type) and
+ * the message contents (without the subxact TransactionId value).
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxact_data.subxacts)
+		pfree(subxact_data.subxacts);
+
+	subxact_data.subxacts = NULL;
+	subxact_data.subxact_last = InvalidTransactionId;
+	subxact_data.nsubxacts = 0;
+	subxact_data.nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2151,6 +3091,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..4639518 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,17 +47,40 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is however updated only at commit time,
+ * and with streamed transactions the commit order may be different from the
+ * order in which the transactions are sent. Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort. To handle this, we
+ * maintain a list of xids (streamed_txns) for which we have already sent the
+ * schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
@@ -70,6 +93,8 @@ typedef struct RelationSyncEntry
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,10 +120,15 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
 
 /*
  * Specify output plugin callbacks
@@ -115,16 +145,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +222,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			*enable_streaming = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +244,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +268,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +289,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +320,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +383,45 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
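+	/* determine the XID of the toplevel transaction */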
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 *
+	 * XXX There is scope for an optimization here: if the
+	 * relentry->schema_sent flag is already true, we could avoid sending
+	 * the schema again even in the streaming case.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +437,24 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +478,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * This is called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,10 +499,19 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
 
 	if (!is_publishable_relation(relation))
 		return;
 
+	/*
+	 * Remember the xid for the change in streaming mode. We need to send the
+	 * xid with each change in streaming mode so that the subscriber can
+	 * associate the changes with the transaction and, on abort, discard the
+	 * corresponding changes.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
 	relentry = get_rel_sync_entry(data, RelationGetRelid(relation));
 
 	/* First check the table filter */
@@ -406,7 +536,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +556,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +581,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +602,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +628,11 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +661,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +742,118 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * START STREAM callback
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
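+	/* the last argument indicates whether this is the first stream for this xid */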
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char	   *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * STOP STREAM callback
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +890,39 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema for this relation was already sent within the
+ * given streamed transaction. We expect a relatively small number of
+ * streamed transactions, so a simple list search is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Record the xid of a streamed transaction within which we have already
+ * sent the schema of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
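+	/* allocate the list in CacheMemoryContext, the same context as the entry */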
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1052,61 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Clean up the list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+	ListCell	*lc;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		/*
+		 * Look for the xid in the streamed_txns list. If it is present, the
+		 * schema of this relation was sent within that streamed transaction.
+		 * So if the transaction committed, the subscriber must have the
+		 * relation schema, and we can set the schema_sent flag. Also remove
+		 * the xid entry from the streamed_txns list so that the list doesn't
+		 * grow too big.
+		 */
+		foreach(lc, entry->streamed_txns)
+		{
+			if (lfirst_int(lc) == xid)
+			{
+				if (is_commit)
+					entry->schema_sent = true;
+
+				entry->streamed_txns =
+					foreach_delete_current(entry->streamed_txns, lc);
+				break;
+			}
+		}
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1141,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming of in-progress transactions */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..53905ee 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
 
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/015_stream_simple.pl b/src/test/subscription/t/015_stream_simple.pl
new file mode 100644
index 0000000..86e3637
--- /dev/null
+++ b/src/test/subscription/t/015_stream_simple.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=on)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/016_stream_subxact.pl b/src/test/subscription/t/016_stream_subxact.pl
new file mode 100644
index 0000000..853a08d
--- /dev/null
+++ b/src/test/subscription/t/016_stream_subxact.pl
@@ -0,0 +1,102 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=on)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/017_stream_ddl.pl b/src/test/subscription/t/017_stream_ddl.pl
new file mode 100644
index 0000000..f09a9fa
--- /dev/null
+++ b/src/test/subscription/t/017_stream_ddl.pl
@@ -0,0 +1,95 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=on)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/018_stream_subxact_abort.pl b/src/test/subscription/t/018_stream_subxact_abort.pl
new file mode 100644
index 0000000..9550cce
--- /dev/null
+++ b/src/test/subscription/t/018_stream_subxact_abort.pl
@@ -0,0 +1,105 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=on)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+# large (streamed) transaction with subscriber receiving out of order
+# subtransaction ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3001,3500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(4001,4500) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(5001,5500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(6001,6500) s(i);
+RELEASE s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(7001,7500) s(i);
+ROLLBACK TO s1;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1500|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/019_stream_subxact_ddl_abort.pl b/src/test/subscription/t/019_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..085dc02
--- /dev/null
+++ b/src/test/subscription/t/019_stream_subxact_ddl_abort.pl
@@ -0,0 +1,85 @@
+# Test streaming of large transaction with subtransactions, DDLs, DMLs, and
+# rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=on)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/020_stream_binary.pl b/src/test/subscription/t/020_stream_binary.pl
new file mode 100644
index 0000000..51ae6b0
--- /dev/null
+++ b/src/test/subscription/t/020_stream_binary.pl
@@ -0,0 +1,86 @@
+# Test streaming of simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/021_stream_schema.pl b/src/test/subscription/t/021_stream_schema.pl
new file mode 100644
index 0000000..1880607
--- /dev/null
+++ b/src/test/subscription/t/021_stream_schema.pl
@@ -0,0 +1,80 @@
+# Test behavior with streaming transaction exceeding logical_decoding_work_mem
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';"
+	) or die "Timed out while waiting for subscriber to catch up";
+}
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+wait_for_caught_up($node_publisher, $appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(3,3000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+COMMIT;
+});
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(3001,3005) s(i);
+COMMIT;
+});
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(3005|3003|5), 'check the data inserted into the new column is reflected');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d99046..500623e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -111,6 +111,7 @@ Append
 AppendPath
 AppendRelInfo
 AppendState
+ApplySubXactData
 Archive
 ArchiveEntryPtrType
 ArchiveFormat
@@ -2370,6 +2371,7 @@ StopList
 StopWorkersData
 StrategyNumber
 StreamCtl
+StreamXidHash
 StringInfo
 StringInfoData
 StripnullState
@@ -2380,6 +2382,7 @@ SubPlanState
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
 SubXactEvent
+SubXactInfo
 SubplanResultRelHashElem
 SubqueryScan
-- 
1.8.3.1

Attachment: v2-0001-bufile_test.patch (application/octet-stream)
From dffb1316a43fc5a7e21964041c9e6cbddb8be2e4 Mon Sep 17 00:00:00 2001
From: dilip kumar <dilipbalaut@localhost.localdomain>
Date: Tue, 18 Aug 2020 13:44:53 +0530
Subject: [PATCH v2] bufile_test

---
 contrib/buffile_test/.gitignore            |   4 +
 contrib/buffile_test/Makefile              |  22 +++++
 contrib/buffile_test/buffile_test--1.0.sql |  13 +++
 contrib/buffile_test/buffile_test.c        | 132 +++++++++++++++++++++++++++++
 contrib/buffile_test/buffile_test.control  |   5 ++
 5 files changed, 176 insertions(+)
 create mode 100644 contrib/buffile_test/.gitignore
 create mode 100644 contrib/buffile_test/Makefile
 create mode 100644 contrib/buffile_test/buffile_test--1.0.sql
 create mode 100644 contrib/buffile_test/buffile_test.c
 create mode 100644 contrib/buffile_test/buffile_test.control

diff --git a/contrib/buffile_test/.gitignore b/contrib/buffile_test/.gitignore
new file mode 100644
index 0000000..5dcb3ff
--- /dev/null
+++ b/contrib/buffile_test/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/contrib/buffile_test/Makefile b/contrib/buffile_test/Makefile
new file mode 100644
index 0000000..96da192
--- /dev/null
+++ b/contrib/buffile_test/Makefile
@@ -0,0 +1,22 @@
+# contrib/buffile_test/Makefile
+
+MODULE_big	= buffile_test
+OBJS = \
+	$(WIN32RES) \
+	buffile_test.o
+
+EXTENSION = buffile_test
+DATA = buffile_test--1.0.sql
+PGFILEDESC = "buffile_test"
+
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/buffile_test
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/buffile_test/buffile_test--1.0.sql b/contrib/buffile_test/buffile_test--1.0.sql
new file mode 100644
index 0000000..6305f3e
--- /dev/null
+++ b/contrib/buffile_test/buffile_test--1.0.sql
@@ -0,0 +1,13 @@
+/* contrib/buffile_test/buffile_test--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION buffile_test" to load this file. \quit
+
+--
+-- buffile_test()
+--
+CREATE FUNCTION buffile_test()
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'buffile_test'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
diff --git a/contrib/buffile_test/buffile_test.c b/contrib/buffile_test/buffile_test.c
new file mode 100644
index 0000000..44bf157
--- /dev/null
+++ b/contrib/buffile_test/buffile_test.c
@@ -0,0 +1,132 @@
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/nbtree.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "catalog/index.h"
+#include "catalog/pg_am.h"
+#include "commands/tablecmds.h"
+#include "lib/bloomfilter.h"
+#include "miscadmin.h"
+#include "storage/buf_internals.h"
+#include "storage/buffile.h"
+#include "storage/fd.h"
+#include "utils/memutils.h"
+#include "utils/snapmgr.h"
+
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(buffile_test);
+
+/* test truncate */
+static void
+buffile_test1(SharedFileSet *fileset)
+{
+	BufFile    *fd;
+	int			fileno = 0;
+	off_t			offset = 0;
+	size_t  nread = 0;
+	char		readbuf[100];
+
+	fd = BufFileCreateShared(fileset, "test_file");
+	BufFileWrite(fd, "aaaaaaaaaa", 10);
+	BufFileTell(fd, &fileno, &offset);
+	BufFileWrite(fd, "bbbbbbbbbb", 10);
+	BufFileTruncateShared(fd, fileno, offset);
+	BufFileWrite(fd, "ccccc", 5);
+	BufFileSeek(fd, 0, 0, SEEK_SET);
+	nread = BufFileRead(fd, readbuf, 20);
+
+	if (nread != 15)
+		elog(ERROR, "FAILED: unexpected bytes read");
+	else if (strncmp(readbuf, "aaaaaaaaaaccccc", 15) != 0)
+		elog(ERROR, "FAILED: unexpected data read");
+	else
+		elog(WARNING, "PASSED: expected bytes read");
+	BufFileClose(fd);
+
+	BufFileDeleteShared(fileset, "test_file");
+}
+
+#define MAX_PHYSICAL_FILESIZE	0x40000000
+#define BUFFILE_SEG_SIZE		(MAX_PHYSICAL_FILESIZE / BLCKSZ)
+
+/* test truncate on multiple files*/
+static void
+buffile_test2(SharedFileSet *fileset)
+{
+	BufFile    *fd;
+	int			fileno = 0;
+	off_t		offset = 0;
+	size_t  	size = 0;
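+	/* filler data; the initializer sets only the first byte to 'b' */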
+	char		buf[BLCKSZ] = {'b'};
+	int			i;
+
+	fd = BufFileCreateShared(fileset, "test_file");
+	BufFileWrite(fd, "aaaaaaaaaa", 10);
+	BufFileTell(fd, &fileno, &offset);
+
+	/* create 3 files */
+	for (i = 0; i < 3* BUFFILE_SEG_SIZE; i++)
+	{
+		BufFileWrite(fd, buf, BLCKSZ);
+	}
+
+	/* seek to some location in the first file */
+	BufFileSeek(fd, 0, 10, SEEK_SET);
+	BufFileWrite(fd, "aaaaa", 5);
+
+	/* truncate within the first file and in same buffer */
+	BufFileTruncateShared(fd, fileno, offset + 7);
+	size = BufFileSize(fd);
+	if (size == 17)
+		elog(WARNING, "PASSED: expected file size");
+	else
+		elog(WARNING, "FAILED: unexpected file size");
+
+	BufFileClose(fd);
+
+	BufFileDeleteShared(fileset, "test_file");
+}
+
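+/*
+ * Test fileset cleanup: create files in two shared filesets, delete one
+ * fileset explicitly, then exit abruptly so the remaining files have to be
+ * removed by the on-exit cleanup callbacks.
+ */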
+static void
+buffile_test3(SharedFileSet *fileset)
+{
+	BufFile    *fd;
+	SharedFileSet *fileset1;
+
+	fileset1 = palloc0(sizeof(SharedFileSet));
+	fd = BufFileCreateShared(fileset, "test_file");
+	BufFileWrite(fd, "aaaaaaaaaa", 10);
+	SharedFileSetInit(fileset1, NULL);
+	BufFileCreateShared(fileset1, "test_file1");
+	SharedFileSetDeleteAll(fileset1);
+	proc_exit(1);
+}
+
+Datum
+buffile_test(PG_FUNCTION_ARGS)
+{
+	SharedFileSet *fileset;
+
+	MemoryContextSwitchTo(TopMemoryContext);
+	fileset = palloc(sizeof(SharedFileSet));
+	SharedFileSetInit(fileset, NULL);
+
+	buffile_test1(fileset);
+	//buffile_test2(fileset);
+	buffile_test3(fileset);
+
+
+	SharedFileSetDeleteAll(fileset);
+
+	PG_RETURN_VOID();
+}
diff --git a/contrib/buffile_test/buffile_test.control b/contrib/buffile_test/buffile_test.control
new file mode 100644
index 0000000..a7c6fa2
--- /dev/null
+++ b/contrib/buffile_test/buffile_test.control
@@ -0,0 +1,5 @@
+# buffile_test extension
+comment = 'test buffile'
+default_version = '1.0'
+module_pathname = '$libdir/buffile_test'
+relocatable = true
-- 
1.8.3.1

#507Neha Sharma
neha.sharma@enterprisedb.com
In reply to: Amit Kapila (#504)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi Amit/Dilip,

I have tested a few scenarios on top of the v56 patches, where the
replication worker still had a few subtransactions in an uncommitted
state and we restarted the publisher server.
No crashes or data discrepancies were observed; the verified test
scenarios are attached.

*Data Setup:*
*Publication Server postgresql.conf :*
echo "wal_level = logical
max_wal_senders = 10
max_replication_slots = 15
wal_log_hints = on
hot_standby_feedback = on
wal_receiver_status_interval = 1
listen_addresses='*'
log_min_messages=debug1
wal_sender_timeout = 0
logical_decoding_work_mem=64kB

*Subscription Server postgresql.conf :*
wal_level = logical
max_wal_senders = 10
max_replication_slots = 15
wal_log_hints = on
hot_standby_feedback = on
wal_receiver_status_interval = 1
listen_addresses='*'
log_min_messages=debug1
wal_sender_timeout = 0
logical_decoding_work_mem=64kB
port=5433

*Initial setup:*
*Publication Server:*
create table t(a int PRIMARY KEY ,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select
array_agg(md5(g::text))::text from generate_series(1, 256) g';
create publication test_pub for table t
with(PUBLISH='insert,delete,update,truncate');
alter table t replica identity FULL ;
insert into t values (generate_series(1,20),large_val()) ON CONFLICT (a) DO
UPDATE SET a=EXCLUDED.a*300;

*Subscription server:*
create table t(a int,b text);
create subscription test_sub CONNECTION 'host=localhost port=5432
dbname=postgres user=edb' PUBLICATION test_pub WITH ( slot_name =
test_slot_sub1,streaming=on);

Thanks.
--
Regards,
Neha Sharma

On Mon, Aug 31, 2020 at 1:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Another comment:

+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+ HASH_SEQ_STATUS hash_seq;
+ RelationSyncEntry *entry;
+
+ Assert(RelationSyncCache != NULL);
+
+ hash_seq_init(&hash_seq, RelationSyncCache);
+ while ((entry = hash_seq_search(&hash_seq)) != NULL)
+ {
+ if (is_commit)
+ entry->schema_sent = true;

How is it correct to set 'entry->schema_sent' for all the entries in
RelationSyncCache? Consider a case where, due to an invalidation in an
unrelated transaction, we have set the schema_sent flag for a
particular relation 'r1' to 'false', and that transaction is executed
before the current streamed transaction for which we are performing
the commit and called this function. It will then set the flag for the
unrelated entry, in this case 'r1', which doesn't seem correct to me.
Or, if this is correct, it would be a good idea to write some comments
about it.

Few more comments:
1.
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
application_name=$appname' PUBLICATION tap_pub"
+);

In most of the tests, we are using the above statement to create a
subscription. Don't we need the (streaming = 'on') parameter while
creating the subscription? Is there a reason for not doing so in this
patch itself?

2.
009_stream_simple.pl
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000)
s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});

By how much does this data exceed the 64kB limit? I just wanted to make
sure it is not borderline, such that due to some alignment issues the
streaming doesn't happen on some machines. Also, how does such a test
ensure that the streaming has happened? Given the way we are
checking results, won't it be the same for the non-streaming case as
well?

3.
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b =
md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how is this test relevant to streaming mode?

4. I have checked that in one of the previous patches, we have a test
v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
quite similar to what we have in

v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
If there is any difference that can cover more scenarios then can we
consider merging them into one test?

Apart from the above, I have made a few changes in the attached patch,
mainly to simplify the code in one place; I have also added/edited a few
comments, made some other cosmetic changes, and renamed the test case
files, as the initials of their names matched other tests in the same
directory.

--
With Regards,
Amit Kapila.

Attachments:

test_case (application/octet-stream)
#508Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#506)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Another comment:

+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+ HASH_SEQ_STATUS hash_seq;
+ RelationSyncEntry *entry;
+
+ Assert(RelationSyncCache != NULL);
+
+ hash_seq_init(&hash_seq, RelationSyncCache);
+ while ((entry = hash_seq_search(&hash_seq)) != NULL)
+ {
+ if (is_commit)
+ entry->schema_sent = true;

How is it correct to set 'entry->schema_sent' for all the entries in
RelationSyncCache? Consider a case where, due to an invalidation in an
unrelated transaction, we have set the schema_sent flag for a
particular relation 'r1' to 'false', and that transaction is executed
before the current streamed transaction for which we are performing
the commit and called this function. It would then set the flag for
the unrelated entry, in this case 'r1', which doesn't seem correct to
me. Or, if this is correct, it would be a good idea to write some
comments about it.

Yeah, this is wrong, I have fixed this issue in the attached patch
and also added a new test for the same.

In functions cleanup_rel_sync_cache and
get_schema_sent_in_streamed_txn, let's cast the result of lfirst_int
to uint32, as suggested by Tom [1]. Also, let's keep the way we
compare xids consistent in both functions, i.e., if (xid == lfirst_int(lc)).

The behavior tested by the test case added for this is not clear
primarily because of comments.

+++ b/src/test/subscription/t/021_stream_schema.pl
@@ -0,0 +1,80 @@
+# Test behavior with streaming transaction exceeding logical_decoding_work_mem
...
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM
generate_series(3,3000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+COMMIT;
+});
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM
generate_series(3001,3005) s(i);
+COMMIT;
+});
+wait_for_caught_up($node_publisher, $appname);

I understand how this test exercises the functionality related to the
schema_sent stuff, but neither the comments atop the file nor those
atop the test case explain it clearly.

Few more comments:

2.
009_stream_simple.pl
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});

How far above the 64kB limit is this data? I just wanted to make sure
it is not borderline, such that due to some alignment issues the
streaming doesn't happen on some machines. Also, how does such a test
ensure that the streaming has happened? The way we are checking
results, won't it be the same for the non-streaming case as well?

Only for this case, or do you mean for all the tests?

It is better to do it for all tests, and I have clarified this in my
next email sent yesterday [2], where I have raised a few more comments
as well. I hope you have not missed that email.
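
As a concrete sketch of the kind of size-margin check each test could
add (assuming the publisher sets logical_decoding_work_mem to its 64kB
minimum, treating on-disk heap size as only a rough proxy for the
decoded change size, and with an arbitrary 4x margin):

# Sketch: pin the decoding memory limit at its minimum, and verify the
# generated data is comfortably above it, so streaming does not hinge
# on borderline per-change accounting or alignment.
$node_publisher->append_conf('postgresql.conf',
	'logical_decoding_work_mem = 64kB');
$node_publisher->restart;

my $tabsize = $node_publisher->safe_psql('postgres',
	"SELECT pg_total_relation_size('test_tab')");
die "test data too close to the 64kB limit" if $tabsize < 4 * 64 * 1024;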

3.
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how this test is relevant to streaming mode?

I agree, it is not specific to the streaming.

Apart from the above, I have made a few changes in the attached patch,
mainly to simplify the code in one place, add/edit a few comments, make
some other cosmetic changes, and rename the test case files, as the
initials of their names matched those of other tests in the same
directory.

Changes look fine to me except this

+

+ /* the value must be on/off */
+ if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid streaming value")));
+
+ /* enable streaming if it's 'on' */
+ *enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);

I mean, for streaming, why do we need to handle it differently from
the other surrounding code, for example the "binary" option?

Hmm, I think the code changed by me is to make it look similar to the
binary option. The code you have quoted above is from the patch
version prior to what I have sent. See the code snippet after my
changes:
@@ -182,6 +222,16 @@ parse_output_parameters(List *options, uint32
*protocol_version,

  *binary = defGetBoolean(defel);
  }
+ else if (strcmp(defel->defname, "streaming") == 0)
+ {
+ if (streaming_given)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("conflicting or redundant options")));
+ streaming_given = true;
+
+ *enable_streaming = defGetBoolean(defel);
+ }

This looks exactly like the binary option. Can you please check it
once again and confirm?

[1]: /messages/by-id/3955127.1598880523@sss.pgh.pa.us
[2]: /messages/by-id/CAA4eK1JjrcK6bk+ur3J+kLsfz4+ipJFN7VcRd3cXr4gG5ZWWig@mail.gmail.com

--
With Regards,
Amit Kapila.

#509Amit Kapila
amit.kapila16@gmail.com
In reply to: Neha Sharma (#507)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Aug 31, 2020 at 10:27 PM Neha Sharma
<neha.sharma@enterprisedb.com> wrote:

Hi Amit/Dilip,

I have tested a few scenarios on top of the v56 patches, where the replication worker still had a few subtransactions in an uncommitted state and we restarted the publisher server.
No crashes or data discrepancies were observed; attached are the test scenarios verified.

Thanks, I have pushed the fix
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=4ab77697f67aa5b90b032b9175b46901859da6d7).

--
With Regards,
Amit Kapila.

#510Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#508)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

In functions cleanup_rel_sync_cache and
get_schema_sent_in_streamed_txn, let's cast the result of lfirst_int
to uint32, as suggested by Tom [1]. Also, let's keep the way we
compare xids consistent in both functions, i.e., if (xid == lfirst_int(lc)).

Fixed this in the attached patch.

The behavior tested by the test case added for this is not clear
primarily because of comments.

+++ b/src/test/subscription/t/021_stream_schema.pl
@@ -0,0 +1,80 @@
+# Test behavior with streaming transaction exceeding logical_decoding_work_mem
...
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM
generate_series(3,3000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+COMMIT;
+});
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM
generate_series(3001,3005) s(i);
+COMMIT;
+});
+wait_for_caught_up($node_publisher, $appname);

I understand how this test exercises the functionality related to the
schema_sent stuff, but neither the comments atop the file nor those
atop the test case explain it clearly.

Added comments for this test.

Few more comments:

2.
009_stream_simple.pl
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});

How far above the 64kB limit is this data? I just wanted to make sure
it is not borderline, such that due to some alignment issues the
streaming doesn't happen on some machines. Also, how does such a test
ensure that the streaming has happened? The way we are checking
results, won't it be the same for the non-streaming case as well?

Only for this case, or do you mean for all the tests?

I have not done this yet.

It is better to do it for all tests, and I have clarified this in my
next email sent yesterday [2], where I have raised a few more comments
as well. I hope you have not missed that email.

3.
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how this test is relevant to streaming mode?

I agree, it is not specific to the streaming.

I think we can leave this as of now. After committing the stats
patches by Sawada-San and Ajin, we might be able to improve this test.

+sub wait_for_caught_up
+{
+ my ($node, $appname) = @_;
+
+ $node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication
WHERE application_name = '$appname';"
+ ) or die "Timed ou

The patch has added this in all the test files. If it is used in so
many tests then we need to add it in some generic place
(PostgresNode.pm), but actually, I am not sure if we need it at all.
Why can't the existing wait_for_catchup in PostgresNode.pm serve the
same purpose?

Changed as per this suggestion.
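
For reference, a sketch of the one-line replacement in each test
(wait_for_catchup already exists in PostgresNode.pm; with no extra
arguments it waits on the 'replay' position):

# Sketch: reuse the existing helper instead of a hand-rolled poll loop;
# it waits until the walsender for $appname reports catching up to the
# publisher's current WAL position.
$node_publisher->wait_for_catchup($appname);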

2.
In system_views.sql,

-- All columns of pg_subscription except subconninfo are readable.
REVOKE ALL ON pg_subscription FROM public;
GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary,
subslotname, subpublications)
ON pg_subscription TO public;

Here, we need to update this for the substream column as well.

Fixed.

3. Update describeSubscriptions() to show the 'substream' value in \dRs.

4. Also, let's add a few tests in subscription.sql, as we did for the
'binary' option in commit 9de77b5453.

Fixed both the above comments.

5. I think we can merge the pg_dump related changes (the last version
posted in the mail thread is v53-0005-Add-streaming-option-in-pg_dump)
into the main patch; one minor comment on the pg_dump changes:
@@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
if (strcmp(subinfo->subbinary, "t") == 0)
appendPQExpBuffer(query, ", binary = true");

+ if (strcmp(subinfo->substream, "f") != 0)
+ appendPQExpBuffer(query, ", streaming = on");
if (strcmp(subinfo->subsynccommit, "off") != 0)
appendPQExpBuffer(query, ", synchronous_commit = %s",
fmtId(subinfo->subsynccommit));

Keep one blank line between the substream and subsynccommit option
code to keep it consistent with the nearby code.

Changed as per this suggestion.

I have fixed all the comments except the below ones.
1. Verify the data size in the various tests to ensure that it is
above logical_decoding_work_mem.
2. I have checked that in one of the previous patches, we have a test
v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
quite similar to what we have in
v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
If there is any difference that can cover more scenarios, then can we
consider merging them into one test?
3. +# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how this test is relevant to streaming mode?
4. Apart from the above, I think we should consider minimizing the
test cases that are committed with the base patch. We can add more
tests later.

Kindly verify the changes.

--
With Regards,
Amit Kapila.

Attachments:

v58-0001-Add-support-for-streaming-to-built-in-logical-re.patch (application/octet-stream)
From e3777881af3ad237ec8af5f4a59805b4c9ab2952 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 1 Sep 2020 19:19:59 +0530
Subject: [PATCH v58] Add support for streaming to built-in logical
 replication.

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, so as to identify in-progress
transactions, and allow adding additional bits of information (e.g.
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so we
don't have anywhere to send the data anyway.

Author: Tomas Vondra, Dilip Kumar and Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/monitoring.sgml                       |  16 +
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            |  46 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 162 +++-
 src/backend/replication/logical/worker.c           | 951 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 366 +++++++-
 src/bin/pg_dump/pg_dump.c                          |  18 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  10 +-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  42 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/regress/expected/subscription.out         |  63 +-
 src/test/regress/sql/subscription.sql              |  15 +
 src/test/subscription/t/015_stream_simple.pl       |  77 ++
 src/test/subscription/t/016_stream_subxact.pl      |  93 ++
 src/test/subscription/t/017_stream_ddl.pl          |  86 ++
 .../subscription/t/018_stream_subxact_abort.pl     |  96 +++
 .../subscription/t/019_stream_subxact_ddl_abort.pl |  76 ++
 src/test/subscription/t/020_stream_binary.pl       |  77 ++
 src/test/subscription/t/021_stream_schema.pl       |  76 ++
 src/tools/pgindent/typedefs.list                   |   3 +
 28 files changed, 2246 insertions(+), 73 deletions(-)
 create mode 100644 src/test/subscription/t/015_stream_simple.pl
 create mode 100644 src/test/subscription/t/016_stream_subxact.pl
 create mode 100644 src/test/subscription/t/017_stream_ddl.pl
 create mode 100644 src/test/subscription/t/018_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/019_stream_subxact_ddl_abort.pl
 create mode 100644 src/test/subscription/t/020_stream_binary.pl
 create mode 100644 src/test/subscription/t/021_stream_schema.pl

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d973e11..673a0e7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1509,6 +1509,22 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>WALWrite</literal></entry>
       <entry>Waiting for a write to a WAL file.</entry>
      </row>
+     <row>
+      <entry><literal>LogicalChangesRead</literal></entry>
+      <entry>Waiting for a read from a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalChangesWrite</literal></entry>
+      <entry>Waiting for a write to a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactRead</literal></entry>
+      <entry>Waiting for a read from a logical subxact file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactWrite</literal></entry>
+      <entry>Waiting for a write to a logical subxact file.</entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a1666b3 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a2d6130..ed4f3f1 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1128,7 +1128,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..9426e1d 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,11 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+	{
+		*streaming_given = false;
+		*streaming = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +200,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +353,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +378,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -427,6 +446,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
+	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -698,6 +718,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +729,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +762,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +786,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +831,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23..5f4b168 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "LogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "LogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "LogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "LogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 8afa5a2..ad57409 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -425,6 +425,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..f82236e 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,126 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+/*
+ * Write the information for the start stream message to the output stream.
+ */
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+/*
+ * Read the information about the start stream message from output stream.
+ */
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+/*
+ * Write the stop stream message to the output stream.
+ */
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+/*
+ * Write STREAM COMMIT to the output stream.
+ */
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read STREAM COMMIT from the output stream.
+ */
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+/*
+ * Write STREAM ABORT to the output stream. Note that xid and subxid will be
+ * same for the top-level transaction abort.
+ */
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+/*
+ * Read STREAM ABORT from the output stream.
+ */
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b576e34..f022b81 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,45 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead, the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires handling aborts of both the toplevel transaction and of
+ * subtransactions. This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere.
+ *
+ * We use BufFiles instead of normal temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size
+ * limit, (b) it provides a way for automatic cleanup on error, and (c) it
+ * allows these files to survive across local transactions, being opened and
+ * closed at stream start and stop. We decided to use the SharedFileSet
+ * infrastructure because without it the files get deleted on the closure of
+ * the file, and if we decided to keep stream files open across the start and
+ * stop of a stream it would consume a lot of memory (more than 8K for each
+ * BufFile, and there could be multiple such BufFiles, as the subscriber could
+ * receive multiple start/stop streams for different transactions before
+ * getting the commit). Moreover, if we don't use SharedFileSet then we also
+ * need to invent a new way to pass filenames to the BufFile APIs so that we
+ * are allowed to open the desired file across multiple stream-open calls for
+ * the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +67,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +99,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +109,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +138,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry. Whenever we see a new xid we create this entry in the
+ * xidhash and along with it create the streaming file and store the fileset handle.
+ * The subxact file is created iff there is any subxact info under this xid. This
+ * entry is used on the subsequent streams for the xid to get the corresponding
+ * fileset handles, so storing them in hash makes the search faster.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +166,65 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with shared file
+ * set for streaming and subxact files.
+ */
+static HTAB *xidhash = NULL;
+/* BufFile handle of the current streaming file */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+/* Sub-transaction data for the current streaming transaction */
+typedef struct ApplySubXactData
+{
+	uint32		nsubxacts;		/* number of sub-transactions */
+	uint32		nsubxacts_max;	/* current capacity of subxacts */
+	TransactionId subxact_last; /* xid of the last sub-transaction */
+	SubXactInfo *subxacts;		/* sub-xact offset in changes file */
+} ApplySubXactData;
+
+static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +296,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +757,336 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside streaming transaction or inside
+	 * remote transaction and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if (!in_streamed_transaction &&
+		(!in_remote_transaction ||
+		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop. We need the transaction for handling the buffile,
+	 * used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/*
+	 * Initialize the xidhash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing subxact file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact abort of toplevel xact, so
+	 * just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * We can't use the binary search here as subxact XIDs won't
+		 * necessarily arrive in sorted order; consider the case where we have
+		 * released the savepoint for multiple subtransactions and then
+		 * performed rollback to savepoint for one of the earlier
+		 * sub-transactions.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		for (i = subxact_data.nsubxacts; i > 0; i--)
+		{
+			if (subxact_data.subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxact_data.subxacts[subidx].fileno,
+							  subxact_data.subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		subxact_data.nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	/*
+	 * Allocate file handle and memory required to process all the messages in
+	 * TopTransactionContext to avoid them getting reset after each message is
+	 * processed.
+	 */
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file \"%s\"", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1099,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1117,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1156,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1274,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1426,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1799,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1940,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2068,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when the streaming mode
+	 * is enabled. This context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2180,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1938,6 +2444,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->name, MySubscription->name) != 0 ||
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
+		newsub->stream != MySubscription->stream ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -1979,6 +2486,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always over-written as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there is no subtransaction then there is nothing to do, but if we
+	 * already have a subxact file then delete it.
+	 */
+	if (subxact_data.nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			SharedFileSetDeleteAll(ent->subxact_fileset);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+		return;
+	}
+
+	subxact_filename(path, subid, xid);
+
+	/*
+	 * Create the subxact file if it is not already created, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain shared fileset across multiple stream
+		 * start/stop calls.  So, need to allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &subxact_data.nsubxacts, sizeof(subxact_data.nsubxacts));
+	BufFileWrite(fd, subxact_data.subxacts, len);
+
+	BufFileClose(fd);
+
+	/* free the memory allocated for subxact info */
+	cleanup_subxact_info();
+}
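+
+/*
+ * For reference, a sketch of the on-disk layout produced above (field names
+ * per SubXactInfo, see subxact_info_add; the integer widths are whatever is
+ * declared there):
+ *
+ *   [nsubxacts][{xid, fileno, offset}][{xid, fileno, offset}]...
+ */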
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the structure subxact_data that can be
+ * used later.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxact_data.subxacts);
+	Assert(subxact_data.nsubxacts == 0);
+	Assert(subxact_data.nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* We must have created the entry at stream start */
+	Assert(found);
+
+	/*
+	 * If subxact_fileset is not valid, we don't have any subxact info for
+	 * this transaction.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &subxact_data.nsubxacts,
+					sizeof(subxact_data.nsubxacts)) !=
+		sizeof(subxact_data.nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	subxact_data.nsubxacts_max = 1 << my_log2(subxact_data.nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context. We need
+	 * it for the duration of the current stream, so that we can append new
+	 * subtransaction info to it. At stream stop we flush this information to
+	 * the subxact file and reset the logical streaming context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxact_data.subxacts = palloc(subxact_data.nsubxacts_max *
+								   sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if (len > 0 && BufFileRead(fd, subxact_data.subxacts, len) != len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
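+
+/*
+ * Putting the pieces together (a sketch, inferred from the callers):
+ * subxact_info_read() repopulates subxact_data at the start of a streamed
+ * chunk, subxact_info_add() records the offset of each new subxact as
+ * changes are serialized, and subxact_info_write() flushes everything back
+ * to the subxact file at stream stop.
+ */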
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	SubXactInfo *subxacts = subxact_data.subxacts;
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're adding a change for the same subxact as in the
+	 * previous call, so we can bail out quickly (its first change has
+	 * already been recorded).
+	 */
+	if (subxact_data.subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_data.subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts.
+	 * We intentionally scan the array from the tail, because we're likely
+	 * adding a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = subxact_data.nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (subxact_data.nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		subxact_data.nsubxacts_max = 128;
+
+		/*
+		 * Allocate this memory for subxacts in per-stream context, see
+		 * subxact_info_read.
+		 */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (subxact_data.nsubxacts == subxact_data.nsubxacts_max)
+	{
+		subxact_data.nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts,
+							subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[subxact_data.nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd,
+				&subxacts[subxact_data.nsubxacts].fileno,
+				&subxacts[subxact_data.nsubxacts].offset);
+
+	subxact_data.nsubxacts++;
+	subxact_data.subxacts = subxacts;
+}
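+
+/*
+ * Given the BufFileTell() call above, SubXactInfo (declared earlier in this
+ * file) is assumed to look roughly like:
+ *
+ *   typedef struct SubXactInfo
+ *   {
+ *       TransactionId xid;     /* XID of the subxact */
+ *       int           fileno;  /* file number in the buffile */
+ *       off_t         offset;  /* offset within that file */
+ *   } SubXactInfo;
+ */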
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
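+
+/*
+ * For example, with subscription OID 16394 and toplevel XID 512 (made-up
+ * values), the two helpers above produce "16394-512.subxacts" and
+ * "16394-512.changes" respectively.
+ */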
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	SharedFileSetDeleteAll(ent->stream_fileset);
+	pfree(ent->stream_fileset);
+	ent->stream_fileset = NULL;
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		SharedFileSetDeleteAll(ent->subxact_fileset);
+		pfree(ent->subxact_fileset);
+		ent->subxact_fileset = NULL;
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open a file that we'll use to serialize changes for a toplevel
+ * transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buffile, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * Create/open the buffile under the logical streaming context so that it
+	 * remains open until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
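+
+/*
+ * Note: the xidhash entry filled in above (for the first segment) is the
+ * one stream_cleanup_files() removes later, and the SharedFileSet lives in
+ * ApplyContext so that it survives across the streamed chunks of the same
+ * transaction.
+ */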
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: length (not including the
+ * length field itself), action code (identifying the message type) and
+ * message contents (without the subxact TransactionId value).
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
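+
+/*
+ * A sketch of the resulting record layout in the changes file:
+ *
+ *   [int len][char action][payload: len - 1 bytes, the message after the XID]
+ *
+ * where len counts the action byte plus the payload, but not the length
+ * field itself.
+ */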
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxact_data.subxacts)
+		pfree(subxact_data.subxacts);
+
+	subxact_data.subxacts = NULL;
+	subxact_data.subxact_last = InvalidTransactionId;
+	subxact_data.nsubxacts = 0;
+	subxact_data.nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2151,6 +3091,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..558b52f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,17 +47,40 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines whether the current schema record was
+ * already sent to the subscriber (in which case we don't need to send it
+ * again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order the transactions are sent in. Also, the (sub)transactions might get
+ * aborted, so we need to send the schema for each (sub)transaction so that
+ * we don't lose the schema information on abort. To handle this, we maintain
+ * a list of XIDs (streamed_txns) for which we have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
@@ -70,6 +93,8 @@ typedef struct RelationSyncEntry
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,10 +120,15 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
 
 /*
  * Specify output plugin callbacks
@@ -115,16 +145,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +222,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			*enable_streaming = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +244,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +268,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +289,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +320,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +383,47 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 *
+	 * XXX There is scope for optimization here. Currently, we always send
+	 * the schema the first time in a streaming transaction, but we could
+	 * probably avoid that by checking the 'relentry->schema_sent' flag.
+	 * However, before doing that we need to study its impact on the case
+	 * where we have a mix of streaming and non-streaming transactions.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +439,24 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +480,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * This is called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,10 +501,19 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
 
 	if (!is_publishable_relation(relation))
 		return;
 
+	/*
+	 * Remember the xid for the change in streaming mode. We need to send the
+	 * xid with each change so that the subscriber can associate the change
+	 * with the proper (sub)transaction and, on abort, discard the
+	 * corresponding changes.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
 	relentry = get_rel_sync_entry(data, RelationGetRelid(relation));
 
 	/* First check the table filter */
@@ -406,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +558,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +583,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +604,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +630,11 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,118 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * START STREAM callback
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char	   *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * STOP STREAM callback
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
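+
+/*
+ * To summarize the stream callbacks above, a large transaction reaches the
+ * subscriber as a sequence of chunks (a sketch; the apply-side handlers
+ * were added earlier in this patch):
+ *
+ *   stream_start -> changes/truncates -> stream_stop    (repeated)
+ *   ... followed eventually by stream_commit or stream_abort
+ */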
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +892,39 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the relation's schema was already sent in the given
+ * streamed transaction. We expect a relatively small number of streamed
+ * transactions, so a simple list search is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == (uint32) lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid in the rel sync entry for which we have already sent the schema
+ * of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
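+	/*
+	 * The list must survive the current (sub)transaction, so keep it in
+	 * CacheMemoryContext; it is released again by cleanup_rel_sync_cache()
+	 * or by the relcache invalidation callback.
+	 */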
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1054,59 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction is committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+	ListCell   *lc;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		/*
+		 * If the committed XID is in the entry's list, we can set the
+		 * schema_sent flag: the subscriber now has the corresponding schema,
+		 * so we don't need to send it again unless the relation gets
+		 * invalidated.
+		 */
+		foreach(lc, entry->streamed_txns)
+		{
+			if (xid == (uint32) lfirst_int(lc))
+			{
+				if (is_commit)
+					entry->schema_sent = true;
+
+				entry->streamed_txns =
+					foreach_delete_current(entry->streamed_txns, lc);
+				break;
+			}
+		}
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1141,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 2cb3f9b..d3ca54e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4202,6 +4202,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4241,10 +4242,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4264,6 +4272,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4287,6 +4296,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4358,6 +4369,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index da97b73..cc10c7c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char	   *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index d81f157..03e3ec0 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5963,7 +5963,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false};
+	false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -5989,11 +5989,13 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode is only supported in v14 and higher */
+		/* Binary mode and streaming are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
-							  ", subbinary AS \"%s\"\n",
-							  gettext_noop("Binary"));
+							  ", subbinary AS \"%s\"\n"
+							  ", substream AS \"%s\"\n",
+							  gettext_noop("Binary"),
+							  gettext_noop("Streaming"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming in-progress transactions */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..53905ee 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
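+
+/*
+ * For illustration (slot and publication names are made up), a client asks
+ * for streaming at connect time via the output-plugin options, e.g.:
+ *
+ *   START_REPLICATION SLOT "sub" LOGICAL 0/0
+ *       (proto_version '2', publication_names '"pub"', streaming 'on')
+ */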
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo in,
+												   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
 
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index d71db0d..2fa9bce 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                      List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off                | dbname=regress_doesnotexist
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                          List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off                | dbname=regress_doesnotexist2
+                                                                List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                            List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | local              | dbname=regress_doesnotexist2
+                                                                  List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,42 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                      List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off                | dbname=regress_doesnotexist
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                      List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off                | dbname=regress_doesnotexist
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - streaming must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = foo);
+ERROR:  streaming requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index eeb2ec0..14fa0b2 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -132,6 +132,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail - streaming must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/015_stream_simple.pl b/src/test/subscription/t/015_stream_simple.pl
new file mode 100644
index 0000000..5cd30d4
--- /dev/null
+++ b/src/test/subscription/t/015_stream_simple.pl
@@ -0,0 +1,77 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/016_stream_subxact.pl b/src/test/subscription/t/016_stream_subxact.pl
new file mode 100644
index 0000000..c7eac40
--- /dev/null
+++ b/src/test/subscription/t/016_stream_subxact.pl
@@ -0,0 +1,93 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/017_stream_ddl.pl b/src/test/subscription/t/017_stream_ddl.pl
new file mode 100644
index 0000000..b400511
--- /dev/null
+++ b/src/test/subscription/t/017_stream_ddl.pl
@@ -0,0 +1,86 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
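The expected counts here follow from which rows predate each
ALTER TABLE ... ADD COLUMN: rows inserted before a column is added have NULL
in that column. A quick standalone check of the arithmetic (not part of the
test):

my $total   = 2002;             # rows 1..2002 after the three transactions
my $count_c = $total - 3;       # rows 1..3 were inserted before ADD COLUMN c
my $count_d = $total - 1000;    # rows 1..1000 predate ADD COLUMN d
my $count_e = $total - 2001;    # rows 1..2001 predate ADD COLUMN e
print "$total|$count_c|$count_d|$count_e\n";    # 2002|1999|1002|1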
diff --git a/src/test/subscription/t/018_stream_subxact_abort.pl b/src/test/subscription/t/018_stream_subxact_abort.pl
new file mode 100644
index 0000000..2fa9efb
--- /dev/null
+++ b/src/test/subscription/t/018_stream_subxact_abort.pl
@@ -0,0 +1,96 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+# large (streamed) transaction with subscriber receiving out of order
+# subtransaction ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3001,3500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(4001,4500) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(5001,5500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(6001,6500) s(i);
+RELEASE s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(7001,7500) s(i);
+ROLLBACK TO s1;
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1500|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
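The two expected counts follow from which inserts survive: ROLLBACK TO a
savepoint discards everything done after that savepoint was set, including the
changes of a subtransaction already flattened by RELEASE. A standalone sketch
of the arithmetic (not part of the test):

my $after_txn1 = 2 + 498 + 500;       # rows 1,2 + 3..500 + 2501..3000
my $after_txn2 = $after_txn1 + 500;   # plus 3001..3500; everything after s1 is rolled back
print "$after_txn1 $after_txn2\n";    # prints "1000 1500"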
diff --git a/src/test/subscription/t/019_stream_subxact_ddl_abort.pl b/src/test/subscription/t/019_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..255c93d
--- /dev/null
+++ b/src/test/subscription/t/019_stream_subxact_ddl_abort.pl
@@ -0,0 +1,76 @@
+# Test streaming of large transaction with subtransactions, DDLs, DMLs, and
+# rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming=on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/020_stream_binary.pl b/src/test/subscription/t/020_stream_binary.pl
new file mode 100644
index 0000000..fb738f1
--- /dev/null
+++ b/src/test/subscription/t/020_stream_binary.pl
@@ -0,0 +1,77 @@
+# Test streaming of simple large transaction in binary mode
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on, binary = true)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/021_stream_schema.pl b/src/test/subscription/t/021_stream_schema.pl
new file mode 100644
index 0000000..aed0626
--- /dev/null
+++ b/src/test/subscription/t/021_stream_schema.pl
@@ -0,0 +1,76 @@
+# Test whether the schema is sent appropriately when there is a mix of
+# streaming and non-streaming transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# A large (streamed) transaction with DDL and DML. One of the DDL statements
+# is performed after the DML to ensure that we invalidate the schema sent for
+# test_tab, so that the next transaction has to send the schema again.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(3,3000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+COMMIT;
+});
+
+# A small transaction that won't get streamed. This is just to ensure that we
+# send the schema again to reflect the last column added in the previous test.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(3001,3005) s(i);
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d) FROM test_tab");
+is($result, qq(3005|3003|5), 'check the data inserted into the new column is reflected');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d99046..500623e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -111,6 +111,7 @@ Append
 AppendPath
 AppendRelInfo
 AppendState
+ApplySubXactData
 Archive
 ArchiveEntryPtrType
 ArchiveFormat
@@ -2370,6 +2371,7 @@ StopList
 StopWorkersData
 StrategyNumber
 StreamCtl
+StreamXidHash
 StringInfo
 StringInfoData
 StripnullState
@@ -2380,6 +2382,7 @@ SubPlanState
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
 SubXactEvent
+SubXactInfo
 SubplanResultRelHashElem
 SubqueryScan
-- 
1.8.3.1

#511Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#510)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have fixed all the comments except

..

3. +# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how is this test relevant to streaming mode?

I think we can keep this test in one of the newly added tests, say in
015_stream_simple.pl, to ensure that after a streaming transaction, a
non-streaming one behaves as expected. So we can change the comment to
"Change the local values of the extra columns on the subscriber,
update publisher, and check that subscriber retains the expected
values. This is to ensure that non-streaming transactions behave
properly after a streaming transaction."

We can remove this test from the other two places
016_stream_subxact.pl and 020_stream_binary.pl.
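
For illustration, the relocated check could look like this in
015_stream_simple.pl (a sketch only; the node and table names, and the 3334
counts, assume the data set of the test quoted above):

# Change the local values of the extra columns on the subscriber,
# update publisher, and check that subscriber retains the expected
# values. This is to ensure that non-streaming transactions behave
# properly after a streaming transaction.
$node_subscriber->safe_psql('postgres',
	"UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");

$node_publisher->wait_for_catchup($appname);

$result = $node_subscriber->safe_psql('postgres',
	"SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data');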

4. Apart from the above, I think we should think of minimizing the
test cases which can be committed with the base patch. We can later
add more tests.

We can combine the tests in 015_stream_simple.pl and
020_stream_binary.pl as I can't see a good reason to keep them
separate. Then, I think we can keep only this part with the main patch
and extract other tests into a separate patch. Basically, we can
commit the basic tests with the main patch and then keep the advanced
tests separately. I am afraid that there are some tests that don't add
much value so we can review them separately.

One minor comment: for the option 'streaming = on', the spacing should
be consistent in all the tests.

Similarly, we can combine 017_stream_ddl.pl and 021_stream_schema.pl
as both contain similar tests. As per the above suggestion, this will
be in a separate patch though.

If you agree with the above suggestions then kindly make these
adjustments and send the updated patch.

--
With Regards,
Amit Kapila.

#512Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#511)
2 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have fixed all the comments except

..

3. +# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how is this test relevant to streaming mode?

I think we can keep this test in one of the newly added tests, say in
015_stream_simple.pl, to ensure that after a streaming transaction, a
non-streaming one behaves as expected. So we can change the comment to
"Change the local values of the extra columns on the subscriber,
update publisher, and check that subscriber retains the expected
values. This is to ensure that non-streaming transactions behave
properly after a streaming transaction."

We can remove this test from the other two places
016_stream_subxact.pl and 020_stream_binary.pl.

4. Apart from the above, I think we should think of minimizing the
test cases which can be committed with the base patch. We can later
add more tests.

We can combine the tests in 015_stream_simple.pl and
020_stream_binary.pl as I can't see a good reason to keep them
separate. Then, I think we can keep only this part with the main patch
and extract other tests into a separate patch. Basically, we can
commit the basic tests with the main patch and then keep the advanced
tests separately. I am afraid that there are some tests that don't add
much value so we can review them separately.

Fixed

One minor comment: for the option 'streaming = on', the spacing should
be consistent in all the tests.

Similarly, we can combine 017_stream_ddl.pl and 021_stream_schema.pl
as both contain similar tests. As per the above suggestion, this will
be in a separate patch though.

If you agree with the above suggestions then kindly make these
adjustments and send the updated patch.

Done that way.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v59-0002-Additional-test-cases-for-testing-the-streaming-.patch (application/octet-stream)
From 8ff8791e2cc034c9b8845128c11357829bd05b46 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 2 Sep 2020 15:15:49 +0530
Subject: [PATCH v59 2/2] Additional test cases for testing the streaming mode

---
 src/test/subscription/t/016_stream_subxact.pl      |  93 +++++++++++++++++
 src/test/subscription/t/017_stream_ddl.pl          | 110 +++++++++++++++++++++
 .../subscription/t/018_stream_subxact_abort.pl     |  96 ++++++++++++++++++
 .../subscription/t/019_stream_subxact_ddl_abort.pl |  76 ++++++++++++++
 4 files changed, 375 insertions(+)
 create mode 100644 src/test/subscription/t/016_stream_subxact.pl
 create mode 100644 src/test/subscription/t/017_stream_ddl.pl
 create mode 100644 src/test/subscription/t/018_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/019_stream_subxact_ddl_abort.pl

diff --git a/src/test/subscription/t/016_stream_subxact.pl b/src/test/subscription/t/016_stream_subxact.pl
new file mode 100644
index 0000000..a40e29f
--- /dev/null
+++ b/src/test/subscription/t/016_stream_subxact.pl
@@ -0,0 +1,93 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/017_stream_ddl.pl b/src/test/subscription/t/017_stream_ddl.pl
new file mode 100644
index 0000000..23989a4
--- /dev/null
+++ b/src/test/subscription/t/017_stream_ddl.pl
@@ -0,0 +1,110 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT, f INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check extra columns contain local defaults');
+
+# A large (streamed) transaction with DDL and DML. One of the DDL statements
+# is performed after the DML to ensure that we invalidate the schema sent for
+# test_tab, so that the next transaction has to send the schema again.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(2003,5000) s(i);
+ALTER TABLE test_tab ADD COLUMN f INT;
+COMMIT;
+});
+
+# A small transaction that won't get streamed. This is just to ensure that we
+# send the schema again to reflect the last column added in the previous test.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i, 4*i FROM generate_series(5001,5005) s(i);
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e), count(f) FROM test_tab");
+is($result, qq(5005|5002|4005|3004|5), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/018_stream_subxact_abort.pl b/src/test/subscription/t/018_stream_subxact_abort.pl
new file mode 100644
index 0000000..318644f
--- /dev/null
+++ b/src/test/subscription/t/018_stream_subxact_abort.pl
@@ -0,0 +1,96 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|0), 'check extra columns contain local defaults');
+
+# large (streamed) transaction with subscriber receiving out of order
+# subtransaction ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3001,3500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(4001,4500) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(5001,5500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(6001,6500) s(i);
+RELEASE s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(7001,7500) s(i);
+ROLLBACK TO s1;
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1500|0), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/019_stream_subxact_ddl_abort.pl b/src/test/subscription/t/019_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000..1dcdbc0
--- /dev/null
+++ b/src/test/subscription/t/019_stream_subxact_ddl_abort.pl
@@ -0,0 +1,76 @@
+# Test streaming of large transaction with subtransactions, DDLs, DMLs, and
+# rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
1.8.3.1

v59-0001-Add-support-for-streaming-to-built-in-logical-re.patch (application/octet-stream)
From 4f46c98fbe18b9b10bca851bcd54b151663b4824 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 1 Sep 2020 19:19:59 +0530
Subject: [PATCH v59 1/2] Add support for streaming to built-in logical
 replication.

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, to identify in-progress
transactions, and allow adding additional bits of information (e.g.
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit.

We however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so there
would be nowhere to send the data anyway.

Author: Tomas Vondra, Dilip Kumar and Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/monitoring.sgml                       |  16 +
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            |  46 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 162 +++-
 src/backend/replication/logical/worker.c           | 951 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 366 +++++++-
 src/bin/pg_dump/pg_dump.c                          |  18 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  10 +-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  42 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/regress/expected/subscription.out         |  63 +-
 src/test/regress/sql/subscription.sql              |  15 +
 src/test/subscription/t/015_stream.pl              |  97 +++
 src/tools/pgindent/typedefs.list                   |   3 +
 22 files changed, 1762 insertions(+), 73 deletions(-)
 create mode 100644 src/test/subscription/t/015_stream.pl

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d973e11..673a0e7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1509,6 +1509,22 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>WALWrite</literal></entry>
       <entry>Waiting for a write to a WAL file.</entry>
      </row>
+     <row>
+      <entry><literal>LogicalChangesRead</literal></entry>
+      <entry>Waiting for a read from a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalChangesWrite</literal></entry>
+      <entry>Waiting for a write to a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactRead</literal></entry>
+      <entry>Waiting for a read from a logical subxact file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactWrite</literal></entry>
+      <entry>Waiting for a write to a logical subxact file.</entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
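
While a streamed transaction is being applied, the subscriber's apply worker
reads and writes the spill files, so the new wait events can show up there. A
hypothetical way to sample them from a TAP test (not part of the patch; the
events only appear while the worker is actually touching the files):

my $events = $node_subscriber->safe_psql('postgres',
	"SELECT wait_event FROM pg_stat_activity"
	. " WHERE wait_event IN ('LogicalChangesRead', 'LogicalChangesWrite',"
	. " 'LogicalSubxactRead', 'LogicalSubxactWrite')");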
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a1666b3 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
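
For illustration, the new option as used in the TAP tests below, plus a later
toggle via ALTER SUBSCRIPTION (which the alter_subscription.sgml change above
permits):

$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr'"
	. " PUBLICATION tap_pub WITH (streaming = on)");
# streaming can also be turned off again later
$node_subscriber->safe_psql('postgres',
	"ALTER SUBSCRIPTION tap_sub SET (streaming = off)");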
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a2d6130..ed4f3f1 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1128,7 +1128,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..9426e1d 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,11 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+	{
+		*streaming_given = false;
+		*streaming = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +200,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +353,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +378,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -427,6 +446,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
+	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -698,6 +718,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +729,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +762,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +786,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +831,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL, /* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23..5f4b168 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "LogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "LogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "LogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "LogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 8afa5a2..ad57409 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -425,6 +425,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
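
With this change, the START_REPLICATION command the apply worker sends for a
streaming subscription against a v14-or-later publisher ends up looking roughly
like the following (illustrative only; the slot name, publication name, and
proto_version value here are placeholders, not taken from the patch):

my $cmd = q{START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 }
	. q{(proto_version '2', streaming 'on', publication_names '"tap_pub"')};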
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..f82236e 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,126 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+/*
+ * Write the information for the start stream message to the output stream.
+ */
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+/*
+ * Read the information about the start stream message from the output stream.
+ */
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+/*
+ * Write the stop stream message to the output stream.
+ */
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+/*
+ * Write STREAM COMMIT to the output stream.
+ */
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read STREAM COMMIT from the output stream.
+ */
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+/*
+ * Write STREAM ABORT to the output stream. Note that xid and subxid will be
+ * the same for a top-level transaction abort.
+ */
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+/*
+ * Read STREAM ABORT from the output stream.
+ */
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
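
A note for reviewers on the wire framing above: every new streaming message is a
one-byte action code followed by big-endian integers, which is what
pq_sendint32/pq_sendint64 emit. Here is a minimal standalone sketch of a decoder
for the four new messages -- it is not part of the patch, the helper names are
invented, and it only exists to make the field layout easy to verify:

#include <stdint.h>
#include <stdio.h>

/* read a big-endian uint32, as written by pq_sendint32() */
static uint32_t
read_be32(const unsigned char *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
		((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* decode one streaming message from a raw buffer (illustration only) */
static void
decode_stream_message(const unsigned char *buf)
{
	switch (buf[0])
	{
		case 'S':	/* STREAM START: int32 xid, byte first_segment */
			printf("STREAM START xid=%u first=%d\n",
				   (unsigned) read_be32(buf + 1), buf[5]);
			break;
		case 'E':	/* STREAM END: no payload */
			printf("STREAM END\n");
			break;
		case 'c':	/* STREAM COMMIT: int32 xid, byte flags, 3x int64 */
			printf("STREAM COMMIT xid=%u flags=%d\n",
				   (unsigned) read_be32(buf + 1), buf[5]);
			break;
		case 'A':	/* STREAM ABORT: int32 xid, int32 subxid */
			printf("STREAM ABORT xid=%u subxid=%u\n",
				   (unsigned) read_be32(buf + 1),
				   (unsigned) read_be32(buf + 5));
			break;
	}
}

int
main(void)
{
	/* a STREAM START message for xid 1234, first segment */
	const unsigned char msg[] = {'S', 0x00, 0x00, 0x04, 0xD2, 0x01};

	decode_stream_message(msg);
	return 0;
}

Running it prints "STREAM START xid=1234 first=1".
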
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b576e34..f022b81 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,45 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead, the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires us to handle aborts of both the toplevel transaction and of
+ * subtransactions. This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing a
+ * remote transaction with the same XID don't interfere with each other.
+ *
+ * We use BufFiles instead of normal temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size limit,
+ * (b) it provides a way for automatic cleanup on error, and (c) it allows the
+ * files to survive across local transactions, so that we can open and close
+ * them at stream start and stop. We decided to use the SharedFileSet
+ * infrastructure because without it the files are deleted as soon as they are
+ * closed, and if we instead kept the stream files open across start/stop
+ * streams it would consume a lot of memory (more than 8kB for each BufFile,
+ * and there could be multiple such BufFiles, as the subscriber could receive
+ * start/stop streams for different transactions before getting the commit).
+ * Moreover, if we didn't use SharedFileSet we would also need to invent a new
+ * way to pass filenames to the BufFile APIs, so that we could reopen the
+ * desired file across multiple stream-open calls for the same transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +67,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +99,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +109,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +138,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry. Whenever we see a new xid we create this entry in
+ * the xidhash, and along with it create the streaming file and store the
+ * fileset handle. The subxact file is created iff there is any subxact info
+ * under this xid. This entry is used on subsequent streams for the xid to
+ * get the corresponding fileset handles, so storing them in a hash makes
+ * the search faster.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +166,65 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with shared file
+ * set for streaming and subxact files.
+ */
+static HTAB *xidhash = NULL;
+/* BufFile handle of the current streaming file */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+/* Sub-transaction data for the current streaming transaction */
+typedef struct ApplySubXactData
+{
+	uint32		nsubxacts;		/* number of sub-transactions */
+	uint32		nsubxacts_max;	/* current capacity of subxacts */
+	TransactionId subxact_last; /* xid of the last sub-transaction */
+	SubXactInfo *subxacts;		/* sub-xact offset in changes file */
+} ApplySubXactData;
+
+static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +296,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of a streamed transaction), we
+ * simply redirect the change to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received the XID of the subxact as the first part of
+	 * the message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +757,336 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * An ORIGIN message can only come inside a streamed transaction or
+	 * inside a remote transaction, and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if (!in_streamed_transaction &&
+		(!in_remote_transaction ||
+		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be committed
+	 * on stream stop. We need the transaction for handling the BufFile used
+	 * for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/*
+	 * Initialize the xidhash table if we haven't done so yet. This will be
+	 * used for the entire duration of the apply worker, so create it in a
+	 * permanent context.
+	 */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing subxact file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with the serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * We can't use binary search here, as subxact XIDs won't necessarily
+		 * arrive in sorted order; consider the case where we have released
+		 * the savepoint for multiple subtransactions and then performed a
+		 * rollback to a savepoint for one of the earlier sub-transactions.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		for (i = subxact_data.nsubxacts; i > 0; i--)
+		{
+			if (subxact_data.subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxact_data.subxacts[subidx].fileno,
+							  subxact_data.subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		subxact_data.nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	/*
+	 * Allocate the file handle and the memory required to process all the
+	 * messages in TopTransactionContext, to avoid them getting freed when
+	 * the per-message context is reset after each message is processed.
+	 */
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file \"%s\"", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file \"%s\"",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1099,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1117,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1156,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1274,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1426,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1799,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1940,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2068,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when streaming mode is
+	 * enabled. It is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2180,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1938,6 +2444,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->name, MySubscription->name) != 0 ||
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
+		newsub->stream != MySubscription->stream ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -1979,6 +2486,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have created the entry for this top-level xact by now */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists then delete it.
+	 */
+	if (subxact_data.nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			SharedFileSetDeleteAll(ent->subxact_fileset);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+		return;
+	}
+
+	subxact_filename(path, subid, xid);
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open the
+	 * existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &subxact_data.nsubxacts, sizeof(subxact_data.nsubxacts));
+	BufFileWrite(fd, subxact_data.subxacts, len);
+
+	BufFileClose(fd);
+
+	/* free the memory allocated for subxact info */
+	cleanup_subxact_info();
+}
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the structure subxact_data that can be
+ * used later.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxact_data.subxacts);
+	Assert(subxact_data.nsubxacts == 0);
+	Assert(subxact_data.nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &subxact_data.nsubxacts,
+					sizeof(subxact_data.nsubxacts)) !=
+		sizeof(subxact_data.nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	subxact_data.nsubxacts_max = 1 << my_log2(subxact_data.nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context. We need
+	 * this information for the whole duration of the stream, so that we can
+	 * add subtransaction info to it. On stream stop we will flush this
+	 * information to the subxact file and reset the logical streaming context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxact_data.subxacts = palloc(subxact_data.nsubxacts_max *
+								   sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxact_data.subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	SubXactInfo *subxacts = subxact_data.subxacts;
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as we've already seen in
+	 * the last call, so make sure to ignore it (this change comes later).
+	 */
+	if (subxact_data.subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_data.subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts. We
+	 * intentionally scan the array from the tail, because we're likely adding
+	 * a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = subxact_data.nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (subxact_data.nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		subxact_data.nsubxacts_max = 128;
+
+		/*
+		 * Allocate this memory for subxacts in per-stream context, see
+		 * subxact_info_read.
+		 */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (subxact_data.nsubxacts == subxact_data.nsubxacts_max)
+	{
+		subxact_data.nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts,
+							subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[subxact_data.nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as offset of
+	 * this subxact.
+	 */
+	BufFileTell(stream_fd,
+				&subxacts[subxact_data.nsubxacts].fileno,
+				&subxacts[subxact_data.nsubxacts].offset);
+
+	subxact_data.nsubxacts++;
+	subxact_data.subxacts = subxacts;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	SharedFileSetDeleteAll(ent->stream_fileset);
+	pfree(ent->stream_fileset);
+	ent->stream_fileset = NULL;
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		SharedFileSetDeleteAll(ent->subxact_fileset);
+		pfree(ent->subxact_fileset);
+		ent->subxact_fileset = NULL;
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open a file that we'll use to serialize changes for a toplevel
+ * transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buffile, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * Create/open the BufFiles under the logical streaming context so that
+	 * we have those files until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to
+		 * the changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: length (not including the
+ * length field itself), action code (identifying the message type), and the
+ * message contents (without the subxact TransactionId value).
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxact_data.subxacts)
+		pfree(subxact_data.subxacts);
+
+	subxact_data.subxacts = NULL;
+	subxact_data.subxact_last = InvalidTransactionId;
+	subxact_data.nsubxacts = 0;
+	subxact_data.nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2151,6 +3091,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
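
A note that may help when reviewing stream_write_change() and the replay loop
in apply_handle_stream_commit() above: the spool file is simply a sequence of
[int len][char action][payload] records, where len counts the action byte plus
the payload but not the length field itself, written in native byte order
through BufFileWrite. A standalone sketch of a matching reader -- not part of
the patch, with file handling and names invented for illustration:

#include <stdio.h>
#include <stdlib.h>

/*
 * Replay records of the form [int len][char action][payload], as written
 * by stream_write_change(). Sketch only; minimal error handling.
 */
static int
replay_spool_file(FILE *fp)
{
	int			nchanges = 0;
	int			len;

	while (fread(&len, sizeof(len), 1, fp) == 1)
	{
		/* len includes the action byte, but not the length field itself */
		char	   *buf = malloc(len);

		if (buf == NULL || fread(buf, 1, len, fp) != (size_t) len)
		{
			free(buf);
			break;				/* out of memory or truncated record */
		}

		/* buf[0] is the action ('I', 'U', 'D', ...); the rest is payload */
		printf("record %d: action '%c', %d payload bytes\n",
			   ++nchanges, buf[0], len - 1);
		free(buf);
	}
	return nchanges;
}

int
main(int argc, char **argv)
{
	FILE	   *fp;

	if (argc < 2 || (fp = fopen(argv[1], "rb")) == NULL)
		return 1;
	printf("replayed %d changes\n", replay_spool_file(fp));
	fclose(fp);
	return 0;
}

Note that the subxact XID is stripped before serialization (it is consumed by
handle_streamed_transaction first), which is why a reader of the spool file
does not need to know about it.
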
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..558b52f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,17 +47,40 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream is, however, updated only at commit
+ * time, and with streamed transactions the commit order may differ from the
+ * order in which the transactions are sent. Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each (sub)transaction
+ * so that we don't lose the schema information on abort. To handle this, we
+ * maintain a list of xids (streamed_txns) for which we have already sent the
+ * schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
@@ -70,6 +93,8 @@ typedef struct RelationSyncEntry
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,10 +120,15 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
 
 /*
  * Specify output plugin callbacks
@@ -115,16 +145,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +222,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			*enable_streaming = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +244,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +268,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +289,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +320,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming during slot initialization. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +383,47 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool	schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's the top-level transaction or not (we have already sent
+	 * that XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order
+	 * that we don't know at this point.
+	 *
+	 * XXX There is scope for optimization here. Currently, we always send
+	 * the schema the first time in a streaming transaction, but we could
+	 * probably avoid that by checking the 'relentry->schema_sent' flag.
+	 * However, before doing that we need to study its impact on the case
+	 * where we have a mix of streaming and non-streaming transactions.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +439,24 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +480,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * This is called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,10 +501,19 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId	xid = InvalidTransactionId;
 
 	if (!is_publishable_relation(relation))
 		return;
 
+	/*
+	 * Remember the xid for the change in streaming mode. We need to send the
+	 * xid with each change in streaming mode so that the subscriber can
+	 * associate the change with the right (sub)transaction and, on abort,
+	 * discard the corresponding changes.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
 	relentry = get_rel_sync_entry(data, RelationGetRelid(relation));
 
 	/* First check the table filter */
@@ -406,7 +538,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +558,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +583,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +604,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +630,11 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId	xid = InvalidTransactionId;
+
+	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +663,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +744,118 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * START STREAM callback
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char	   *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * STOP STREAM callback
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside the streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside the streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +892,39 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * Check whether the schema of the given relation has already been sent in
+ * the current streamed transaction. We expect a relatively small number of
+ * streamed transactions, so a simple linear scan of the list is enough.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == (uint32) lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid in the rel sync entry for which we have already sent the schema
+ * of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext	oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1054,59 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NULL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction is committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+	ListCell   *lc;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		/*
+		 * We can set the schema_sent flag for an entry that has the committed
+		 * xid in its list, as that ensures the subscriber already has the
+		 * corresponding schema, and we don't need to send it again unless
+		 * there is an invalidation for that relation.
+		 */
+		foreach(lc, entry->streamed_txns)
+		{
+			if (xid == (uint32) lfirst_int(lc))
+			{
+				if (is_commit)
+					entry->schema_sent = true;
+
+				entry->streamed_txns =
+					foreach_delete_current(entry->streamed_txns, lc);
+				break;
+			}
+		}
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1141,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NULL;
+	}
 }
 
 /*
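
To summarize the schema-tracking bookkeeping added to pgoutput.c: each
RelationSyncEntry keeps a list of toplevel XIDs for which the schema has
already been sent inside a streamed transaction, and on stream commit/abort
the XID is removed from that list (with schema_sent set to true on commit,
since by then the subscriber has applied the schema records). A compact
standalone model of that logic -- array-based and with invented names, rather
than the actual List API:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_STREAMED_TXNS 16

/* mirrors the relevant fields of RelationSyncEntry */
typedef struct SyncEntry
{
	bool		schema_sent;	/* schema known to be on the subscriber */
	uint32_t	streamed_txns[MAX_STREAMED_TXNS];	/* xids with schema sent */
	int			nstreamed;
} SyncEntry;

static bool
schema_sent_in_streamed_txn(SyncEntry *e, uint32_t xid)
{
	for (int i = 0; i < e->nstreamed; i++)
		if (e->streamed_txns[i] == xid)
			return true;
	return false;
}

/*
 * On commit/abort of streamed transaction "xid": drop it from the list;
 * on commit the subscriber has applied the schema, so set schema_sent.
 */
static void
cleanup_streamed_txn(SyncEntry *e, uint32_t xid, bool is_commit)
{
	for (int i = 0; i < e->nstreamed; i++)
	{
		if (e->streamed_txns[i] == xid)
		{
			if (is_commit)
				e->schema_sent = true;
			e->streamed_txns[i] = e->streamed_txns[--e->nstreamed];
			break;
		}
	}
}

int
main(void)
{
	SyncEntry	e = {false, {0}, 0};

	e.streamed_txns[e.nstreamed++] = 501;	/* schema sent in streamed xid 501 */
	printf("sent in 501? %d\n", schema_sent_in_streamed_txn(&e, 501));
	cleanup_streamed_txn(&e, 501, true);	/* xid 501 commits */
	printf("schema_sent=%d, nstreamed=%d\n", e.schema_sent, e.nstreamed);
	return 0;
}

The real code uses a List in CacheMemoryContext and a linear scan, which is
fine given the expected small number of concurrently streamed transactions.
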
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 2cb3f9b..d3ca54e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4202,6 +4202,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4241,10 +4242,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4264,6 +4272,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4287,6 +4296,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4358,6 +4369,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 2f051b8..30602d5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char	   *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index d81f157..03e3ec0 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5963,7 +5963,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false};
+	false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -5989,11 +5989,13 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode is only supported in v14 and higher */
+		/* Binary mode and streaming are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
-							  ", subbinary AS \"%s\"\n",
-							  gettext_noop("Binary"));
+							  ", subbinary AS \"%s\"\n"
+							  ", substream AS \"%s\"\n",
+							  gettext_noop("Binary"),
+							  gettext_noop("Streaming"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..1d09154 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;                 /* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..53905ee 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
 
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..6c0a4e3 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;		/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index d71db0d..2fa9bce 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                      List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off                | dbname=regress_doesnotexist
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                          List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off                | dbname=regress_doesnotexist2
+                                                                List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                            List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | local              | dbname=regress_doesnotexist2
+                                                                  List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,42 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                      List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off                | dbname=regress_doesnotexist
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                      List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off                | dbname=regress_doesnotexist
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - streaming must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = foo);
+ERROR:  streaming requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index eeb2ec0..14fa0b2 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -132,6 +132,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail - streaming must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
new file mode 100644
index 0000000..7f7f5c7
--- /dev/null
+++ b/src/test/subscription/t/015_stream.pl
@@ -0,0 +1,97 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 4;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Test the streaming in binary mode
+$node_subscriber->safe_psql('postgres',
+"ALTER SUBSCRIPTION tap_sub SET (binary = on)"
+);
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(5001, 10000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(8334|8334|8334), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values. This is to ensure that non-streaming transactions behave
+# properly after a streaming transaction.
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(8334|8334|8334), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d99046..500623e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -111,6 +111,7 @@ Append
 AppendPath
 AppendRelInfo
 AppendState
+ApplySubXactData
 Archive
 ArchiveEntryPtrType
 ArchiveFormat
@@ -2370,6 +2371,7 @@ StopList
 StopWorkersData
 StrategyNumber
 StreamCtl
+StreamXidHash
 StringInfo
 StringInfoData
 StripnullState
@@ -2380,6 +2382,7 @@ SubPlanState
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
 SubXactEvent
+SubXactInfo
 SubplanResultRelHashElem
 SubqueryScan
-- 
1.8.3.1

#513Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#510)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

In functions cleanup_rel_sync_cache and
get_schema_sent_in_streamed_txn, let's cast the result of lfirst_int to
uint32 as suggested by Tom [1]. Also, let's keep the way we compare
xids consistent in both functions, i.e., if (xid == lfirst_int(lc)).

Fixed this in the attached patch.

The behavior tested by the test case added for this is not clear,
primarily because of the comments.

+++ b/src/test/subscription/t/021_stream_schema.pl
@@ -0,0 +1,80 @@
+# Test behavior with streaming transaction exceeding logical_decoding_work_mem
...
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM
generate_series(3,3000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+COMMIT;
+});
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM
generate_series(3001,3005) s(i);
+COMMIT;
+});
+wait_for_caught_up($node_publisher, $appname);

I understand how this test exercises the schema_sent functionality,
but neither the comments atop the file nor those atop the test case
explain it clearly.

Added comments for this test.

Few more comments:

2.
009_stream_simple.pl
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});

How far above the 64kB limit is this data? I just want to be sure it
is not borderline, so that alignment differences can't prevent the
streaming from happening on some machines. Also, how does such a test
ensure that the streaming has actually happened? The way we are
checking results, wouldn't they be the same for the non-streaming case
as well?

Only for this case, or you mean for all the tests?

I have not done this yet.

Most of the test cases generate above 100kB and a few are around
72kB. Please find the per-test data sizes below (a rough way to
measure them is sketched after the list).

015 - 200kB
016 - 150kB
017 - 72kB
018 - 72kB before the first rollback to savepoint, ~100kB in total
019 - 76kB before the first rollback to savepoint, ~100kB in total
020 - 150kB
021 - 100kB
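
(As a rough way to measure this, sketched here only for illustration:
decode the workload with a throwaway test_decoding slot and sum the
size of the decoded text. This approximates the decoded change volume
rather than the exact reorderbuffer memory accounting, but it is enough
to confirm a test sits comfortably above 64kB. The slot name
'size_check' is made up for this sketch.)

# Create a temporary test_decoding slot on the publisher.
$node_publisher->safe_psql('postgres',
	"SELECT pg_create_logical_replication_slot('size_check', 'test_decoding')");

# Run the workload whose decoded size we want to estimate.
$node_publisher->safe_psql('postgres', q{
BEGIN;
INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
COMMIT;
});

# Peek at (not consume) the decoded changes and sum their textual size.
my $decoded_size = $node_publisher->safe_psql('postgres',
	"SELECT pg_size_pretty(sum(octet_length(data))) FROM pg_logical_slot_peek_changes('size_check', NULL, NULL)");
note "approximate decoded size: $decoded_size";

$node_publisher->safe_psql('postgres',
	"SELECT pg_drop_replication_slot('size_check')");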

It is better to do it for all tests; I clarified this in my next
email sent yesterday [2], where I raised a few more comments as well.
I hope you have not missed that email.

I saw it; I think I replied to this before seeing that email.

3.
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how is this test relevant to streaming mode?

I agree, it is not specific to streaming.

I think we can leave this as of now. After committing the stats
patches by Sawada-San and Ajin, we might be able to improve this test.

Makes sense to me.
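
Once those stats patches are in, the test could assert streaming
directly. A sketch of what that might look like, assuming the per-slot
stats end up exposed in a view such as pg_stat_replication_slots with a
stream_txns counter (the view and column names are assumptions at this
point):

# Hypothetical future check: confirm the transaction was actually
# streamed, not decoded wholly at commit (names are assumptions).
my $streamed = $node_publisher->safe_psql('postgres',
	"SELECT stream_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'tap_sub'");
is($streamed, 't', 'transaction was streamed to the subscriber');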

+sub wait_for_caught_up
+{
+ my ($node, $appname) = @_;
+
+ $node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication
WHERE application_name = '$appname';"
+ ) or die "Timed ou

The patch has added this in all the test files. If it is used in so
many tests then we need to add it in some generic place
(PostgresNode.pm), but actually I am not sure we need it at all. Why
can't the existing wait_for_catchup in PostgresNode.pm serve the same
purpose?
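
With the existing helper, each test would reduce to a single call,
e.g.:

# Wait until the replication connection identified by application_name
# has confirmed replay up to the publisher's current WAL location.
$node_publisher->wait_for_catchup($appname);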

Changed as per this suggestion.

Okay.

2.
In system_views.sql,

-- All columns of pg_subscription except subconninfo are readable.
REVOKE ALL ON pg_subscription FROM public;
GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary,
subslotname, subpublications)
ON pg_subscription TO public;

Here, we need to update for substream column as well.

Fixed.

LGTM

3. Update describeSubscriptions() to show the 'substream' value in \dRs.

4. Also, let's add a few tests in subscription.sql, as we did for the
'binary' option in commit 9de77b5453.

Fixed both the above comments.

Ok

5. I think we can merge the pg_dump-related changes (the last version
posted in the mail thread is v53-0005-Add-streaming-option-in-pg_dump)
into the main patch. One minor comment on the pg_dump changes:
@@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
if (strcmp(subinfo->subbinary, "t") == 0)
appendPQExpBuffer(query, ", binary = true");

+ if (strcmp(subinfo->substream, "f") != 0)
+ appendPQExpBuffer(query, ", streaming = on");
if (strcmp(subinfo->subsynccommit, "off") != 0)
appendPQExpBuffer(query, ", synchronous_commit = %s",
fmtId(subinfo->subsynccommit));

Keep one blank line between the substream and subsynccommit option
code to keep it consistent with the nearby code.

Changed as per this suggestion.

Ok

I have fixed all the comments except the ones below.
1. Verify the size of the various tests to ensure that it is above
logical_decoding_work_mem.
2. I have checked that one of the previous patches has a test,
v53-0004-Add-TAP-test-for-streaming-vs.-DDL, which contains a test case
quite similar to what we have in
v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
If there is any difference that covers more scenarios, can we
consider merging them into one test?
3. +# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how is this test relevant to streaming mode?
4. Apart from the above, I think we should minimize the set of test
cases committed with the base patch. We can add more tests later.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#514Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#512)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Sep 2, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

We can combine the tests in 015_stream_simple.pl and
020_stream_binary.pl, as I can't see a good reason to keep them
separate. Then, I think we can keep only this part with the main patch
and extract the other tests into a separate patch. Basically, we can
commit the basic tests with the main patch and keep the advanced
tests separate. I am afraid that some tests don't add much value, so
we can review them separately.

Fixed

I have slightly adjusted this test and run pgindent on the patch. I am
planning to push this tomorrow unless you have more comments.

--
With Regards,
Amit Kapila.

Attachments:

v60-0001-Add-support-for-streaming-to-built-in-logical-re.patch (application/octet-stream)
From 6eb262248bcdc4aba021bbade17d97ae1b95d2a5 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 2 Sep 2020 16:39:39 +0530
Subject: [PATCH v60] Add support for streaming to built-in logical
 replication.

To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol to identify in-progress
transactions, and to allow adding additional bits of information (e.g.
the XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying it on commit.

However, we must explicitly disable streaming of in-progress
transactions during replication slot creation, even if the plugin
supports it. We don't need to replicate the changes accumulated during
this phase, and moreover we don't have a replication connection open,
so we have nowhere to send the data anyway.

Author: Tomas Vondra, Dilip Kumar and Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 doc/src/sgml/monitoring.sgml                       |  16 +
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +-
 doc/src/sgml/ref/create_subscription.sgml          |  11 +
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            |  46 +-
 src/backend/postmaster/pgstat.c                    |  12 +
 .../libpqwalreceiver/libpqwalreceiver.c            |   4 +
 src/backend/replication/logical/proto.c            | 162 +++-
 src/backend/replication/logical/worker.c           | 952 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 367 +++++++-
 src/bin/pg_dump/pg_dump.c                          |  18 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  10 +-
 src/include/catalog/pg_subscription.h              |   3 +
 src/include/pgstat.h                               |   6 +-
 src/include/replication/logicalproto.h             |  42 +-
 src/include/replication/walreceiver.h              |   1 +
 src/test/regress/expected/subscription.out         |  63 +-
 src/test/regress/sql/subscription.sql              |  15 +
 src/test/subscription/t/015_stream.pl              |  98 +++
 src/tools/pgindent/typedefs.list                   |   3 +
 22 files changed, 1765 insertions(+), 73 deletions(-)
 create mode 100644 src/test/subscription/t/015_stream.pl

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d973e11..673a0e7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1509,6 +1509,22 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>WALWrite</literal></entry>
       <entry>Waiting for a write to a WAL file.</entry>
      </row>
+     <row>
+      <entry><literal>LogicalChangesRead</literal></entry>
+      <entry>Waiting for a read from a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalChangesWrite</literal></entry>
+      <entry>Waiting for a write to a logical changes file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactRead</literal></entry>
+      <entry>Waiting for a read from a logical subxact file.</entry>
+     </row>
+     <row>
+      <entry><literal>LogicalSubxactWrite</literal></entry>
+      <entry>Waiting for a write to a logical subxact file.</entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 81c4e70..a1666b3 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -165,8 +165,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       <xref linkend="sql-createsubscription"/>.  See there for more
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
-      <literal>synchronous_commit</literal>, and
-      <literal>binary</literal>.
+      <literal>synchronous_commit</literal>,
+      <literal>binary</literal>, and
+      <literal>streaming</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index cdb22c5..b7d7457 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -228,6 +228,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>streaming</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether streaming of in-progress transactions should
+          be enabled for this subscription.  By default, all transactions
+          are fully decoded on the publisher, and only then sent to the
+          subscriber as a whole.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 90bf5cf..311d462 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
+	sub->stream = subform->substream;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a2d6130..ed4f3f1 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1128,7 +1128,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 40b6377..1696454 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -63,7 +63,8 @@ parse_subscription_options(List *options,
 						   bool *copy_data,
 						   char **synchronous_commit,
 						   bool *refresh,
-						   bool *binary_given, bool *binary)
+						   bool *binary_given, bool *binary,
+						   bool *streaming_given, bool *streaming)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -99,6 +100,11 @@ parse_subscription_options(List *options,
 		*binary_given = false;
 		*binary = false;
 	}
+	if (streaming)
+	{
+		*streaming_given = false;
+		*streaming = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -194,6 +200,16 @@ parse_subscription_options(List *options,
 			*binary_given = true;
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0 && streaming)
+		{
+			if (*streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*streaming_given = true;
+			*streaming = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +353,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		enabled_given;
 	bool		enabled;
 	bool		copy_data;
+	bool		streaming;
+	bool		streaming_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -360,7 +378,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &copy_data,
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
-							   &binary_given, &binary);
+							   &binary_given, &binary,
+							   &streaming_given, &streaming);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -427,6 +446,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
+	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -698,6 +718,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				char	   *synchronous_commit;
 				bool		binary_given;
 				bool		binary;
+				bool		streaming_given;
+				bool		streaming;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -707,7 +729,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
-										   &binary_given, &binary);
+										   &binary_given, &binary,
+										   &streaming_given, &streaming);
 
 				if (slotname_given)
 				{
@@ -739,6 +762,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_subbinary - 1] = true;
 				}
 
+				if (streaming_given)
+				{
+					values[Anum_pg_subscription_substream - 1] =
+						BoolGetDatum(streaming);
+					replaces[Anum_pg_subscription_substream - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -756,7 +786,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "copy_data" */
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL,	/* no "binary" */
+										   NULL, NULL); /* no streaming */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -800,8 +831,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
-										   NULL, NULL); /* no "binary" */
-
+										   NULL, NULL,	/* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -843,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &copy_data,
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
-										   NULL, NULL); /* no "binary" */
+										   NULL, NULL,	/* no "binary" */
+										   NULL, NULL); /* no "streaming" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23..5f4b168 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4141,6 +4141,18 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_WRITE:
 			event_name = "WALWrite";
 			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_READ:
+			event_name = "LogicalChangesRead";
+			break;
+		case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+			event_name = "LogicalChangesWrite";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+			event_name = "LogicalSubxactRead";
+			break;
+		case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+			event_name = "LogicalSubxactWrite";
+			break;
 
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 8afa5a2..ad57409 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -425,6 +425,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, "proto_version '%u'",
 						 options->proto.logical.proto_version);
 
+		if (options->proto.logical.streaming &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfo(&cmd, ", streaming 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9ff8097..eb19142 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -138,10 +138,15 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn)
  * Write INSERT to the output stream.
  */
 void
-logicalrep_write_insert(StringInfo out, Relation rel, HeapTuple newtuple, bool binary)
+logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'I');		/* action INSERT */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -177,8 +182,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup)
  * Write UPDATE to the output stream.
  */
 void
-logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
-						HeapTuple newtuple, bool binary)
+logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
 	pq_sendbyte(out, 'U');		/* action UPDATE */
 
@@ -186,6 +191,10 @@ logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -247,7 +256,8 @@ logicalrep_read_update(StringInfo in, bool *has_oldtuple,
  * Write DELETE to the output stream.
  */
 void
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool binary)
+logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
+						HeapTuple oldtuple, bool binary)
 {
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -255,6 +265,10 @@ logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple, bool b
 
 	pq_sendbyte(out, 'D');		/* action DELETE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -295,6 +309,7 @@ logicalrep_read_delete(StringInfo in, LogicalRepTupleData *oldtup)
  */
 void
 logicalrep_write_truncate(StringInfo out,
+						  TransactionId xid,
 						  int nrelids,
 						  Oid relids[],
 						  bool cascade, bool restart_seqs)
@@ -304,6 +319,10 @@ logicalrep_write_truncate(StringInfo out,
 
 	pq_sendbyte(out, 'T');		/* action TRUNCATE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	pq_sendint32(out, nrelids);
 
 	/* encode and send truncate flags */
@@ -346,12 +365,16 @@ logicalrep_read_truncate(StringInfo in,
  * Write relation description to the output stream.
  */
 void
-logicalrep_write_rel(StringInfo out, Relation rel)
+logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
 	pq_sendbyte(out, 'R');		/* sending RELATION */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	/* use Oid as relation identifier */
 	pq_sendint32(out, RelationGetRelid(rel));
 
@@ -396,7 +419,7 @@ logicalrep_read_rel(StringInfo in)
  * This function will always write base type info.
  */
 void
-logicalrep_write_typ(StringInfo out, Oid typoid)
+logicalrep_write_typ(StringInfo out, TransactionId xid, Oid typoid)
 {
 	Oid			basetypoid = getBaseType(typoid);
 	HeapTuple	tup;
@@ -404,6 +427,10 @@ logicalrep_write_typ(StringInfo out, Oid typoid)
 
 	pq_sendbyte(out, 'Y');		/* sending TYPE */
 
+	/* transaction ID (if not valid, we're not streaming) */
+	if (TransactionIdIsValid(xid))
+		pq_sendint32(out, xid);
+
 	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(basetypoid));
 	if (!HeapTupleIsValid(tup))
 		elog(ERROR, "cache lookup failed for type %u", basetypoid);
@@ -720,3 +747,126 @@ logicalrep_read_namespace(StringInfo in)
 
 	return nspname;
 }
+
+/*
+ * Write the information for the start stream message to the output stream.
+ */
+void
+logicalrep_write_stream_start(StringInfo out,
+							  TransactionId xid, bool first_segment)
+{
+	pq_sendbyte(out, 'S');		/* action STREAM START */
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* transaction ID (we're starting to stream, so must be valid) */
+	pq_sendint32(out, xid);
+
+	/* 1 if this is the first streaming segment for this xid */
+	pq_sendbyte(out, first_segment ? 1 : 0);
+}
+
+/*
+ * Read the information about the start stream message from output stream.
+ */
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+	TransactionId xid;
+
+	Assert(first_segment);
+
+	xid = pq_getmsgint(in, 4);
+	*first_segment = (pq_getmsgbyte(in) == 1);
+
+	return xid;
+}
+
+/*
+ * Write the stop stream message to the output stream.
+ */
+void
+logicalrep_write_stream_stop(StringInfo out)
+{
+	pq_sendbyte(out, 'E');		/* action STREAM END */
+}
+
+/*
+ * Write STREAM COMMIT to the output stream.
+ */
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+							   XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'c');		/* action STREAM COMMIT */
+
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* transaction ID */
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field (unused for now) */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read STREAM COMMIT from the output stream.
+ */
+TransactionId
+logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags (unused for now) */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit message", flags);
+
+	/* read fields */
+	commit_data->commit_lsn = pq_getmsgint64(in);
+	commit_data->end_lsn = pq_getmsgint64(in);
+	commit_data->committime = pq_getmsgint64(in);
+
+	return xid;
+}
+
+/*
+ * Write STREAM ABORT to the output stream. Note that xid and subxid will be
+ * same for the top-level transaction abort.
+ */
+void
+logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+							  TransactionId subxid)
+{
+	pq_sendbyte(out, 'A');		/* action STREAM ABORT */
+
+	Assert(TransactionIdIsValid(xid) && TransactionIdIsValid(subxid));
+
+	/* transaction ID */
+	pq_sendint32(out, xid);
+	pq_sendint32(out, subxid);
+}
+
+/*
+ * Read STREAM ABORT from the output stream.
+ */
+void
+logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+							 TransactionId *subxid)
+{
+	Assert(xid && subxid);
+
+	*xid = pq_getmsgint(in, 4);
+	*subxid = pq_getmsgint(in, 4);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b576e34..812aca8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -18,11 +18,45 @@
  *	  This module includes server facing code and shares libpqwalreceiver
  *	  module with walreceiver for providing the libpq specific functionality.
  *
+ *
+ * STREAMED TRANSACTIONS
+ * ---------------------
+ * Streamed transactions (large transactions exceeding a memory limit on the
+ * upstream) are not applied immediately, but instead, the data is written
+ * to temporary files and then applied at once when the final commit arrives.
+ *
+ * Unlike the regular (non-streamed) case, handling streamed transactions
+ * requires handling aborts of both the toplevel transaction and of
+ * subtransactions. This is achieved by tracking offsets for subtransactions,
+ * which are then used to truncate the file with serialized changes.
+ *
+ * The files are placed in the temporary-file directory by default, and the
+ * filenames include both the XID of the toplevel transaction and the OID of
+ * the subscription. This is necessary so that different workers processing
+ * remote transactions with the same XID don't interfere.
+ *
+ * We use BufFiles instead of normal temporary files because (a) the BufFile
+ * infrastructure supports temporary files that exceed the OS file size limit,
+ * (b) it provides automatic cleanup on error, and (c) it allows the files to
+ * survive across local transactions, so they can be opened and closed at
+ * stream start and stop. We decided to use the SharedFileSet
+ * infrastructure because without it the files are deleted when the BufFile is
+ * closed, and if we kept the stream files open across start/stop stream
+ * then it would consume a lot of memory (more than 8K for each BufFile,
+ * and there could be multiple such BufFiles as the subscriber could receive
+ * multiple start/stop streams for different transactions before getting the
+ * commit). Moreover, without SharedFileSet we would also need to invent a
+ * new way to pass filenames to the BufFile APIs so that we could open
+ * the desired file across multiple stream-open calls for the same
+ * transaction.
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/table.h"
 #include "access/tableam.h"
 #include "access/xact.h"
@@ -33,7 +67,9 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "catalog/pg_tablespace.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/execPartition.h"
@@ -63,7 +99,9 @@
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -71,6 +109,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
+#include "utils/dynahash.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -99,9 +138,26 @@ typedef struct SlotErrCallbackArg
 	int			remote_attnum;
 } SlotErrCallbackArg;
 
+/*
+ * Stream xid hash entry. Whenever we see a new xid we create this entry in the
+ * xidhash and along with it create the streaming file and store the fileset handle.
+ * The subxact file is created iff there is any subxact info under this xid. This
+ * entry is used on the subsequent streams for the xid to get the corresponding
+ * fileset handles, so storing them in hash makes the search faster.
+ */
+typedef struct StreamXidHash
+{
+	TransactionId xid;			/* xid is the hash key and must be first */
+	SharedFileSet *stream_fileset;	/* shared file set for stream data */
+	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+} StreamXidHash;
+
 static MemoryContext ApplyMessageContext = NULL;
 MemoryContext ApplyContext = NULL;
 
+/* per stream context for streaming transactions */
+static MemoryContext LogicalStreamingContext = NULL;
+
 WalReceiverConn *wrconn = NULL;
 
 Subscription *MySubscription = NULL;
@@ -110,12 +166,66 @@ bool		MySubscriptionValid = false;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/* fields valid only when processing streamed transaction */
+bool		in_streamed_transaction = false;
+
+static TransactionId stream_xid = InvalidTransactionId;
+
+/*
+ * Hash table for storing the streaming xid information along with shared file
+ * set for streaming and subxact files.
+ */
+static HTAB *xidhash = NULL;
+
+/* BufFile handle of the current streaming file */
+static BufFile *stream_fd = NULL;
+
+typedef struct SubXactInfo
+{
+	TransactionId xid;			/* XID of the subxact */
+	int			fileno;			/* file number in the buffile */
+	off_t		offset;			/* offset in the file */
+} SubXactInfo;
+
+/* Sub-transaction data for the current streaming transaction */
+typedef struct ApplySubXactData
+{
+	uint32		nsubxacts;		/* number of sub-transactions */
+	uint32		nsubxacts_max;	/* current capacity of subxacts */
+	TransactionId subxact_last; /* xid of the last sub-transaction */
+	SubXactInfo *subxacts;		/* sub-xact offset in changes file */
+} ApplySubXactData;
+
+static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+
+static void subxact_filename(char *path, Oid subid, TransactionId xid);
+static void changes_filename(char *path, Oid subid, TransactionId xid);
+
+/*
+ * Information about subtransactions of a given toplevel transaction.
+ */
+static void subxact_info_write(Oid subid, TransactionId xid);
+static void subxact_info_read(Oid subid, TransactionId xid);
+static void subxact_info_add(TransactionId xid);
+static inline void cleanup_subxact_info(void);
+
+/*
+ * Serialize and deserialize changes for a toplevel transaction.
+ */
+static void stream_cleanup_files(Oid subid, TransactionId xid);
+static void stream_open_file(Oid subid, TransactionId xid, bool first);
+static void stream_write_change(char action, StringInfo s);
+static void stream_close_file(void);
+
 static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
 
 static void store_flush_position(XLogRecPtr remote_lsn);
 
 static void maybe_reread_subscription(void);
 
+/* prototype needed because of stream_commit */
+static void apply_dispatch(StringInfo s);
+
 static void apply_handle_insert_internal(ResultRelInfo *relinfo,
 										 EState *estate, TupleTableSlot *remoteslot);
 static void apply_handle_update_internal(ResultRelInfo *relinfo,
@@ -187,6 +297,42 @@ ensure_transaction(void)
 	return true;
 }
 
+/*
+ * Handle streamed transactions.
+ *
+ * If in streaming mode (receiving a block of streamed transaction), we
+ * simply redirect it to a file for the proper toplevel transaction.
+ *
+ * Returns true for streamed transactions, false otherwise (regular mode).
+ */
+static bool
+handle_streamed_transaction(const char action, StringInfo s)
+{
+	TransactionId xid;
+
+	/* not in streaming mode */
+	if (!in_streamed_transaction)
+		return false;
+
+	Assert(stream_fd != NULL);
+	Assert(TransactionIdIsValid(stream_xid));
+
+	/*
+	 * We should have received XID of the subxact as the first part of the
+	 * message, so extract it.
+	 */
+	xid = pq_getmsgint(s, 4);
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Add the new subxact to the array (unless already there). */
+	subxact_info_add(xid);
+
+	/* write the change to the current file */
+	stream_write_change(action, s);
+
+	return true;
+}
 
 /*
  * Executor state preparation for evaluation of constraint expressions,
@@ -612,17 +758,336 @@ static void
 apply_handle_origin(StringInfo s)
 {
 	/*
-	 * ORIGIN message can only come inside remote transaction and before any
-	 * actual writes.
+	 * ORIGIN message can only come inside streaming transaction or inside
+	 * remote transaction and before any actual writes.
 	 */
-	if (!in_remote_transaction ||
-		(IsTransactionState() && !am_tablesync_worker()))
+	if (!in_streamed_transaction &&
+		(!in_remote_transaction ||
+		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
 				 errmsg("ORIGIN message sent out of order")));
 }
 
 /*
+ * Handle STREAM START message.
+ */
+static void
+apply_handle_stream_start(StringInfo s)
+{
+	bool		first_segment;
+	HASHCTL		hash_ctl;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * Start a transaction on stream start; this transaction will be
+	 * committed on stream stop. We need the transaction for handling the
+	 * buffile, used for serializing the streaming data and subxact info.
+	 */
+	ensure_transaction();
+
+	/* notify handle methods we're processing a remote transaction */
+	in_streamed_transaction = true;
+
+	/* extract XID of the top-level transaction */
+	stream_xid = logicalrep_read_stream_start(s, &first_segment);
+
+	/*
+	 * Initialize the xidhash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker, so create it in a permanent
+	 * context.
+	 */
+	if (xidhash == NULL)
+	{
+		hash_ctl.keysize = sizeof(TransactionId);
+		hash_ctl.entrysize = sizeof(StreamXidHash);
+		hash_ctl.hcxt = ApplyContext;
+		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_CONTEXT);
+	}
+
+	/* open the spool file for this transaction */
+	stream_open_file(MyLogicalRepWorker->subid, stream_xid, first_segment);
+
+	/* if this is not the first segment, open existing subxact file */
+	if (!first_segment)
+		subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
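+
+/*
+ * Rough sketch of the message flow for one large transaction, as handled
+ * by the stream handlers in this file:
+ *
+ *	STREAM START -> changes (spooled to file) -> STREAM STOP	(repeated)
+ *	... finally STREAM COMMIT (replay the file) or STREAM ABORT
+ *
+ * Each START/STOP pair runs in its own local transaction, needed only to
+ * manage the buffile holding the serialized changes and subxact info.
+ */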
+
+/*
+ * Handle STREAM STOP message.
+ */
+static void
+apply_handle_stream_stop(StringInfo s)
+{
+	Assert(in_streamed_transaction);
+
+	/*
+	 * Close the file with serialized changes, and serialize information about
+	 * subxacts for the toplevel transaction.
+	 */
+	subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
+	stream_close_file();
+
+	/* We must be in a valid transaction state */
+	Assert(IsTransactionState());
+
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
+
+	in_streamed_transaction = false;
+
+	/* Reset per-stream context */
+	MemoryContextReset(LogicalStreamingContext);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM abort message.
+ */
+static void
+apply_handle_stream_abort(StringInfo s)
+{
+	TransactionId xid;
+	TransactionId subxid;
+
+	Assert(!in_streamed_transaction);
+
+	logicalrep_read_stream_abort(s, &xid, &subxid);
+
+	/*
+	 * If the two XIDs are the same, it's in fact an abort of the toplevel
+	 * xact, so just delete the files with serialized info.
+	 */
+	if (xid == subxid)
+		stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+	else
+	{
+		/*
+		 * OK, so it's a subxact. We need to read the subxact file for the
+		 * toplevel transaction, determine the offset tracked for the subxact,
+		 * and truncate the file with changes. We also remove the subxacts
+		 * with higher offsets (or rather higher XIDs).
+		 *
+		 * We intentionally scan the array from the tail, because we're likely
+		 * aborting a change for the most recent subtransactions.
+		 *
+		 * We can't use binary search here, as subxact XIDs won't necessarily
+		 * arrive in sorted order: consider the case where we have released
+		 * the savepoints for multiple subtransactions and then performed a
+		 * rollback to savepoint for one of the earlier sub-transactions.
+		 */
+
+		int64		i;
+		int64		subidx;
+		BufFile    *fd;
+		bool		found = false;
+		char		path[MAXPGPATH];
+		StreamXidHash *ent;
+
+		subidx = -1;
+		ensure_transaction();
+		subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+		for (i = subxact_data.nsubxacts; i > 0; i--)
+		{
+			if (subxact_data.subxacts[i - 1].xid == subxid)
+			{
+				subidx = (i - 1);
+				found = true;
+				break;
+			}
+		}
+
+		/*
+		 * If it's an empty sub-transaction then we will not find the subxid
+		 * here, so just clean up the subxact info and return.
+		 */
+		if (!found)
+		{
+			/* Cleanup the subxact info */
+			cleanup_subxact_info();
+			CommitTransactionCommand();
+			return;
+		}
+
+		Assert((subidx >= 0) && (subidx < subxact_data.nsubxacts));
+
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+
+		/* open the changes file */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+
+		/* OK, truncate the file at the right offset */
+		BufFileTruncateShared(fd, subxact_data.subxacts[subidx].fileno,
+							  subxact_data.subxacts[subidx].offset);
+		BufFileClose(fd);
+
+		/* discard the subxacts added later */
+		subxact_data.nsubxacts = subidx;
+
+		/* write the updated subxact list */
+		subxact_info_write(MyLogicalRepWorker->subid, xid);
+		CommitTransactionCommand();
+	}
+}
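+
+/*
+ * Worked example (XIDs and offsets made up): with recorded subxacts
+ * [ {xid 501, offset 0}, {xid 502, offset 100}, {xid 503, offset 250} ],
+ * an abort of subxid 502 truncates the changes file at offset 100 and
+ * keeps only the entry for 501, because a rollback to the savepoint of
+ * 502 also aborts every subtransaction started after it.
+ */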
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	LogicalRepCommitData commit_data;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	/*
+	 * Allocate the file handle and memory required to process all the
+	 * messages in TopTransactionContext, so that they don't get reset after
+	 * each message is processed.
+	 */
+	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* open the spool file for the committed transaction */
+	changes_filename(path, MyLogicalRepWorker->subid, xid);
+	elog(DEBUG1, "replaying changes from file \"%s\"", path);
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	Assert(found);
+	fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldcxt);
+
+	remote_final_lsn = commit_data.commit_lsn;
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	nchanges = 0;
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = commit_data.end_lsn;
+	replorigin_session_origin_timestamp = commit_data.committime;
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(commit_data.end_lsn);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle RELATION message.
  *
  * Note we don't do validation against local schema here. The validation
@@ -635,6 +1100,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (handle_streamed_transaction('R', s))
+		return;
+
 	rel = logicalrep_read_rel(s);
 	logicalrep_relmap_update(rel);
 }
@@ -650,6 +1118,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (handle_streamed_transaction('Y', s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -686,6 +1157,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('I', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_insert(s, &newtup);
@@ -801,6 +1275,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('U', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_update(s, &has_oldtup, &oldtup,
@@ -950,6 +1427,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (handle_streamed_transaction('D', s))
+		return;
+
 	ensure_transaction();
 
 	relid = logicalrep_read_delete(s, &oldtup);
@@ -1320,6 +1800,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (handle_streamed_transaction('T', s))
+		return;
+
 	ensure_transaction();
 
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
@@ -1458,6 +1941,22 @@ apply_dispatch(StringInfo s)
 		case 'O':
 			apply_handle_origin(s);
 			break;
+			/* STREAM START */
+		case 'S':
+			apply_handle_stream_start(s);
+			break;
+			/* STREAM END */
+		case 'E':
+			apply_handle_stream_stop(s);
+			break;
+			/* STREAM ABORT */
+		case 'A':
+			apply_handle_stream_abort(s);
+			break;
+			/* STREAM COMMIT */
+		case 'c':
+			apply_handle_stream_commit(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1570,6 +2069,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 												"ApplyMessageContext",
 												ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used for per-stream data when the streaming mode
+	 * is enabled. This context is reset on each stream stop.
+	 */
+	LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
+													"LogicalStreamingContext",
+													ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -1674,7 +2181,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -1938,6 +2445,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->name, MySubscription->name) != 0 ||
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
+		newsub->stream != MySubscription->stream ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -1979,6 +2487,439 @@ subscription_change_cb(Datum arg, int cacheid, uint32 hashvalue)
 	MySubscriptionValid = false;
 }
 
+/*
+ * subxact_info_write
+ *	  Store information about subxacts for a toplevel transaction.
+ *
+ * For each subxact we store the offset of its first change in the main file.
+ * The file is always overwritten as a whole.
+ *
+ * XXX We should only store subxacts that were not aborted yet.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	StreamXidHash *ent;
+	BufFile    *fd;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry for its top transaction by this time */
+	Assert(found);
+
+	/*
+	 * If there are no subtransactions then there is nothing to do, but if a
+	 * subxact file already exists, delete it.
+	 */
+	if (subxact_data.nsubxacts == 0)
+	{
+		if (ent->subxact_fileset)
+		{
+			cleanup_subxact_info();
+			SharedFileSetDeleteAll(ent->subxact_fileset);
+			pfree(ent->subxact_fileset);
+			ent->subxact_fileset = NULL;
+		}
+		return;
+	}
+
+	subxact_filename(path, subid, xid);
+
+	/*
+	 * Create the subxact file if it does not already exist, otherwise open
+	 * the existing file.
+	 */
+	if (ent->subxact_fileset == NULL)
+	{
+		MemoryContext oldctx;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		oldctx = MemoryContextSwitchTo(ApplyContext);
+		ent->subxact_fileset = palloc(sizeof(SharedFileSet));
+		SharedFileSetInit(ent->subxact_fileset, NULL);
+		MemoryContextSwitchTo(oldctx);
+
+		fd = BufFileCreateShared(ent->subxact_fileset, path);
+	}
+	else
+		fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDWR);
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* Write the subxact count and subxact info */
+	BufFileWrite(fd, &subxact_data.nsubxacts, sizeof(subxact_data.nsubxacts));
+	BufFileWrite(fd, subxact_data.subxacts, len);
+
+	BufFileClose(fd);
+
+	/* free the memory allocated for subxact info */
+	cleanup_subxact_info();
+}
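+
+/*
+ * Resulting layout of the subxact file (overwritten as a whole each time):
+ *
+ *	uint32		nsubxacts
+ *	SubXactInfo	subxacts[nsubxacts]		-- each {xid, fileno, offset}
+ */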
+
+/*
+ * subxact_info_read
+ *	  Restore information about subxacts of a streamed transaction.
+ *
+ * Read information about subxacts into the structure subxact_data that can be
+ * used later.
+ */
+static void
+subxact_info_read(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	Size		len;
+	BufFile    *fd;
+	StreamXidHash *ent;
+	MemoryContext oldctx;
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(!subxact_data.subxacts);
+	Assert(subxact_data.nsubxacts == 0);
+	Assert(subxact_data.nsubxacts_max == 0);
+
+	/* Find the stream xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_FIND,
+										&found);
+	/* we must have found the entry created at stream start */
+	Assert(found);
+
+	/*
+	 * If subxact_fileset is not valid, that means we don't have any subxact
+	 * info.
+	 */
+	if (ent->subxact_fileset == NULL)
+		return;
+
+	subxact_filename(path, subid, xid);
+
+	fd = BufFileOpenShared(ent->subxact_fileset, path, O_RDONLY);
+
+	/* read number of subxact items */
+	if (BufFileRead(fd, &subxact_data.nsubxacts,
+					sizeof(subxact_data.nsubxacts)) !=
+		sizeof(subxact_data.nsubxacts))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	len = sizeof(SubXactInfo) * subxact_data.nsubxacts;
+
+	/* we keep the maximum as a power of 2 */
+	subxact_data.nsubxacts_max = 1 << my_log2(subxact_data.nsubxacts);
+
+	/*
+	 * Allocate subxact information in the logical streaming context. We need
+	 * this information for the whole duration of the stream, so that we can
+	 * add new subtransaction info to it. On stream stop we flush this
+	 * information to the subxact file and reset the logical streaming
+	 * context.
+	 */
+	oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+	subxact_data.subxacts = palloc(subxact_data.nsubxacts_max *
+								   sizeof(SubXactInfo));
+	MemoryContextSwitchTo(oldctx);
+
+	if ((len > 0) && ((BufFileRead(fd, subxact_data.subxacts, len)) != len))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
+						path)));
+
+	BufFileClose(fd);
+}
+
+/*
+ * subxact_info_add
+ *	  Add information about a subxact (offset in the main file).
+ */
+static void
+subxact_info_add(TransactionId xid)
+{
+	SubXactInfo *subxacts = subxact_data.subxacts;
+	int64		i;
+
+	/* We must have a valid top level stream xid and a stream fd. */
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/*
+	 * If the XID matches the toplevel transaction, we don't want to add it.
+	 */
+	if (stream_xid == xid)
+		return;
+
+	/*
+	 * In most cases we're checking the same subxact as in the previous call,
+	 * so make sure to ignore it (its first change was recorded already).
+	 */
+	if (subxact_data.subxact_last == xid)
+		return;
+
+	/* OK, remember we're processing this XID. */
+	subxact_data.subxact_last = xid;
+
+	/*
+	 * Check if the transaction is already present in the array of subxacts.
+	 * We intentionally scan the array from the tail, because we're likely
+	 * adding a change for the most recent subtransactions.
+	 *
+	 * XXX Can we rely on the subxact XIDs arriving in sorted order? That
+	 * would allow us to use binary search here.
+	 */
+	for (i = subxact_data.nsubxacts; i > 0; i--)
+	{
+		/* found, so we're done */
+		if (subxacts[i - 1].xid == xid)
+			return;
+	}
+
+	/* This is a new subxact, so we need to add it to the array. */
+	if (subxact_data.nsubxacts == 0)
+	{
+		MemoryContext oldctx;
+
+		subxact_data.nsubxacts_max = 128;
+
+		/*
+		 * Allocate this memory for subxacts in per-stream context, see
+		 * subxact_info_read.
+		 */
+		oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
+		subxacts = palloc(subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+		MemoryContextSwitchTo(oldctx);
+	}
+	else if (subxact_data.nsubxacts == subxact_data.nsubxacts_max)
+	{
+		subxact_data.nsubxacts_max *= 2;
+		subxacts = repalloc(subxacts,
+							subxact_data.nsubxacts_max * sizeof(SubXactInfo));
+	}
+
+	subxacts[subxact_data.nsubxacts].xid = xid;
+
+	/*
+	 * Get the current offset of the stream file and store it as the offset
+	 * of this subxact.
+	 */
+	BufFileTell(stream_fd,
+				&subxacts[subxact_data.nsubxacts].fileno,
+				&subxacts[subxact_data.nsubxacts].offset);
+
+	subxact_data.nsubxacts++;
+	subxact_data.subxacts = subxacts;
+}
+
+/* format filename for file containing the info about subxacts */
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.subxacts", subid, xid);
+}
+
+/* format filename for file containing serialized changes */
+static inline void
+changes_filename(char *path, Oid subid, TransactionId xid)
+{
+	snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);
+}
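+
+/*
+ * For example, subscription OID 16394 streaming transaction 512 would use
+ * "16394-512.subxacts" and "16394-512.changes" within its filesets (the
+ * OID and XID are made up for illustration).
+ */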
+
+/*
+ * stream_cleanup_files
+ *	  Cleanup files for a subscription / toplevel transaction.
+ *
+ * Remove files with serialized changes and subxact info for a particular
+ * toplevel transaction. Each subscription has a separate set of files.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
+{
+	char		path[MAXPGPATH];
+	StreamXidHash *ent;
+
+	/* Remove the xid entry from the stream xid hash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_REMOVE,
+										NULL);
+	/* By this time we must have created the transaction entry */
+	Assert(ent != NULL);
+
+	/* Delete the change file and release the stream fileset memory */
+	changes_filename(path, subid, xid);
+	SharedFileSetDeleteAll(ent->stream_fileset);
+	pfree(ent->stream_fileset);
+	ent->stream_fileset = NULL;
+
+	/* Delete the subxact file and release the memory, if it exists */
+	if (ent->subxact_fileset)
+	{
+		subxact_filename(path, subid, xid);
+		SharedFileSetDeleteAll(ent->subxact_fileset);
+		pfree(ent->subxact_fileset);
+		ent->subxact_fileset = NULL;
+	}
+}
+
+/*
+ * stream_open_file
+ *	  Open a file that we'll use to serialize changes for a toplevel
+ * transaction.
+ *
+ * Open a file for streamed changes from a toplevel transaction identified
+ * by stream_xid (global variable). If it's the first chunk of streamed
+ * changes for this transaction, initialize the shared fileset and create the
+ * buffile, otherwise open the previously created file.
+ *
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_open_file(Oid subid, TransactionId xid, bool first_segment)
+{
+	char		path[MAXPGPATH];
+	bool		found;
+	MemoryContext oldcxt;
+	StreamXidHash *ent;
+
+	Assert(in_streamed_transaction);
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+	Assert(stream_fd == NULL);
+
+	/* create or find the xid entry in the xidhash */
+	ent = (StreamXidHash *) hash_search(xidhash,
+										(void *) &xid,
+										HASH_ENTER,
+										&found);
+	Assert(first_segment || found);
+	changes_filename(path, subid, xid);
+	elog(DEBUG1, "opening file \"%s\" for streamed changes", path);
+
+	/*
+	 * Create/open the buffiles under the logical streaming context, so that
+	 * they remain available until stream stop.
+	 */
+	oldcxt = MemoryContextSwitchTo(LogicalStreamingContext);
+
+	/*
+	 * If this is the first streamed segment, the file must not exist, so make
+	 * sure we're the ones creating it. Otherwise just open the file for
+	 * writing, in append mode.
+	 */
+	if (first_segment)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		/*
+		 * We need to maintain the shared fileset across multiple stream
+		 * start/stop calls, so allocate it in a persistent context.
+		 */
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		stream_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember the fileset for the next stream of the same transaction */
+		ent->xid = xid;
+		ent->stream_fileset = fileset;
+		ent->subxact_fileset = NULL;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to its end, because we always append to the
+		 * changes file.
+		 */
+		stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);
+		BufFileSeek(stream_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * stream_close_file
+ *	  Close the currently open file with streamed changes.
+ *
+ * This can only be called at the end of a streaming block, i.e. at stream_stop
+ * message from the upstream.
+ */
+static void
+stream_close_file(void)
+{
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	BufFileClose(stream_fd);
+
+	stream_xid = InvalidTransactionId;
+	stream_fd = NULL;
+}
+
+/*
+ * stream_write_change
+ *	  Serialize a change to a file for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: length (not including the
+ * length field itself), action code (identifying the message type) and
+ * message contents (without the subxact TransactionId value).
+ */
+static void
+stream_write_change(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(in_streamed_transaction);
+	Assert(TransactionIdIsValid(stream_xid));
+	Assert(stream_fd != NULL);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(stream_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(stream_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(stream_fd, &s->data[s->cursor], len);
+}
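+
+/*
+ * Resulting on-disk record, as read back by apply_handle_stream_commit:
+ *
+ *	int		len			-- sizeof(char) + payload size; excludes len itself
+ *	char	action		-- message type, e.g. 'I', 'U', 'D', 'T', 'R', 'Y'
+ *	char	payload[]	-- message contents, without the subxact XID
+ */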
+
+/*
+ * Cleanup the memory for subxacts and reset the related variables.
+ */
+static inline void
+cleanup_subxact_info(void)
+{
+	if (subxact_data.subxacts)
+		pfree(subxact_data.subxacts);
+
+	subxact_data.subxacts = NULL;
+	subxact_data.subxact_last = InvalidTransactionId;
+	subxact_data.nsubxacts = 0;
+	subxact_data.nsubxacts_max = 0;
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -2151,6 +3092,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
+	options.proto.logical.streaming = MySubscription->stream;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 81ef7dc..c29c088 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,17 +47,40 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn);
+static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn);
+static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr abort_lsn);
+static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+								   ReorderBufferTXN *txn,
+								   XLogRecPtr commit_lsn);
 
 static bool publications_valid;
+static bool in_streaming;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+									LogicalDecodingContext *ctx);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
  *
+ * The schema_sent flag determines if the current schema record was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
+ * The schema cache on the downstream side is, however, updated only at
+ * commit time, and with streamed transactions the commit order may differ
+ * from the order the transactions are sent in. Also, the (sub)transactions
+ * might get aborted, so we need to send the schema for each
+ * (sub)transaction so that we don't lose the schema information on abort.
+ * To handle this, we maintain a list of xids (streamed_txns) for which we
+ * have already sent the schema.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
@@ -70,6 +93,8 @@ typedef struct RelationSyncEntry
 	 * have been sent for this to be true.
 	 */
 	bool		schema_sent;
+	List	   *streamed_txns;	/* streamed toplevel transactions with this
+								 * schema */
 
 	bool		replicate_valid;
 	PublicationActions pubactions;
@@ -95,10 +120,15 @@ typedef struct RelationSyncEntry
 static HTAB *RelationSyncCache = NULL;
 
 static void init_rel_sync_cache(MemoryContext decoding_context);
+static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
+static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
+static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
+											TransactionId xid);
 
 /*
  * Specify output plugin callbacks
@@ -115,16 +145,26 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->commit_cb = pgoutput_commit_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
+
+	/* transaction streaming */
+	cb->stream_start_cb = pgoutput_stream_start;
+	cb->stream_stop_cb = pgoutput_stream_stop;
+	cb->stream_abort_cb = pgoutput_stream_abort;
+	cb->stream_commit_cb = pgoutput_stream_commit;
+	cb->stream_change_cb = pgoutput_change;
+	cb->stream_truncate_cb = pgoutput_truncate;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
-						List **publication_names, bool *binary)
+						List **publication_names, bool *binary,
+						bool *enable_streaming)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
+	bool		streaming_given = false;
 
 	*binary = false;
 
@@ -182,6 +222,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*binary = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "streaming") == 0)
+		{
+			if (streaming_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			streaming_given = true;
+
+			*enable_streaming = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -194,6 +244,7 @@ static void
 pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
+	bool		enable_streaming = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -217,7 +268,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		parse_output_parameters(ctx->output_plugin_options,
 								&data->protocol_version,
 								&data->publication_names,
-								&data->binary);
+								&data->binary,
+								&enable_streaming);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -237,6 +289,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("publication_names parameter missing")));
 
+		/*
+		 * Decide whether to enable streaming. It is disabled by default, in
+		 * which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_streaming)
+			ctx->streaming = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support streaming, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_STREAM_VERSION_NUM)));
+		else if (!ctx->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("streaming requested, but not supported by output plugin")));
+
+		/* Also remember we're currently not streaming any transaction. */
+		in_streaming = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -247,6 +320,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Initialize relation schema cache. */
 		init_rel_sync_cache(CacheMemoryContext);
 	}
+	else
+	{
+		/* Disable streaming while the slot is being initialized. */
+		ctx->streaming = false;
+	}
 }
 
 /*
@@ -305,9 +383,47 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 maybe_send_schema(LogicalDecodingContext *ctx,
+				  ReorderBufferTXN *txn, ReorderBufferChange *change,
 				  Relation relation, RelationSyncEntry *relentry)
 {
-	if (relentry->schema_sent)
+	bool		schema_sent;
+	TransactionId xid = InvalidTransactionId;
+	TransactionId topxid = InvalidTransactionId;
+
+	/*
+	 * Remember the XID of the (sub)transaction for the change. We don't care
+	 * whether it's a top-level transaction or not (we have already sent that
+	 * XID at the start of the current streaming block).
+	 *
+	 * If we're not in a streaming block, just use InvalidTransactionId and
+	 * the write methods will not include it.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
+	if (change->txn->toptxn)
+		topxid = change->txn->toptxn->xid;
+	else
+		topxid = xid;
+
+	/*
+	 * Do we need to send the schema? We do track streamed transactions
+	 * separately, because those may be applied later (and the regular
+	 * transactions won't see their effects until then) and in an order that
+	 * we don't know at this point.
+	 *
+	 * XXX There is scope for optimization here. Currently, we always send
+	 * the schema first time in a streaming transaction but we can probably
+	 * avoid that by checking 'relentry->schema_sent' flag. However, before
+	 * doing that we need to study its impact on the case where we have a mix
+	 * of streaming and non-streaming transactions.
+	 */
+	if (in_streaming)
+		schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		schema_sent = relentry->schema_sent;
+
+	if (schema_sent)
 		return;
 
 	/* If needed, send the ancestor's schema first. */
@@ -323,19 +439,24 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
 											   CreateTupleDescCopy(outdesc));
 		MemoryContextSwitchTo(oldctx);
-		send_relation_and_attrs(ancestor, ctx);
+		send_relation_and_attrs(ancestor, xid, ctx);
 		RelationClose(ancestor);
 	}
 
-	send_relation_and_attrs(relation, ctx);
-	relentry->schema_sent = true;
+	send_relation_and_attrs(relation, xid, ctx);
+
+	if (in_streaming)
+		set_schema_sent_in_streamed_txn(relentry, topxid);
+	else
+		relentry->schema_sent = true;
 }
 
 /*
  * Sends a relation
  */
 static void
-send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
+send_relation_and_attrs(Relation relation, TransactionId xid,
+						LogicalDecodingContext *ctx)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
 	int			i;
@@ -359,17 +480,19 @@ send_relation_and_attrs(Relation relation, LogicalDecodingContext *ctx)
 			continue;
 
 		OutputPluginPrepareWrite(ctx, false);
-		logicalrep_write_typ(ctx->out, att->atttypid);
+		logicalrep_write_typ(ctx->out, xid, att->atttypid);
 		OutputPluginWrite(ctx, false);
 	}
 
 	OutputPluginPrepareWrite(ctx, false);
-	logicalrep_write_rel(ctx->out, relation);
+	logicalrep_write_rel(ctx->out, xid, relation);
 	OutputPluginWrite(ctx, false);
 }
 
 /*
  * Sends the decoded DML over wire.
+ *
+ * This is called both in streaming and non-streaming modes.
  */
 static void
 pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
@@ -378,10 +501,20 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
+	TransactionId xid = InvalidTransactionId;
 
 	if (!is_publishable_relation(relation))
 		return;
 
+	/*
+	 * Remember the xid for the change in streaming mode. We need to send the
+	 * xid with each change in streaming mode so that the subscriber can
+	 * associate the change with the proper (sub)transaction and, on abort,
+	 * discard the corresponding changes.
+	 */
+	if (in_streaming)
+		xid = change->txn->xid;
+
 	relentry = get_rel_sync_entry(data, RelationGetRelid(relation));
 
 	/* First check the table filter */
@@ -406,7 +539,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
-	maybe_send_schema(ctx, relation, relentry);
+	maybe_send_schema(ctx, txn, change, relation, relentry);
 
 	/* Send the data */
 	switch (change->action)
@@ -426,7 +559,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_insert(ctx->out, relation, tuple,
+				logicalrep_write_insert(ctx->out, xid, relation, tuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
@@ -451,8 +584,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_update(ctx->out, relation, oldtuple, newtuple,
-										data->binary);
+				logicalrep_write_update(ctx->out, xid, relation, oldtuple,
+										newtuple, data->binary);
 				OutputPluginWrite(ctx, true);
 				break;
 			}
@@ -472,7 +605,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				}
 
 				OutputPluginPrepareWrite(ctx, true);
-				logicalrep_write_delete(ctx->out, relation, oldtuple,
+				logicalrep_write_delete(ctx->out, xid, relation, oldtuple,
 										data->binary);
 				OutputPluginWrite(ctx, true);
 			}
@@ -498,6 +631,11 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	int			i;
 	int			nrelids;
 	Oid		   *relids;
+	TransactionId xid = InvalidTransactionId;
+
+	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
+	if (in_streaming)
+		xid = change->txn->xid;
 
 	old = MemoryContextSwitchTo(data->context);
 
@@ -526,13 +664,14 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			continue;
 
 		relids[nrelids++] = relid;
-		maybe_send_schema(ctx, relation, relentry);
+		maybe_send_schema(ctx, txn, change, relation, relentry);
 	}
 
 	if (nrelids > 0)
 	{
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
+								  xid,
 								  nrelids,
 								  relids,
 								  change->data.truncate.cascade,
@@ -606,6 +745,118 @@ publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * START STREAM callback
+ */
+static void
+pgoutput_stream_start(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	/* we can't nest streaming of transactions */
+	Assert(!in_streaming);
+
+	/*
+	 * If we already sent the first stream for this transaction then don't
+	 * send the origin id in the subsequent streams.
+	 */
+	if (rbtxn_is_streamed(txn))
+		send_replication_origin = false;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
+
+	if (send_replication_origin)
+	{
+		char	   *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		if (replorigin_by_oid(txn->origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+	}
+
+	OutputPluginWrite(ctx, true);
+
+	/* we're streaming a chunk of transaction now */
+	in_streaming = true;
+}
+
+/*
+ * STOP STREAM callback
+ */
+static void
+pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn)
+{
+	/* we should be streaming a transaction */
+	Assert(in_streaming);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_stop(ctx->out);
+	OutputPluginWrite(ctx, true);
+
+	/* we've stopped streaming a transaction */
+	in_streaming = false;
+}
+
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * its subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr abort_lsn)
+{
+	ReorderBufferTXN *toptxn;
+
+	/*
+	 * The abort should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+
+	/* determine the toplevel transaction */
+	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+
+	Assert(rbtxn_is_streamed(toptxn));
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_abort(ctx->out, toptxn->xid, txn->xid);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(toptxn->xid, false);
+}
+
+/*
+ * Notify downstream to apply the streamed transaction (along with all
+ * its subtransactions).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+					   ReorderBufferTXN *txn,
+					   XLogRecPtr commit_lsn)
+{
+	/*
+	 * The commit should happen outside a streaming block, even for streamed
+	 * transactions. The transaction has to be marked as streamed, though.
+	 */
+	Assert(!in_streaming);
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_commit(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+
+	cleanup_rel_sync_cache(txn->xid, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -642,6 +893,39 @@ init_rel_sync_cache(MemoryContext cachectx)
 }
 
 /*
+ * We expect a relatively small number of streamed transactions, so a linear
+ * search of the list is sufficient.
+ */
+static bool
+get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	ListCell   *lc;
+
+	foreach(lc, entry->streamed_txns)
+	{
+		if (xid == (uint32) lfirst_int(lc))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add the xid in the rel sync entry for which we have already sent the schema
+ * of the relation.
+ */
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+	MemoryContext oldctx;
+
+	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+	MemoryContextSwitchTo(oldctx);
+}
+
+/*
  * Find or create entry in the relation schema cache.
  *
  * This looks up publications that the given relation is directly or
@@ -771,12 +1055,59 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	}
 
 	if (!found)
+	{
 		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+	}
 
 	return entry;
 }
 
 /*
+ * Cleanup list of streamed transactions and update the schema_sent flag.
+ *
+ * When a streamed transaction commits or aborts, we need to remove the
+ * toplevel XID from the schema cache. If the transaction aborted, the
+ * subscriber will simply throw away the schema records we streamed, so
+ * we don't need to do anything else.
+ *
+ * If the transaction is committed, the subscriber will update the relation
+ * cache - so tweak the schema_sent flag accordingly.
+ */
+static void
+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+	HASH_SEQ_STATUS hash_seq;
+	RelationSyncEntry *entry;
+	ListCell   *lc;
+
+	Assert(RelationSyncCache != NULL);
+
+	hash_seq_init(&hash_seq, RelationSyncCache);
+	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	{
+		/*
+		 * We can set the schema_sent flag for an entry that has the
+		 * committed xid in its list, as that ensures the subscriber already
+		 * has the corresponding schema and we don't need to send it again
+		 * unless there is an invalidation for that relation.
+		 */
+		foreach(lc, entry->streamed_txns)
+		{
+			if (xid == (uint32) lfirst_int(lc))
+			{
+				if (is_commit)
+					entry->schema_sent = true;
+
+				entry->streamed_txns =
+					foreach_delete_current(entry->streamed_txns, lc);
+				break;
+			}
+		}
+	}
+}
+
+/*
  * Relcache invalidation callback
  */
 static void
@@ -811,7 +1142,11 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	 * Reset schema sent status as the relation definition may have changed.
 	 */
 	if (entry != NULL)
+	{
 		entry->schema_sent = false;
+		list_free(entry->streamed_txns);
+		entry->streamed_txns = NIL;
+	}
 }
 
 /*
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 2cb3f9b..d3ca54e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4202,6 +4202,7 @@ getSubscriptions(Archive *fout)
 	int			i_oid;
 	int			i_subname;
 	int			i_rolname;
+	int			i_substream;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4241,10 +4242,17 @@ getSubscriptions(Archive *fout)
 
 	if (fout->remoteVersion >= 140000)
 		appendPQExpBuffer(query,
-						  " s.subbinary\n");
+						  " s.subbinary,\n");
 	else
 		appendPQExpBuffer(query,
-						  " false AS subbinary\n");
+						  " false AS subbinary,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBuffer(query,
+						  " s.substream\n");
+	else
+		appendPQExpBuffer(query,
+						  " false AS substream\n");
 
 	appendPQExpBuffer(query,
 					  "FROM pg_subscription s\n"
@@ -4264,6 +4272,7 @@ getSubscriptions(Archive *fout)
 	i_subsynccommit = PQfnumber(res, "subsynccommit");
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
+	i_substream = PQfnumber(res, "substream");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4287,6 +4296,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subpublications));
 		subinfo[i].subbinary =
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
+		subinfo[i].substream =
+			pg_strdup(PQgetvalue(res, i, i_substream));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4358,6 +4369,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subbinary, "t") == 0)
 		appendPQExpBuffer(query, ", binary = true");
 
+	if (strcmp(subinfo->substream, "f") != 0)
+		appendPQExpBuffer(query, ", streaming = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 2f051b8..e0b42e8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -626,6 +626,7 @@ typedef struct _SubscriptionInfo
 	char	   *subconninfo;
 	char	   *subslotname;
 	char	   *subbinary;
+	char	   *substream;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 0266fc5..0861d74 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5979,7 +5979,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false};
+	false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6005,11 +6005,13 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode is only supported in v14 and higher */
+		/* Binary mode and streaming are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
-							  ", subbinary AS \"%s\"\n",
-							  gettext_noop("Binary"));
+							  ", subbinary AS \"%s\"\n"
+							  ", substream AS \"%s\"\n",
+							  gettext_noop("Binary"),
+							  gettext_noop("Streaming"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 9795c35..9ebec7b 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -51,6 +51,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 	bool		subbinary;		/* True if the subscription wants the
 								 * publisher to send data in binary */
 
+	bool		substream;		/* Stream in-progress transactions. */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -78,6 +80,7 @@ typedef struct Subscription
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
+	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1..0dfbac4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -982,7 +982,11 @@ typedef enum
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-	WAIT_EVENT_WAL_WRITE
+	WAIT_EVENT_WAL_WRITE,
+	WAIT_EVENT_LOGICAL_CHANGES_READ,
+	WAIT_EVENT_LOGICAL_CHANGES_WRITE,
+	WAIT_EVENT_LOGICAL_SUBXACT_READ,
+	WAIT_EVENT_LOGICAL_SUBXACT_WRITE
 } WaitEventIO;
 
 /* ----------
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 60a76bc..53905ee 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -23,9 +23,13 @@
  * we can support. LOGICALREP_PROTO_MIN_VERSION_NUM is the oldest version we
  * have backwards compatibility for. The client requests protocol version at
  * connect time.
+ *
+ * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
+ * support for streaming large transactions.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_VERSION_NUM 2
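+
+/*
+ * For example (slot and publication names are illustrative), a subscriber
+ * requests streaming at connect time roughly like this:
+ *
+ *	START_REPLICATION SLOT "sub" LOGICAL 0/0
+ *		(proto_version '2', publication_names '"mypub"', streaming 'on')
+ */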
 
 /*
  * This struct stores a tuple received via logical replication.
@@ -98,25 +102,45 @@ extern void logicalrep_read_commit(StringInfo in,
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
-extern void logicalrep_write_insert(StringInfo out, Relation rel,
-									HeapTuple newtuple, bool binary);
+extern void logicalrep_write_insert(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple newtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup);
-extern void logicalrep_write_update(StringInfo out, Relation rel, HeapTuple oldtuple,
+extern void logicalrep_write_update(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
 									HeapTuple newtuple, bool binary);
 extern LogicalRepRelId logicalrep_read_update(StringInfo in,
 											  bool *has_oldtuple, LogicalRepTupleData *oldtup,
 											  LogicalRepTupleData *newtup);
-extern void logicalrep_write_delete(StringInfo out, Relation rel,
-									HeapTuple oldtuple, bool binary);
+extern void logicalrep_write_delete(StringInfo out, TransactionId xid,
+									Relation rel, HeapTuple oldtuple,
+									bool binary);
 extern LogicalRepRelId logicalrep_read_delete(StringInfo in,
 											  LogicalRepTupleData *oldtup);
-extern void logicalrep_write_truncate(StringInfo out, int nrelids, Oid relids[],
+extern void logicalrep_write_truncate(StringInfo out, TransactionId xid,
+									  int nrelids, Oid relids[],
 									  bool cascade, bool restart_seqs);
 extern List *logicalrep_read_truncate(StringInfo in,
 									  bool *cascade, bool *restart_seqs);
-extern void logicalrep_write_rel(StringInfo out, Relation rel);
+extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
+								 Relation rel);
 extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
-extern void logicalrep_write_typ(StringInfo out, Oid typoid);
+extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
+								 Oid typoid);
 extern void logicalrep_read_typ(StringInfo out, LogicalRepTyp *ltyp);
+extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
+										  bool first_segment);
+extern TransactionId logicalrep_read_stream_start(StringInfo in,
+												  bool *first_segment);
+extern void logicalrep_write_stream_stop(StringInfo out);
+extern TransactionId logicalrep_read_stream_stop(StringInfo in);
+extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+										   XLogRecPtr commit_lsn);
+extern TransactionId logicalrep_read_stream_commit(StringInfo out,
+												   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
+										  TransactionId subxid);
+extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
+										 TransactionId *subxid);
 
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbe..1b05b39 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -178,6 +178,7 @@ typedef struct
 			uint32		proto_version;	/* Logical protocol version */
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
+			bool		streaming;	/* Streaming of large transactions */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index d71db0d..2fa9bce 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                      List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off                | dbname=regress_doesnotexist
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                          List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off                | dbname=regress_doesnotexist2
+                                                                List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                            List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | local              | dbname=regress_doesnotexist2
+                                                                  List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,42 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                      List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off                | dbname=regress_doesnotexist
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                      List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off                | dbname=regress_doesnotexist
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - streaming must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = foo);
+ERROR:  streaming requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                            List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index eeb2ec0..14fa0b2 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -132,6 +132,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail - streaming must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
new file mode 100644
index 0000000..fffe001
--- /dev/null
+++ b/src/test/subscription/t/015_stream.pl
@@ -0,0 +1,98 @@
+# Test streaming of simple large transaction
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 4;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain local defaults');
+
+# Test the streaming in binary mode
+$node_subscriber->safe_psql('postgres',
+"ALTER SUBSCRIPTION tap_sub SET (binary = on)"
+);
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(5001, 10000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(6667|6667|6667), 'check extra columns contain local defaults');
+
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values. This is to ensure that non-streaming transactions behave
+# properly after a streaming transaction.
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab");
+is($result, qq(6667|6667|6667), 'check extra columns contain locally changed data');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3d99046..500623e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -111,6 +111,7 @@ Append
 AppendPath
 AppendRelInfo
 AppendState
+ApplySubXactData
 Archive
 ArchiveEntryPtrType
 ArchiveFormat
@@ -2370,6 +2371,7 @@ StopList
 StopWorkersData
 StrategyNumber
 StreamCtl
+StreamXidHash
 StringInfo
 StringInfoData
 StripnullState
@@ -2380,6 +2382,7 @@ SubPlanState
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
+SubXactInfo
 SubXactEvent
 SubplanResultRelHashElem
 SubqueryScan
-- 
1.8.3.1

#515Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#514)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Sep 2, 2020 at 7:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 2, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

We can combine the tests in 015_stream_simple.pl and
020_stream_binary.pl as I can't see a good reason to keep them
separate. Then, I think we can keep only this part with the main patch
and extract other tests into a separate patch. Basically, we can
commit the basic tests with the main patch and then keep the advanced
tests separately. I am afraid that there are some tests that don't add
much value so we can review them separately.

Fixed

I have slightly adjusted this test and ran pgindent on the patch. I am
planning to push this tomorrow unless you have more comments.

Looks good to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#516Bossart, Nathan
bossartn@amazon.com
In reply to: Dilip Kumar (#515)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

I noticed a small compiler warning for this.

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 812aca8011..88d3444c39 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -199,7 +199,7 @@ typedef struct ApplySubXactData
 static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
 static void subxact_filename(char *path, Oid subid, TransactionId xid);
-static void changes_filename(char *path, Oid subid, TransactionId xid);
+static inline void changes_filename(char *path, Oid subid, TransactionId xid);

/*
* Information about subtransactions of a given toplevel transaction.

Nathan

#517Amit Kapila
amit.kapila16@gmail.com
In reply to: Bossart, Nathan (#516)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Sep 4, 2020 at 3:10 AM Bossart, Nathan <bossartn@amazon.com> wrote:

I noticed a small compiler warning for this.

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 812aca8011..88d3444c39 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -199,7 +199,7 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
static void subxact_filename(char *path, Oid subid, TransactionId xid);
-static void changes_filename(char *path, Oid subid, TransactionId xid);
+static inline void changes_filename(char *path, Oid subid, TransactionId xid);

Thanks for the report, I'll take care of this. I think the nearby
similar function subxact_filename() should also be inline.
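
To make the mismatch concrete, here is a minimal standalone sketch;
greet() is a made-up stand-in for changes_filename(), and whether (and
how) a compiler diagnoses the declaration/definition disagreement on
"inline" depends on the compiler and warning flags:

/* sketch.c - forward declaration and definition disagree on "inline";
 * the function name and body are invented for illustration only. */
#include <stdio.h>

static void greet(int n);		/* declaration lacks "inline" */

static inline void
greet(int n)					/* definition adds "inline" */
{
	printf("hello %d\n", n);
}

int
main(void)
{
	greet(42);
	return 0;
}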

--
With Regards,
Amit Kapila.

#518Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#510)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have fixed all the comments except the below comments.
1. verify the size of various tests to ensure that it is above
logical_decoding_work_mem.
2. I have checked that in one of the previous patches, we have a test
v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
quite similar to what we have in
v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
If there is any difference that can cover more scenarios then can we
consider merging them into one test?

I have compared these two tests and found that the only thing
additional in the test case present in
v53-0004-Add-TAP-test-for-streaming-vs.-DDL was that it was performing
a few savepoints and DMLs after doing the first rollback to savepoint,
and I included that in one of the existing tests in
018_stream_subxact_abort.pl. I have added one test for Rollback,
changed a few messages, and removed one test case that did not make
sense in the patch. See the attached patch and let me know what you
think.

--
With Regards,
Amit Kapila.

Attachments:

v61-0001-Add-additional-tests-to-test-streaming-of-in-pro.patch (application/octet-stream)
From 1e9ee56cc65f4abb33e0d752e6ed272ac0e013ed Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 5 Sep 2020 12:43:21 +0530
Subject: [PATCH v61] Add additional tests to test streaming of in-progress
 transactions.

This covers the functionality tests for streaming in-progress
subtransactions, streaming transactions containing rollback to savepoints,
and streaming transactions having DDLs.

Author: Tomas Vondra, Amit Kapila and Dilip Kumar
Reviewed-by: Dilip Kumar
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
---
 src/test/subscription/t/016_stream_subxact.pl |  81 ++++++++++++
 src/test/subscription/t/017_stream_ddl.pl     | 110 ++++++++++++++++
 .../t/018_stream_subxact_abort.pl             | 117 ++++++++++++++++++
 .../t/019_stream_subxact_ddl_abort.pl         |  76 ++++++++++++
 4 files changed, 384 insertions(+)
 create mode 100644 src/test/subscription/t/016_stream_subxact.pl
 create mode 100644 src/test/subscription/t/017_stream_ddl.pl
 create mode 100644 src/test/subscription/t/018_stream_subxact_abort.pl
 create mode 100644 src/test/subscription/t/019_stream_subxact_ddl_abort.pl

diff --git a/src/test/subscription/t/016_stream_subxact.pl b/src/test/subscription/t/016_stream_subxact.pl
new file mode 100644
index 0000000000..b6a2d10e91
--- /dev/null
+++ b/src/test/subscription/t/016_stream_subxact.pl
@@ -0,0 +1,81 @@
+# Test streaming of large transaction containing large subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(    3,  500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,  1000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,  1500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,  2000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001, 2500) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(1667|1667|1667), 'check data was copied to subscriber in streaming mode and extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/017_stream_ddl.pl b/src/test/subscription/t/017_stream_ddl.pl
new file mode 100644
index 0000000000..be7d7d74e3
--- /dev/null
+++ b/src/test/subscription/t/017_stream_ddl.pl
@@ -0,0 +1,110 @@
+# Test streaming of large transaction with DDL and subtransactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT, f INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|0|0), 'check initial data was copied to subscriber');
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (3, md5(3::text));
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (4, md5(4::text), -4);
+COMMIT;
+});
+
+# large (streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(5, 1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001, 2000) s(i);
+COMMIT;
+});
+
+# a small (non-streamed) transaction with DDL and DML
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab VALUES (2001, md5(2001::text), -2001, 2*2001);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s1;
+INSERT INTO test_tab VALUES (2002, md5(2002::text), -2002, 2*2002, -3*2002);
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e) FROM test_tab");
+is($result, qq(2002|1999|1002|1), 'check data was copied to subscriber in streaming mode and extra columns contain local defaults');
+
+# A large (streamed) transaction with DDL and DML. One of the DDL is performed
+# after DML to ensure that we invalidate the schema sent for test_tab so that
+# the next transaction has to send the schema again.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(2003,5000) s(i);
+ALTER TABLE test_tab ADD COLUMN f INT;
+COMMIT;
+});
+
+# A small transaction that won't get streamed. This is just to ensure that we
+# send the schema again to reflect the last column added in the previous test.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i, 4*i FROM generate_series(5001,5005) s(i);
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d), count(e), count(f) FROM test_tab");
+is($result, qq(5005|5002|4005|3004|5), 'check data was copied to subscriber for both streaming and non-streaming transactions');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/018_stream_subxact_abort.pl b/src/test/subscription/t/018_stream_subxact_abort.pl
new file mode 100644
index 0000000000..ddf0621558
--- /dev/null
+++ b/src/test/subscription/t/018_stream_subxact_abort.pl
@@ -0,0 +1,117 @@
+# Test streaming of large transaction containing multiple subtransactions and rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 4;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(501,1000) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1001,1500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1501,2000) s(i);
+ROLLBACK TO s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2001,2500) s(i);
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(2501,3000) s(i);
+SAVEPOINT s4;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3001,3500) s(i);
+SAVEPOINT s5;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3501,4000) s(i);
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2000|0), 'check rollback to savepoint was reflected on subscriber and extra columns contain local defaults');
+
+# large (streamed) transaction with subscriber receiving out of order
+# subtransaction ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(4001,4500) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(5001,5500) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(6001,6500) s(i);
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(7001,7500) s(i);
+RELEASE s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(8001,8500) s(i);
+ROLLBACK TO s1;
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2500|0), 'check rollback to savepoint was reflected on subscriber');
+
+# large (streamed) transaction with subscriber receiving rollback
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(8501,9000) s(i);
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(9001,9500) s(i);
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(9501,10000) s(i);
+ROLLBACK;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2500|0), 'check rollback was reflected on subscriber');
+
+$node_subscriber->stop;
+$node_publisher->stop;
diff --git a/src/test/subscription/t/019_stream_subxact_ddl_abort.pl b/src/test/subscription/t/019_stream_subxact_ddl_abort.pl
new file mode 100644
index 0000000000..33e42edfef
--- /dev/null
+++ b/src/test/subscription/t/019_stream_subxact_ddl_abort.pl
@@ -0,0 +1,76 @@
+# Test streaming of large transaction with subtransactions, DDLs, DMLs, and
+# rollbacks
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Create publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', 'logical_decoding_work_mem = 64kB');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+# Create some preexisting content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b text, c INT, d INT, e INT)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(2|0), 'check initial data was copied to subscriber');
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3,500) s(i);
+ALTER TABLE test_tab ADD COLUMN c INT;
+SAVEPOINT s1;
+INSERT INTO test_tab SELECT i, md5(i::text), -i FROM generate_series(501,1000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+SAVEPOINT s2;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i FROM generate_series(1001,1500) s(i);
+ALTER TABLE test_tab ADD COLUMN e INT;
+SAVEPOINT s3;
+INSERT INTO test_tab SELECT i, md5(i::text), -i, 2*i, -3*i FROM generate_series(1501,2000) s(i);
+ALTER TABLE test_tab DROP COLUMN c;
+ROLLBACK TO s1;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(501,1000) s(i);
+COMMIT;
+});
+
+$node_publisher->wait_for_catchup($appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c) FROM test_tab");
+is($result, qq(1000|500), 'check rollback to savepoint was reflected on subscriber and extra columns contain local defaults');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.28.0.windows.1

#519Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#518)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, 5 Sep 2020 at 4:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have fixed all the comments except the below comments.
1. verify the size of various tests to ensure that it is above
logical_decoding_work_mem.
2. I have checked that in one of the previous patches, we have a test
v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
quite similar to what we have in
v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
If there is any difference that can cover more scenarios then can we
consider merging them into one test?

I have compared these two tests and found that the only thing
additional in the test case present in
v53-0004-Add-TAP-test-for-streaming-vs.-DDL was that it was performing
a few savepoints and DMLs after doing the first rollback to savepoint,
and I included that in one of the existing tests in
018_stream_subxact_abort.pl. I have added one test for Rollback,
changed a few messages, and removed one test case that did not make
sense in the patch. See the attached patch and let me know what you
think.

I have reviewed the changes and they look fine to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#520Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#519)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have reviewed the changes and looks fine to me.

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports and then we can probably close this CF
entry. I am aware that we have one patch related to stats still
pending but I think we can tackle it along with the spill stats patch
which is being discussed in a different thread [1]. Do let me know if
I have missed anything?

[1]: /messages/by-id/CAA4eK1JBqQh9cBKjO-nKOOE=7f6ONDCZp0TJZfn4VsQqRZ+uYA@mail.gmail.com

--
With Regards,
Amit Kapila.

#521Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#520)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Sep 7, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have reviewed the changes and looks fine to me.

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports and then we can probably close this CF
entry.

Thanks.

I am aware that we have one patch related to stats still
pending but I think we can tackle it along with the spill stats patch
which is being discussed in a different thread [1]. Do let me know if
I have missed anything?

[1] -
/messages/by-id/CAA4eK1JBqQh9cBKjO-nKOOE=7f6ONDCZp0TJZfn4VsQqRZ+uYA@mail.gmail.com

Sounds good to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#522Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#521)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Sep 7, 2020 at 12:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Sep 7, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have reviewed the changes and looks fine to me.

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports and then we can probably close this CF
entry.

Thanks.

I have updated the status of the CF entry to committed now.

--
With Regards,
Amit Kapila.

#523Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#522)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Hi,

while looking at the streaming code I noticed two minor issues:

1) logicalrep_read_stream_stop is never defined/called, so the prototype
in logicalproto.h is unnecessary

2) minor typo in one of the comments

Patch attached.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

streaming-fixes.patch (text/plain; charset=us-ascii)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c29c088813..343f03129f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -77,7 +77,7 @@ static void send_relation_and_attrs(Relation relation, TransactionId xid,
  * and with streamed transactions the commit order may be different from
  * the order the transactions are sent in. Also, the (sub) transactions
  * might get aborted so we need to send the schema for each (sub) transaction
- * so that we don't loose the schema information on abort. For handling this,
+ * so that we don't lose the schema information on abort. For handling this,
  * we maintain the list of xids (streamed_txns) for those we have already sent
  * the schema.
  *
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 53905ee608..607a728508 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -133,7 +133,6 @@ extern void logicalrep_write_stream_start(StringInfo out, TransactionId xid,
 extern TransactionId logicalrep_read_stream_start(StringInfo in,
 												  bool *first_segment);
 extern void logicalrep_write_stream_stop(StringInfo out);
-extern TransactionId logicalrep_read_stream_stop(StringInfo in);
 extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 										   XLogRecPtr commit_lsn);
 extern TransactionId logicalrep_read_stream_commit(StringInfo out,
#524Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#523)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Hi,

while looking at the streaming code I noticed two minor issues:

1) logicalrep_read_stream_stop is never defined/called, so the prototype
in logicalproto.h is unnecessary

Yeah, right.

2) minor typo in one of the comments

Patch attached.

Looks good to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#525Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#523)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

while looking at the streaming code I noticed two minor issues:

1) logicalrep_read_stream_stop is never defined/called, so the prototype
in logicalproto.h is unnecessary

2) minor typo in one of the comments

Patch attached.

LGTM.

--
With Regards,
Amit Kapila.

#526Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#525)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Sep 9, 2020 at 2:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

while looking at the streaming code I noticed two minor issues:

1) logicalrep_read_stream_stop is never defined/called, so the prototype
in logicalproto.h is unnecessary

2) minor typo in one of the comments

Patch attached.

LGTM.

Pushed.

--
With Regards,
Amit Kapila.

#527Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#526)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed.

Observe the following reports:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04

These are all on HEAD, and all within the last ten days, and I see
nothing comparable in any branch before that. So it's hard to avoid
the conclusion that somebody broke something about ten days ago.

None of these animals provided gdb backtraces; but we do have a built-in
trace from several, and they all look like pgoutput.so is trying to
list_free() garbage, somewhere inside a relcache invalidation/rebuild
scenario:

TRAP: FailedAssertion("list->length > 0", File: "/home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/../pgsql/src/backend/nodes/list.c", Line: 68)
postgres: publisher: walsender bf [local] idle(ExceptionalCondition+0x57)[0x9081f7]
postgres: publisher: walsender bf [local] idle[0x6bcc70]
postgres: publisher: walsender bf [local] idle(list_free+0x11)[0x6bdc01]
/home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/tmp_install/home/bf/build/buildfarm-idiacanthus/HEAD/inst/lib/postgresql/pgoutput.so(+0x35d8)[0x7fa4c5a6f5d8]
postgres: publisher: walsender bf [local] idle(LocalExecuteInvalidationMessage+0x15b)[0x8f0cdb]
postgres: publisher: walsender bf [local] idle(ReceiveSharedInvalidMessages+0x4b)[0x7bca0b]
postgres: publisher: walsender bf [local] idle(LockRelationOid+0x56)[0x7c19e6]
postgres: publisher: walsender bf [local] idle(relation_open+0x1c)[0x4a2d0c]
postgres: publisher: walsender bf [local] idle(table_open+0x6)[0x524486]
postgres: publisher: walsender bf [local] idle[0x9017f2]
postgres: publisher: walsender bf [local] idle[0x8fabd4]
postgres: publisher: walsender bf [local] idle[0x8fa58a]
postgres: publisher: walsender bf [local] idle(RelationCacheInvalidateEntry+0xaf)[0x8fbdbf]
postgres: publisher: walsender bf [local] idle(LocalExecuteInvalidationMessage+0xec)[0x8f0c6c]
postgres: publisher: walsender bf [local] idle(ReceiveSharedInvalidMessages+0xcb)[0x7bca8b]
postgres: publisher: walsender bf [local] idle(LockRelationOid+0x56)[0x7c19e6]
postgres: publisher: walsender bf [local] idle(relation_open+0x1c)[0x4a2d0c]
postgres: publisher: walsender bf [local] idle(table_open+0x6)[0x524486]
postgres: publisher: walsender bf [local] idle[0x8ee8b0]

010_truncate.pl itself hasn't changed meaningfully in a good long time.
However, I see that 464824323 added a whole boatload of code to
pgoutput.c, and the timing is right for that commit to be the culprit,
so that's what I'm betting on.

Probably this requires a relcache inval at the wrong time;
although we have recent passes from CLOBBER_CACHE_ALWAYS animals,
so that can't be the whole triggering condition. I wonder whether
it is relevant that all of the complaining animals are JIT-enabled.

regards, tom lane

#528Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#527)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

I wrote:

Probably this requires a relcache inval at the wrong time;
although we have recent passes from CLOBBER_CACHE_ALWAYS animals,
so that can't be the whole triggering condition. I wonder whether
it is relevant that all of the complaining animals are JIT-enabled.

Hmmm ... I take that back. hyrax has indeed passed since this went
in, but *it doesn't run any TAP tests*. So the buildfarm offers no
information about whether the replication tests work under
CLOBBER_CACHE_ALWAYS.

Realizing that, I built an installation that way and tried to run
the subscription tests. Results so far:

* Running 010_truncate.pl by itself passed for me. So there's still
some unexplained factor needed to trigger the buildfarm failures.
(I'm wondering about concurrent autovacuum activity now...)

* Starting over, it appears that 001_rep_changes.pl almost immediately
gets into an infinite loop. It does not complete the third test step,
rather infinitely waiting for progress to be made. The publisher log
shows a repeating loop like

2020-09-13 21:16:05.734 EDT [928529] tap_sub LOG: could not send data to client: Broken pipe
2020-09-13 21:16:05.734 EDT [928529] tap_sub CONTEXT: slot "tap_sub", output plugin "pgoutput", in the commit callback, associated LSN 0/1660628
2020-09-13 21:16:05.843 EDT [928581] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:05.861 EDT [928582] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:16:05.929 EDT [928582] tap_sub LOG: received replication command: IDENTIFY_SYSTEM
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: starting logical decoding for slot "tap_sub"
2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL: Streaming transactions committing after 0/1652820, reading WAL from 0/1651B20.
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: logical decoding found consistent point at 0/1651B20
2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL: There are no running transactions.
2020-09-13 21:16:21.560 EDT [928600] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:37.291 EDT [928610] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:52.959 EDT [928627] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:17:06.866 EDT [928636] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG: received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:06.934 EDT [928636] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:07.811 EDT [928638] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:07.880 EDT [928638] tap_sub LOG: received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:07.881 EDT [928638] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:07.881 EDT [928638] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:08.618 EDT [928641] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:17:08.753 EDT [928642] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG: received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:08.821 EDT [928642] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:09.689 EDT [928645] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:09.756 EDT [928645] tap_sub LOG: received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:09.757 EDT [928645] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:09.757 EDT [928645] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:09.841 EDT [928582] tap_sub LOG: could not send data to client: Broken pipe
2020-09-13 21:17:09.841 EDT [928582] tap_sub CONTEXT: slot "tap_sub", output plugin "pgoutput", in the commit callback, associated LSN 0/1660628

while the subscriber is repeating

2020-09-13 21:15:01.598 EDT [928528] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:02.178 EDT [928528] ERROR: terminating logical replication worker due to timeout
2020-09-13 21:16:02.179 EDT [920797] LOG: background worker "logical replication worker" (PID 928528) exited with exit code 1
2020-09-13 21:16:02.606 EDT [928571] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:03.117 EDT [928571] ERROR: could not start WAL streaming: ERROR: replication slot "tap_sub" is active for PID 928529
2020-09-13 21:16:03.118 EDT [920797] LOG: background worker "logical replication worker" (PID 928571) exited with exit code 1
2020-09-13 21:16:03.544 EDT [928574] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:04.053 EDT [928574] ERROR: could not start WAL streaming: ERROR: replication slot "tap_sub" is active for PID 928529
2020-09-13 21:16:04.054 EDT [920797] LOG: background worker "logical replication worker" (PID 928574) exited with exit code 1
2020-09-13 21:16:04.479 EDT [928576] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:04.990 EDT [928576] ERROR: could not start WAL streaming: ERROR: replication slot "tap_sub" is active for PID 928529
2020-09-13 21:16:04.990 EDT [920797] LOG: background worker "logical replication worker" (PID 928576) exited with exit code 1
2020-09-13 21:16:05.415 EDT [928579] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:17:05.994 EDT [928579] ERROR: terminating logical replication worker due to timeout

I'm out of patience to investigate this for tonight, but there is
something extremely broken here; maybe more than one something.

regards, tom lane

#529Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#527)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Sep 14, 2020 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed.

Observe the following reports:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04

These are all on HEAD, and all within the last ten days, and I see
nothing comparable in any branch before that. So it's hard to avoid
the conclusion that somebody broke something about ten days ago.

I'll analyze these reports.

--
With Regards,
Amit Kapila.

#530Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#528)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

I wrote:

* Starting over, it appears that 001_rep_changes.pl almost immediately
gets into an infinite loop. It does not complete the third test step,
rather infinitely waiting for progress to be made.

Ah, looking closer, the problem is that wal_receiver_timeout = 60s
is too short when the sender is using CCA. It times out before we
can get through the needed data transmission.
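
(A hedged illustration, not necessarily how the tests were actually
adjusted: on the subscriber one can raise the limit, or disable it
outright, in postgresql.conf. The parameter is real; the choice of
value here is just a suggestion.)

# subscriber's postgresql.conf; the default is 60s, which a
# CLOBBER_CACHE_ALWAYS sender can easily exceed
wal_receiver_timeout = 0	# 0 disables the timeout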

regards, tom lane

#531Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#527)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Sep 14, 2020 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed.

Observe the following reports:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04

These are all on HEAD, and all within the last ten days, and I see
nothing comparable in any branch before that. So it's hard to avoid
the conclusion that somebody broke something about ten days ago.

None of these animals provided gdb backtraces; but we do have a built-in
trace from several, and they all look like pgoutput.so is trying to
list_free() garbage, somewhere inside a relcache invalidation/rebuild
scenario:

Yeah, this is right, and here is some initial analysis. It seems to be
failing in the below code:
rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}

This list can have elements only in 'streaming' mode (one needs to
enable 'streaming' in the Create Subscription command), whereas none of
the tests in 010_truncate.pl uses 'streaming', so this list should be
empty (NULL). The two different assertion failures shown in the BF
reports come from the list_free code:
Assert(list->length > 0);
Assert(list->length <= list->max_length);

It seems to me that this list is not initialized properly when it is
not used, or maybe that is only true in some special circumstances,
because we do initialize it in get_rel_sync_entry(). I am not sure if
the CCA build is impacting this in some way.

--
With Regards,
Amit Kapila.

#532Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#531)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Sep 14, 2020 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed.

Observe the following reports:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&amp;dt=2020-09-13%2016%3A54%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&amp;dt=2020-09-10%2009%3A08%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&amp;dt=2020-09-05%2020%3A22%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&amp;dt=2020-09-04%2001%3A52%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&amp;dt=2020-09-03%2020%3A54%3A04

These are all on HEAD, and all within the last ten days, and I see
nothing comparable in any branch before that. So it's hard to avoid
the conclusion that somebody broke something about ten days ago.

None of these animals provided gdb backtraces; but we do have a built-in
trace from several, and they all look like pgoutput.so is trying to
list_free() garbage, somewhere inside a relcache invalidation/rebuild
scenario:

Yeah, this is right, and here is some initial analysis. It seems to be
failing in the below code:
rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}

This list can have elements only in 'streaming' mode (one needs to
enable 'streaming' in the Create Subscription command), whereas none of
the tests in 010_truncate.pl uses 'streaming', so this list should be
empty (NULL). The two different assertion failures shown in the BF
reports come from the list_free code:
Assert(list->length > 0);
Assert(list->length <= list->max_length);

It seems to me that this list is not initialized properly when it is
not used, or maybe that is only true in some special circumstances,
because we do initialize it in get_rel_sync_entry(). I am not sure if
the CCA build is impacting this in some way.

I have also analyzed this but did not find any reason why the
streamed_txns list should be anything other than NULL. The only
subtlety is that we initialize entry->streamed_txns to NULL while
list_free checks "if (list == NIL)" and returns early; however, IMHO
that should not be an issue because NIL is defined as (List *) NULL. I
am doing further testing and investigation.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#533Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#532)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, this is right, and here is some initial analysis. It seems to be
failing in below code:
rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}

This list can have elements only in 'streaming' mode (need to enable
'streaming' with Create Subscription command) whereas none of the
tests in 010_truncate.pl is using 'streaming', so this list should be
empty (NULL). The two different assertion failures shown in BF reports
in list_free code are as below:
Assert(list->length > 0);
Assert(list->length <= list->max_length);

It seems to me that this list is not initialized properly when it is
not used, or maybe that is true only in some special circumstances,
because we initialize it in get_rel_sync_entry(). I am not sure if
the CCI build is impacting this in some way.

I have also analyzed this but did not find any reason why the
streamed_txns list should be anything other than NULL. The only thing
is that we initialize entry->streamed_txns to NULL, and list_free
checks "if (list == NIL)" and returns early. However, IMHO that
should not be an issue because NIL is defined as (List *) NULL.

Yeah, that is not the issue, but it is better to initialize it with
NIL for the sake of consistency. The basic issue here is that we were
trying to open/lock the relation(s) before initializing this list.
When we then process invalidations during the relation open, we try
to access this list in rel_sync_cache_relation_cb, which leads to the
assertion failure. I have reproduced the exact scenario of
010_truncate.pl via the debugger: the backend on the publisher sends
the invalidation after truncating the relation 'tab1', and if the
WALSender receives that message exactly after creating the
RelSyncEntry for 'tab1' while processing the truncate, the assertion
shown in the BF reports can be reproduced.
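
To make the window concrete, the failing interleaving looks roughly
like this (a sketch; it assumes invalidations are accepted during the
catalog access in GetRelationPublications):

/*
 * walsender, in get_rel_sync_entry(relid):
 *   hash_search(RelationSyncCache, ..., HASH_ENTER, &found)
 *            -- new entry created; streamed_txns still holds garbage
 *   GetRelationPublications(relid)
 *            -- catalog access accepts pending invalidations
 *     rel_sync_cache_relation_cb(arg, relid)
 *            -- fired because the publisher backend truncated 'tab1'
 *       list_free(entry->streamed_txns)
 *            -- reads the uninitialized pointer: assertion failure
 */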

The attached patch will fix the issue. What do you think?

--
With Regards,
Amit Kapila.

Attachments:

v1-0001-Fix-initialization-of-RelationSyncEntry-for-strea.patch (application/octet-stream)
From 86398ad4cc09d6dba79d43650a7d0ba0cdfdc069 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Mon, 14 Sep 2020 16:11:02 +0530
Subject: [PATCH v1] Fix initialization of RelationSyncEntry for streaming
 transactions.

In commit 464824323e, for each RelationSyncEntry we maintained the list
of xids (streamed_txns) for which we have already sent the schema. This
helps us to track when to send the schema to the downstream node for
replication of streaming transactions. Before this list got initialized,
we were processing invalidation messages which access this list and led
to an assertion failure.

In passing, initialize the list of xids with NIL instead of NULL, which
is our usual coding practice.
---
 src/backend/replication/pgoutput/pgoutput.c | 24 ++++++++++++++-------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c29c088813..c4d8c32624 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -956,10 +956,24 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	/* Not found means schema wasn't sent */
 	if (!found || !entry->replicate_valid)
 	{
-		List	   *pubids = GetRelationPublications(relid);
+		List	   *pubids;
 		ListCell   *lc;
 		Oid			publish_as_relid = relid;
 
+		/*
+		 * Initialize schema sent information before trying to open/lock any
+		 * relation. We want to avoid processing invalidation messages because
+		 * that can try to access this information. See
+		 * rel_sync_cache_relation_cb.
+		 */
+		if (!found)
+		{
+			entry->schema_sent = false;
+			entry->streamed_txns = NIL;
+		}
+
+		pubids = GetRelationPublications(relid);
+
 		/* Reload publications if needed before use. */
 		if (!publications_valid)
 		{
@@ -1054,12 +1068,6 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 		entry->replicate_valid = true;
 	}
 
-	if (!found)
-	{
-		entry->schema_sent = false;
-		entry->streamed_txns = NULL;
-	}
-
 	return entry;
 }
 
@@ -1145,7 +1153,7 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	{
 		entry->schema_sent = false;
 		list_free(entry->streamed_txns);
-		entry->streamed_txns = NULL;
+		entry->streamed_txns = NIL;
 	}
 }
 
-- 
2.28.0.windows.1

#534Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#533)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Sep 14, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, this is right, and here is some initial analysis. It seems to be
failing in the below code:
rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}

This list can have elements only in 'streaming' mode (need to enable
'streaming' with Create Subscription command) whereas none of the
tests in 010_truncate.pl is using 'streaming', so this list should be
empty (NULL). The two different assertion failures shown in BF reports
in list_free code are as below:
Assert(list->length > 0);
Assert(list->length <= list->max_length);

It seems to me that this list is not initialized properly when it is
not used, or maybe that is true only in some special circumstances,
because we initialize it in get_rel_sync_entry(). I am not sure if
the CCI build is impacting this in some way.

I have also analyzed this but did not find any reason why the
streamed_txns list should be anything other than NULL. The only thing
is that we initialize entry->streamed_txns to NULL, and list_free
checks "if (list == NIL)" and returns early. However, IMHO that
should not be an issue because NIL is defined as (List *) NULL.

Yeah, that is not the issue, but it is better to initialize it with
NIL for the sake of consistency. The basic issue here is that we were
trying to open/lock the relation(s) before initializing this list.
When we then process invalidations during the relation open, we try
to access this list in rel_sync_cache_relation_cb, which leads to the
assertion failure. I have reproduced the exact scenario of
010_truncate.pl via the debugger: the backend on the publisher sends
the invalidation after truncating the relation 'tab1', and if the
WALSender receives that message exactly after creating the
RelSyncEntry for 'tab1' while processing the truncate, the assertion
shown in the BF reports can be reproduced.

Yeah, this is an issue and I am also able to reproduce it manually
using gdb. Basically, I inserted some data into the publication table
and then stopped in get_rel_sync_entry after creating the entry and
before calling GetRelationPublications. Meanwhile, I truncated the
table, and it then hit the same issue you pointed out here.

The attached patch will fix the issue. What do you think?

The patch looks good to me and fixes the reported issue.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#535Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#533)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Amit Kapila <amit.kapila16@gmail.com> writes:

The attached patch will fix the issue. What do you think?

I think it'd be cleaner to separate the initialization of a new entry from
validation altogether, along the lines of

/* Find cached function info, creating if not found */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
(void *) &relid,
HASH_ENTER, &found);
MemoryContextSwitchTo(oldctx);
Assert(entry != NULL);

if (!found)
{
/* immediately make a new entry valid enough to satisfy callbacks */
entry->schema_sent = false;
entry->streamed_txns = NIL;
entry->replicate_valid = false;
/* are there any other fields we should clear here for safety??? */
}

/* Fill it in if not valid */
if (!entry->replicate_valid)
{
List *pubids = GetRelationPublications(relid);
...

BTW, unless someone has changed the behavior of dynahash when I
wasn't looking, those MemoryContextSwitchTos shown above are useless.
Also, why does the comment refer to a "function" entry?

regards, tom lane

#536Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#535)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Sep 14, 2020 at 9:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

The attached patch will fix the issue. What do you think?

I think it'd be cleaner to separate the initialization of a new entry from
validation altogether, along the lines of

/* Find cached function info, creating if not found */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
(void *) &relid,
HASH_ENTER, &found);
MemoryContextSwitchTo(oldctx);
Assert(entry != NULL);

if (!found)
{
/* immediately make a new entry valid enough to satisfy callbacks */
entry->schema_sent = false;
entry->streamed_txns = NIL;
entry->replicate_valid = false;
/* are there any other fields we should clear here for safety??? */
}

If we want to separate validation then we need to initialize other
fields like 'pubactions' and 'publish_as_relid' as well. I think it
will be better to arrange it the way you are suggesting, so I will
change it along with the other fields that require initialization.

/* Fill it in if not valid */
if (!entry->replicate_valid)
{
List *pubids = GetRelationPublications(relid);
...

BTW, unless someone has changed the behavior of dynahash when I
wasn't looking, those MemoryContextSwitchTos shown above are useless.

As far as I can see they are useless in this case but I think they
might be required in case the user provides its own allocator function
(using HASH_ALLOC). So, we can probably remove those from here?

Also, why does the comment refer to a "function" entry?

It should be "relation" instead. I'll take care of changing this as well.

--
With Regards,
Amit Kapila.

#537Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#536)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Amit Kapila <amit.kapila16@gmail.com> writes:

On Mon, Sep 14, 2020 at 9:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, unless someone has changed the behavior of dynahash when I
wasn't looking, those MemoryContextSwitchTos shown above are useless.

As far as I can see they are useless in this case but I think they
might be required in case the user provides its own allocator function
(using HASH_ALLOC). So, we can probably remove those from here?

You could imagine writing a HASH_ALLOC allocator whose behavior
varies depending on CurrentMemoryContext, but it seems like a
pretty foolish/fragile way to do it. In any case I can think of,
the hash table lives in one specific context and you really
really do not want parts of it spread across other contexts.
dynahash.c is not going to look kindly on pieces of what it
is managing disappearing from under it.

(To be clear, objects that the hash entries contain pointers to
are a different question. But the hash entries themselves have
to have exactly the same lifespan as the hash table.)

regards, tom lane
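
For illustration, the pattern being described — letting HASH_CONTEXT pin
the whole table to one memory context instead of wrapping hash_create()
in MemoryContextSwitchTo() — looks roughly like this (a sketch; the key
and entry types are illustrative, not the exact pgoutput code):

HASHCTL		ctl;
HTAB	   *ht;

memset(&ctl, 0, sizeof(ctl));
ctl.keysize = sizeof(Oid);
ctl.entrysize = sizeof(RelationSyncEntry);
ctl.hcxt = CacheMemoryContext;		/* the entire table lives here */

/* no MemoryContextSwitchTo() needed: HASH_CONTEXT fixes the allocator */
ht = hash_create("RelationSyncCache", 128, &ctl,
				 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);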

#538Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#537)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Sep 15, 2020 at 8:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Mon, Sep 14, 2020 at 9:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, unless someone has changed the behavior of dynahash when I
wasn't looking, those MemoryContextSwitchTos shown above are useless.

As far as I can see they are useless in this case but I think they
might be required in case the user provides its own allocator function
(using HASH_ALLOC). So, we can probably remove those from here?

You could imagine writing a HASH_ALLOC allocator whose behavior
varies depending on CurrentMemoryContext, but it seems like a
pretty foolish/fragile way to do it. In any case I can think of,
the hash table lives in one specific context and you really
really do not want parts of it spread across other contexts.
dynahash.c is not going to look kindly on pieces of what it
is managing disappearing from under it.

I agree, that doesn't make sense. I have addressed all the comments
discussed above in the attached patch.

--
With Regards,
Amit Kapila.

Attachments:

v2-0001-Fix-initialization-of-RelationSyncEntry-for-strea.patch (application/octet-stream)
From 5bc6936e96e71cd2e971e889f21b448b0e1f46a2 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Mon, 14 Sep 2020 16:11:02 +0530
Subject: [PATCH v2] Fix initialization of RelationSyncEntry for streaming
 transactions.

In commit 464824323e, for each RelationSyncEntry we maintained the list
of xids (streamed_txns) for which we have already sent the schema. This
helps us to track when to send the schema to the downstream node for
replication of streaming transactions. Before this list got initialized,
we were processing invalidation messages which access this list and led
to an assertion failure.

In passing, clean up the nearby code:

* Initialize the list of xids with NIL instead of NULL which is our usual
coding practice.
* Remove the MemoryContext switch for creating a RelationSyncEntry in dynahash.

Diagnosed-by: Amit Kapila and Tom Lane
Author: Amit Kapila
Reviewed-by: Tom Lane and Dilip Kumar
Discussion: https://postgr.es/m/904373.1600033123@sss.pgh.pa.us
---
 src/backend/replication/pgoutput/pgoutput.c | 29 +++++++++++----------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c29c088813..e5922f8e30 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -945,16 +945,26 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 
 	Assert(RelationSyncCache != NULL);
 
-	/* Find cached function info, creating if not found */
-	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+	/* Find cached relation info, creating if not found */
 	entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
 											  (void *) &relid,
 											  HASH_ENTER, &found);
-	MemoryContextSwitchTo(oldctx);
 	Assert(entry != NULL);
 
 	/* Not found means schema wasn't sent */
-	if (!found || !entry->replicate_valid)
+	if (!found)
+	{
+		/* immediately make a new entry valid enough to satisfy callbacks */
+		entry->schema_sent = false;
+		entry->streamed_txns = NIL;
+		entry->replicate_valid = false;
+		entry->pubactions.pubinsert = entry->pubactions.pubupdate =
+			entry->pubactions.pubdelete = entry->pubactions.pubtruncate = false;
+		entry->publish_as_relid = InvalidOid;
+	}
+
+	/* Validate the entry */
+	if (!entry->replicate_valid)
 	{
 		List	   *pubids = GetRelationPublications(relid);
 		ListCell   *lc;
@@ -977,9 +987,6 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 		 * relcache considers all publications given relation is in, but here
 		 * we only need to consider ones that the subscriber requested.
 		 */
-		entry->pubactions.pubinsert = entry->pubactions.pubupdate =
-			entry->pubactions.pubdelete = entry->pubactions.pubtruncate = false;
-
 		foreach(lc, data->publications)
 		{
 			Publication *pub = lfirst(lc);
@@ -1054,12 +1061,6 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 		entry->replicate_valid = true;
 	}
 
-	if (!found)
-	{
-		entry->schema_sent = false;
-		entry->streamed_txns = NULL;
-	}
-
 	return entry;
 }
 
@@ -1145,7 +1146,7 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
 	{
 		entry->schema_sent = false;
 		list_free(entry->streamed_txns);
-		entry->streamed_txns = NULL;
+		entry->streamed_txns = NIL;
 	}
 }
 
-- 
2.28.0.windows.1

#539Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#538)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Sep 15, 2020 at 10:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 15, 2020 at 8:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

As far as I can see they are useless in this case but I think they
might be required in case the user provides its own allocator function
(using HASH_ALLOC). So, we can probably remove those from here?

You could imagine writing a HASH_ALLOC allocator whose behavior
varies depending on CurrentMemoryContext, but it seems like a
pretty foolish/fragile way to do it. In any case I can think of,
the hash table lives in one specific context and you really
really do not want parts of it spread across other contexts.
dynahash.c is not going to look kindly on pieces of what it
is managing disappearing from under it.

I agree, that doesn't make sense. I have addressed all the comments
discussed above in the attached patch.

Pushed.

--
With Regards,
Amit Kapila.

#540Noah Misch
noah@leadboat.com
In reply to: Amit Kapila (#520)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
failed the new 015_stream.pl test with the subscriber looping like this:

2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exited with exit code 1
2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
...

What happened there?

#541Amit Kapila
amit.kapila16@gmail.com
In reply to: Noah Misch (#540)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:

On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
failed the new 015_stream.pl test with the subscriber looping like this:

I will look into this.

--
With Regards,
Amit Kapila.

#542Amit Kapila
amit.kapila16@gmail.com
In reply to: Noah Misch (#540)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:

On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
failed the new 015_stream.pl test with the subscriber looping like this:

2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exited with exit code 1
2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
...

What happened there?

What is going on here is that the expected streaming file is missing.
Normally, the first time we send a stream of changes (some percentage
of the transaction's changes) we create the streaming file, and in
subsequent streams we keep appending the changes we receive from the
publisher to that file; on commit, we read the file and apply all the
changes.

The above kind of error can happen for the following reasons: (a) the
first time, we sent the stream and created the file, and it got
removed before the second stream reached the subscriber; (b) the
publisher never sent the indication that it is the first stream, so
the subscriber directly tries to open the file thinking it is already
there.

Now, the publisher and subscriber logs don't directly indicate either
of the above problems, but I have some observations.

The subscriber log indicates that a new apply worker gets started
before the old apply worker exits due to the error. We delete the
streaming-related temporary files on proc_exit, so one possibility
could have been that the new apply worker created the streaming file
which the old apply worker then removed, but that is not possible
because we always create these temp files with the procid in the
path.
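
For reference, the path is assembled roughly like this (paraphrased
from sharedfileset.c and worker.c; the creator PID baked into the
directory name is what isolates one worker's files from another's):

/* fileset directory, e.g. base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset */
snprintf(dirpath, MAXPGPATH, "%s/%s%lu.%u.sharedfileset",
		 tempdirpath, PG_TEMP_FILE_PREFIX,
		 (unsigned long) fileset->creator_pid, fileset->number);

/* per-transaction changes file, e.g. 16393-510.changes
 * (subscription OID and remote xid) */
snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);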

The other thing I observed in the code is that we can mark the
transaction as streamed (via ReorderBufferTruncateTxn) if, the first
time we try to stream it, the transaction has no changes. This would
lead to symptom (b), because the second time, when there are more
changes, we would stream them as if it were not the first time.
However, this shouldn't happen because we never pick up a transaction
to stream that has no changes. I can try to fix the code so that we
don't mark the transaction as streamed unless we have streamed at
least one change, but I don't see how that is related to this
particular test failure.
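
A hypothetical shape for that guard (not what the code does today; the
condition name is made up for illustration, though RBTXN_IS_STREAMED
is the real flag):

/*
 * Hypothetical guard in ReorderBufferTruncateTxn(): mark the
 * transaction as streamed only if at least one change actually
 * went out, so an empty first attempt cannot make a later stream
 * look like a non-first one to the subscriber.
 */
if (streamed_at_least_one_change)		/* hypothetical condition */
	txn->txn_flags |= RBTXN_IS_STREAMED;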

I am not sure why this failure has not recurred since it occurred a
few months back; it's probably a timing issue. I have fixed a few
timing issues related to this feature in the last month or so, but I
am not able to come up with a theory for whether any of those would
have fixed this problem.

--
With Regards,
Amit Kapila.

#543Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#542)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:

On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
failed the new 015_stream.pl test with the subscriber looping like this:

2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exited with exit code 1
2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
...

What happened there?

What is going on here is that the expected streaming file is missing.
Normally, the first time we send a stream of changes (some percentage
of the transaction's changes) we create the streaming file, and in
subsequent streams we keep appending the changes we receive from the
publisher to that file; on commit, we read the file and apply all the
changes.

The above kind of error can happen for the following reasons: (a) the
first time, we sent the stream and created the file, and it got
removed before the second stream reached the subscriber; (b) the
publisher never sent the indication that it is the first stream, so
the subscriber directly tries to open the file thinking it is already
there.

Now, the publisher and subscriber logs don't directly indicate either
of the above problems, but I have some observations.

The subscriber log indicates that a new apply worker gets started
before the old apply worker exits due to the error. We delete the
streaming-related temporary files on proc_exit, so one possibility
could have been that the new apply worker created the streaming file
which the old apply worker then removed, but that is not possible
because we always create these temp files with the procid in the
path.

Yeah, and I have tried to test along this line: basically, after the
streaming started I set binary=on, and then using gdb I made the
worker wait before it deletes the temp file. Meanwhile the new worker
started, and it worked properly as expected.

The other thing I observed in the code is that we can mark the
transaction as streamed (via ReorderBufferTruncateTxn) if, the first
time we try to stream it, the transaction has no changes. This would
lead to symptom (b), because the second time, when there are more
changes, we would stream them as if it were not the first time.
However, this shouldn't happen because we never pick up a transaction
to stream that has no changes. I can try to fix the code so that we
don't mark the transaction as streamed unless we have streamed at
least one change, but I don't see how that is related to this
particular test failure.

Yeah, this can be improved, but as you mentioned, we never select an
empty transaction for streaming, so this case should not occur. I
will perform some testing/review around this and report back.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#544Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#543)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

What is going on here is that the expected streaming file is missing.
Normally, the first time we send a stream of changes (some percentage
of the transaction's changes) we create the streaming file, and in
subsequent streams we keep appending the changes we receive from the
publisher to that file; on commit, we read the file and apply all the
changes.

The above kind of error can happen for the following reasons: (a) the
first time, we sent the stream and created the file, and it got
removed before the second stream reached the subscriber; (b) the
publisher never sent the indication that it is the first stream, so
the subscriber directly tries to open the file thinking it is already
there.

Now, the publisher and subscriber logs don't directly indicate either
of the above problems, but I have some observations.

The subscriber log indicates that a new apply worker gets started
before the old apply worker exits due to the error. We delete the
streaming-related temporary files on proc_exit, so one possibility
could have been that the new apply worker created the streaming file
which the old apply worker then removed, but that is not possible
because we always create these temp files with the procid in the
path.

Yeah, and I have tried to test along this line: basically, after the
streaming started I set binary=on, and then using gdb I made the
worker wait before it deletes the temp file. Meanwhile the new worker
started, and it worked properly as expected.

The other thing I observed in the code is that we can mark the
transaction as streamed (via ReorderBufferTruncateTxn) if, the first
time we try to stream it, the transaction has no changes. This would
lead to symptom (b), because the second time, when there are more
changes, we would stream them as if it were not the first time.
However, this shouldn't happen because we never pick up a transaction
to stream that has no changes. I can try to fix the code so that we
don't mark the transaction as streamed unless we have streamed at
least one change, but I don't see how that is related to this
particular test failure.

Yeah, this can be improved, but as you mentioned, we never select an
empty transaction for streaming, so this case should not occur. I
will perform some testing/review around this and report back.

On further thought about this point, I think the message seen on the
subscriber [1] won't occur if we missed the first stream. This is
because we always check the value of the fileset from the stream hash
table (xidhash), and it won't be there if we directly sent the second
stream; that would have led to a different kind of problem (probably
a crash). So this symptom seems to be due to reason (a) mentioned
above, unless we are missing something else. Now, I am not sure how
the file can be removed while the corresponding entry in the hash
table (xidhash) is still present. The only reasons that come to mind
are that some other process cleaned the pgsql_tmp directory thinking
these temporary files are not required, or that someone manually
removed them; neither seems a plausible reason.

[1] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory

--
With Regards,
Amit Kapila.

#545Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#543)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:

On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
failed the new 015_stream.pl test with the subscriber looping like this:

2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exited with exit code 1
2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
...

What happened there?

What is going on here is that the expected streaming file is missing.
Normally, the first time we send a stream of changes (some percentage
of the transaction's changes) we create the streaming file, and in
subsequent streams we keep appending the changes we receive from the
publisher to that file; on commit, we read the file and apply all the
changes.

The above kind of error can happen for the following reasons: (a) the
first time, we sent the stream and created the file, and it got
removed before the second stream reached the subscriber; (b) the
publisher never sent the indication that it is the first stream, so
the subscriber directly tries to open the file thinking it is already
there.

Now, the publisher and subscriber logs don't directly indicate either
of the above problems, but I have some observations.

The subscriber log indicates that a new apply worker gets started
before the old apply worker exits due to the error. We delete the
streaming-related temporary files on proc_exit, so one possibility
could have been that the new apply worker created the streaming file
which the old apply worker then removed, but that is not possible
because we always create these temp files with the procid in the
path.

Yeah, and I have tried to test along this line: basically, after the
streaming started I set binary=on, and then using gdb I made the
worker wait before it deletes the temp file. Meanwhile the new worker
started, and it worked properly as expected.

The other thing I observed in the code is that we can mark the
transaction as streamed (via ReorderBufferTruncateTxn) if, the first
time we try to stream it, the transaction has no changes. This would
lead to symptom (b), because the second time, when there are more
changes, we would stream them as if it were not the first time.
However, this shouldn't happen because we never pick up a transaction
to stream that has no changes. I can try to fix the code so that we
don't mark the transaction as streamed unless we have streamed at
least one change, but I don't see how that is related to this
particular test failure.

Yeah, this can be improved, but as you mentioned, we never select an
empty transaction for streaming, so this case should not occur. I
will perform some testing/review around this and report back.

I have executed "make check" in a loop with only this test file. I
repeated it 5000 times with no failure. I am wondering whether we
should try running it in a loop on the same machine where it failed
once?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#546Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#545)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Dec 2, 2020 at 1:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:

On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
failed the new 015_stream.pl test with the subscriber looping like this:

2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exited with exit code 1
2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
...

What happened there?

What is going on here is that the expected streaming file is missing.
Normally, the first time we send a stream of changes (some percentage
of the transaction's changes) we create the streaming file, and in
subsequent streams we keep appending the changes we receive from the
publisher to that file; on commit, we read the file and apply all the
changes.

The above kind of error can happen for the following reasons: (a) the
first time, we sent the stream and created the file, and it got
removed before the second stream reached the subscriber; (b) the
publisher never sent the indication that it is the first stream, so
the subscriber directly tries to open the file thinking it is already
there.

I have executed "make check" in a loop with only this test file. I
repeated it 5000 times with no failure. I am wondering whether we
should try running it in a loop on the same machine where it failed
once?

Yes, that might help. Noah, would it be possible for you to try that
out, and if it fails, get a stack trace of the subscriber? If we are
able to reproduce it, we can add elogs in SharedFileSetInit,
BufFileCreateShared, BufFileOpenShared, and SharedFileSetDeleteAll to
print the paths and see whether we are sometimes unintentionally
removing some files. I have checked the code and there don't appear
to be any such problems, but I might be missing something.

--
With Regards,
Amit Kapila.

#547Noah Misch
noah@leadboat.com
In reply to: Amit Kapila (#546)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Dec 02, 2020 at 01:50:25PM +0530, Amit Kapila wrote:

On Wed, Dec 2, 2020 at 1:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:

On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
failed the new 015_stream.pl test with the subscriber looping like this:

2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exited with exit code 1
2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
...

The above kind of error can happen for the following reasons: (a) the
first time, we sent the stream and created the file, and it got
removed before the second stream reached the subscriber; (b) the
publisher never sent the indication that it is the first stream, so
the subscriber directly tries to open the file thinking it is already
there.

Further testing showed it was a file location problem, not a deletion problem.
The worker tried to open
base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset/16393-510.changes.0, but these
were the files actually existing:

[nm@power-aix 0:2 2020-12-08T13:56:35 64gcc 0]$ ls -la $(find src/test/subscription/tmp_check -name '*sharedfileset*')
src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.0.sharedfileset:
total 408
drwx------ 2 nm usr 256 Dec 08 03:20 .
drwx------ 4 nm usr 256 Dec 08 03:20 ..
-rw------- 1 nm usr 207806 Dec 08 03:20 16393-510.changes.0

src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset:
total 0
drwx------ 2 nm usr 256 Dec 08 03:20 .
drwx------ 4 nm usr 256 Dec 08 03:20 ..
-rw------- 1 nm usr 0 Dec 08 03:20 16393-511.changes.0

I have executed "make check" in a loop with only this test file. I
repeated it 5000 times with no failure. I am wondering whether we
should try running it in a loop on the same machine where it failed
once?

Yes, that might help. Noah, would it be possible for you to try that

The problem is xidhash using strcmp() to compare keys; it needs memcmp(). For
this to matter, xidhash must contain more than one element. Existing tests
rarely exercise the multi-element scenario. Under heavy load, on this system,
the test publisher can have two active transactions at once, in which case it
does exercise multi-element xidhash. (The publisher is sensitive to timing,
but the subscriber is not; once WAL contains interleaved records of two XIDs,
the subscriber fails every time.) This would be much harder to reproduce on a
little-endian system, where strcmp(&xid, &xid_plus_one)!=0. On big-endian,
every small XID has zero in the first octet; they all look like empty strings.
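
A minimal illustration of the failure mode (assuming a 4-byte
TransactionId and big-endian byte order):

TransactionId x1 = 510;		/* bytes on big-endian: 00 00 01 FE */
TransactionId x2 = 511;		/* bytes on big-endian: 00 00 01 FF */

/* Without HASH_BLOBS, dynahash hashes and compares keys as C strings,
 * stopping at the first zero byte, so both keys look empty: */
Assert(strcmp((const char *) &x1, (const char *) &x2) == 0);

/* With HASH_BLOBS, keys are hashed and compared as fixed-size binary,
 * so all four bytes participate: */
Assert(memcmp(&x1, &x2, sizeof(TransactionId)) != 0);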

The attached patch has the one-line fix and some test suite changes that make
this reproduce frequently on any big-endian system. I'm currently planning to
drop the test suite changes from the commit, but I could keep them if folks
like them. (They'd need more comments and timeout handling.)

Attachments:

xidhash-blobs-v1.patch (text/plain)
Author:     Noah Misch <noah@leadboat.com>
Commit:     Noah Misch <noah@leadboat.com>

    Use HASH_BLOBS for xidhash.
    
    This caused BufFile errors on buildfarm member sungazer, and SIGSEGV was
    possible.  Conditions for reaching those symptoms were more frequent on
    big-endian systems.
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/20201129214441.GA691200@rfd.leadboat.com

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c37aafe..fce1dee 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -804,7 +804,7 @@ apply_handle_stream_start(StringInfo s)
 		hash_ctl.entrysize = sizeof(StreamXidHash);
 		hash_ctl.hcxt = ApplyContext;
 		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
-							  HASH_ELEM | HASH_CONTEXT);
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 	}
 
 	/* open the spool file for this transaction */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1488bff..40610f1 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1626,6 +1626,42 @@ sub interactive_psql
 	return $harness;
 }
 
+# return IPC::Run harness object for non-interactive psql
+# FIXME pick a better name, and add POD docs
+sub psql_printable
+{
+	my ($self, $dbname, $stdin, $stdout, $timer, %params) = @_;
+
+	my $replication       = $params{replication};
+
+	my @psql_params       = (
+		'psql',
+		'-XAtq',
+		'-d',
+		$self->connstr($dbname)
+		  . (defined $replication ? " replication=$replication" : ""),
+		'-f',
+		'-');
+
+	$params{on_error_stop} = 1 unless defined $params{on_error_stop};
+
+	push @psql_params, '-v', 'ON_ERROR_STOP=1' if $params{on_error_stop};
+	push @psql_params, @{ $params{extra_params} }
+	  if defined $params{extra_params};
+
+	# Ensure there is no data waiting to be sent:
+	$$stdin = "" if ref($stdin);
+	# IPC::Run would otherwise append to existing contents:
+	$$stdout = "" if ref($stdout);
+
+	my $harness = IPC::Run::start \@psql_params,
+	  '<', $stdin, '>', $stdout, $timer;
+
+	die "psql startup timed out" if $timer->is_expired;
+
+	return $harness;
+}
+
 =pod
 
 =item $node->poll_query_until($dbname, $query [, $expected ])
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
index fffe001..9ebe166 100644
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -47,14 +47,35 @@ my $result =
 is($result, qq(2|2|2), 'check initial data was copied to subscriber');
 
 # Insert, update and delete enough rows to exceed the 64kB limit.
-$node_publisher->safe_psql('postgres', q{
+my $in  = '';
+my $out = '';
+
+my $timer = IPC::Run::timer(180);
+
+my $h = $node_publisher->psql_printable('postgres', \$in, \$out, $timer);
+
+$in .= q{
 BEGIN;
 INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
 UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
 DELETE FROM test_tab WHERE mod(a,3) = 0;
+};
+$h->pump;
+
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(5001, 9999) s(i);
+DELETE FROM test_tab WHERE a > 5000;
 COMMIT;
 });
 
+$in .= q{
+COMMIT;
+\q
+};
+$h->pump;
+$h->finish;
+
 $node_publisher->wait_for_catchup($appname);
 
 $result =
#548Amit Kapila
amit.kapila16@gmail.com
In reply to: Noah Misch (#547)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Dec 9, 2020 at 2:56 PM Noah Misch <noah@leadboat.com> wrote:

Further testing showed it was a file location problem, not a deletion problem.
The worker tried to open
base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset/16393-510.changes.0, but these
were the files actually existing:

[nm@power-aix 0:2 2020-12-08T13:56:35 64gcc 0]$ ls -la $(find src/test/subscription/tmp_check -name '*sharedfileset*')
src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.0.sharedfileset:
total 408
drwx------ 2 nm usr 256 Dec 08 03:20 .
drwx------ 4 nm usr 256 Dec 08 03:20 ..
-rw------- 1 nm usr 207806 Dec 08 03:20 16393-510.changes.0

src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset:
total 0
drwx------ 2 nm usr 256 Dec 08 03:20 .
drwx------ 4 nm usr 256 Dec 08 03:20 ..
-rw------- 1 nm usr 0 Dec 08 03:20 16393-511.changes.0

I have executed "make check" in a loop with only this test file. I
repeated it 5000 times with no failure. I am wondering whether we
should try running it in a loop on the same machine where it failed
once?

Yes, that might help. Noah, would it be possible for you to try that

The problem is xidhash using strcmp() to compare keys; it needs memcmp(). For
this to matter, xidhash must contain more than one element. Existing tests
rarely exercise the multi-element scenario. Under heavy load, on this system,
the test publisher can have two active transactions at once, in which case it
does exercise multi-element xidhash. (The publisher is sensitive to timing,
but the subscriber is not; once WAL contains interleaved records of two XIDs,
the subscriber fails every time.) This would be much harder to reproduce on a
little-endian system, where strcmp(&xid, &xid_plus_one)!=0. On big-endian,
every small XID has zero in the first octet; they all look like empty strings.

Your analysis is correct.

The attached patch has the one-line fix and some test suite changes that make
this reproduce frequently on any big-endian system. I'm currently planning to
drop the test suite changes from the commit, but I could keep them if folks
like them. (They'd need more comments and timeout handling.)

I think it is better to keep this test, which reliably exercises
multiple streams on the subscriber.

Thanks for working on this.

--
With Regards,
Amit Kapila.

#549Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#548)
HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

Amit Kapila <amit.kapila16@gmail.com> writes:

On Wed, Dec 9, 2020 at 2:56 PM Noah Misch <noah@leadboat.com> wrote:

The problem is xidhash using strcmp() to compare keys; it needs memcmp().

Your analysis is correct.

Sorry for not having noticed this thread before. Noah's fix is
clearly correct, and I have no objection to the added test case.
But what jumps out at me here is that this sort of error seems way
too easy to make, and evidently way too hard to detect. What can we
do to make it more obvious if one has incorrectly used or omitted
HASH_BLOBS? Both directions of error might easily escape notice on
little-endian hardware.

I thought of a few ideas, all of which have drawbacks:

1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
This seems to just move the problem somewhere else, besides which
it'd require touching an awful lot of callers, and would silently
break third-party callers.

2. Don't allow a default: invent a new HASH_STRING flag, and
require that hash_create() calls specify exactly one of HASH_BLOBS,
HASH_STRING, or HASH_FUNCTION. This doesn't completely fix the
hazard of mindless-copy-and-paste, but I think it might make it
a little more obvious. Still requires touching a lot of calls.

3. Add some sort of heuristic restriction on keysize. A keysize
that's only 4 or 8 bytes almost certainly is not a string.
This doesn't give us much traction for larger keysizes, though.

4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
in string_hash(). I think we could get away with that given that
SQL disallows empty identifiers. However, it would only help to
catch one direction of error (omitting HASH_BLOBS), and it would
only help on big-endian hardware, which is getting harder to find.
Still, we could hope that the buildfarm would detect errors.

There might be some more options. Also, some of these ideas
could be applied in combination.

A quick count of grep hits suggests that the large majority of
existing hash_create() calls use HASH_BLOBS, and there might be
only order-of-ten calls that would need to be touched if we
required an explicit HASH_STRING flag. So option #2 is seeming
kind of attractive. Maybe that together with an assertion that
string keys have to exceed 8 or 16 bytes would be enough protection.
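
A sketch of what option #2 plus that assertion could look like inside
hash_create() (illustrative only, not committed code; it uses the
HASH_STRINGS spelling that the eventual patch adopts):

/* require callers to pick exactly one key-handling mode */
if (flags & HASH_FUNCTION)
	hashp->hash = info->hash;
else if (flags & HASH_BLOBS)
	hashp->hash = (info->keysize == sizeof(uint32)) ?
		uint32_hash : tag_hash;
else
{
	Assert(flags & HASH_STRINGS);	/* no silent default anymore */
	Assert(info->keysize > 8);		/* tiny keys are surely binary */
	hashp->hash = string_hash;
}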

Also, this census now suggests to me that the opposite problem
(copy-and-paste HASH_BLOBS when you meant string keys) might be
a real hazard, since so many of the existing prototypes that you
might copy have HASH_BLOBS. I'm not sure if there's much to be
done for this case though. A small saving grace is that it seems
relatively likely that you'd notice a functional problem pretty
quickly with this type of mistake, since lookups would tend to
fail due to trailing garbage after your lookup string.

A different angle we could think about is that the name "HASH_BLOBS"
is kind of un-obvious. Maybe we should deprecate that spelling in
favor of something like "HASH_BINARY".

Thoughts?

regards, tom lane

#550Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#549)
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:

But what jumps out at me here is that this sort of error seems way
too easy to make, and evidently way too hard to detect. What can we
do to make it more obvious if one has incorrectly used or omitted
HASH_BLOBS? Both directions of error might easily escape notice on
little-endian hardware.

I thought of a few ideas, all of which have drawbacks:

1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
This seems to just move the problem somewhere else, besides which
it'd require touching an awful lot of callers, and would silently
break third-party callers.

2. Don't allow a default: invent a new HASH_STRING flag, and
require that hash_create() calls specify exactly one of HASH_BLOBS,
HASH_STRING, or HASH_FUNCTION. This doesn't completely fix the
hazard of mindless-copy-and-paste, but I think it might make it
a little more obvious. Still requires touching a lot of calls.

I like (2), for making the bug harder and for greppability. Probably
pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS.

3. Add some sort of heuristic restriction on keysize. A keysize
that's only 4 or 8 bytes almost certainly is not a string.
This doesn't give us much traction for larger keysizes, though.

4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
in string_hash(). I think we could get away with that given that
SQL disallows empty identifiers. However, it would only help to
catch one direction of error (omitting HASH_BLOBS), and it would
only help on big-endian hardware, which is getting harder to find.
Still, we could hope that the buildfarm would detect errors.

It's nontrivial to confirm that the empty-string key can't happen for a given
hash table. (In contrast, what (3) asserts on is usually a compile-time
constant.) I would stop short of adding (4), though it could be okay.

A quick count of grep hits suggests that the large majority of
existing hash_create() calls use HASH_BLOBS, and there might be
only order-of-ten calls that would need to be touched if we
required an explicit HASH_STRING flag. So option #2 is seeming
kind of attractive. Maybe that together with an assertion that
string keys have to exceed 8 or 16 bytes would be enough protection.

Agreed. I expect (2) gives most of the benefit. Requiring 8-byte capacity
should be harmless, and most architectures can zero 8 bytes in one
instruction. Requiring more bytes trades specificity for sensitivity.

A different angle we could think about is that the name "HASH_BLOBS"
is kind of un-obvious. Maybe we should deprecate that spelling in
favor of something like "HASH_BINARY".

With (2) in place, I wouldn't worry about renaming HASH_BLOBS. It's hard to
confuse with HASH_STRINGS or HASH_FUNCTION. If anything, HASH_BLOBS conveys
something more specific. HASH_FUNCTION cases see binary data, but that data
has structure that promotes it out of "blob" status.

#551Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Tom Lane (#549)
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

On 2020-12-13 17:49, Tom Lane wrote:

2. Don't allow a default: invent a new HASH_STRING flag, and
require that hash_create() calls specify exactly one of HASH_BLOBS,
HASH_STRING, or HASH_FUNCTION. This doesn't completely fix the
hazard of mindless-copy-and-paste, but I think it might make it
a little more obvious. Still requires touching a lot of calls.

I think this sounds best, and also expand the documentation of these
flags a bit.

--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/

#552Amit Kapila
amit.kapila16@gmail.com
In reply to: Noah Misch (#550)
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

On Mon, Dec 14, 2020 at 1:36 AM Noah Misch <noah@leadboat.com> wrote:

On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:

But what jumps out at me here is that this sort of error seems way
too easy to make, and evidently way too hard to detect. What can we
do to make it more obvious if one has incorrectly used or omitted
HASH_BLOBS? Both directions of error might easily escape notice on
little-endian hardware.

I thought of a few ideas, all of which have drawbacks:

1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
This seems to just move the problem somewhere else, besides which
it'd require touching an awful lot of callers, and would silently
break third-party callers.

2. Don't allow a default: invent a new HASH_STRING flag, and
require that hash_create() calls specify exactly one of HASH_BLOBS,
HASH_STRING, or HASH_FUNCTION. This doesn't completely fix the
hazard of mindless-copy-and-paste, but I think it might make it
a little more obvious. Still requires touching a lot of calls.

I like (2), for making the bug harder and for greppability. Probably
pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS.

3. Add some sort of heuristic restriction on keysize. A keysize
that's only 4 or 8 bytes almost certainly is not a string.
This doesn't give us much traction for larger keysizes, though.

4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
in string_hash(). I think we could get away with that given that
SQL disallows empty identifiers. However, it would only help to
catch one direction of error (omitting HASH_BLOBS), and it would
only help on big-endian hardware, which is getting harder to find.
Still, we could hope that the buildfarm would detect errors.

It's nontrivial to confirm that the empty-string key can't happen for a given
hash table. (In contrast, what (3) asserts on is usually a compile-time
constant.) I would stop short of adding (4), though it could be okay.

A quick count of grep hits suggests that the large majority of
existing hash_create() calls use HASH_BLOBS, and there might be
only order-of-ten calls that would need to be touched if we
required an explicit HASH_STRING flag. So option #2 is seeming
kind of attractive. Maybe that together with an assertion that
string keys have to exceed 8 or 16 bytes would be enough protection.

Agreed. I expect (2) gives most of the benefit. Requiring 8-byte capacity
should be harmless, and most architectures can zero 8 bytes in one
instruction. Requiring more bytes trades specificity for sensitivity.

+1. I also think in most cases (2) would be sufficient to avoid such
bugs. Adding a restriction on string size might annoy some out-of-core
users who are already using small strings. However, an 8-byte
restriction on string size would still be okay.

--
With Regards,
Amit Kapila.

#553Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#550)
1 attachment(s)
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

Noah Misch <noah@leadboat.com> writes:

On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:

A quick count of grep hits suggests that the large majority of
existing hash_create() calls use HASH_BLOBS, and there might be
only order-of-ten calls that would need to be touched if we
required an explicit HASH_STRING flag. So option #2 is seeming
kind of attractive. Maybe that together with an assertion that
string keys have to exceed 8 or 16 bytes would be enough protection.

Agreed. I expect (2) gives most of the benefit. Requiring 8-byte capacity
should be harmless, and most architectures can zero 8 bytes in one
instruction. Requiring more bytes trades specificity for sensitivity.

Attached is a proposed patch that requires HASH_STRINGS to be stated
explicitly (in the event, there are 13 callers needing that) and insists
on keysize > 8 for string keys.  Examining the now-easily-visible uses
of string keys shows that almost all of them use NAMEDATALEN-sized keys,
or in a few places larger values. Only two are smaller:

1. ShmemIndex uses SHMEM_INDEX_KEYSIZE, which is only set to 48.

2. ResetUnloggedRelationsInDbspaceDir is using OIDCHARS + 1, because
it stores relfilenode OIDs as strings. That seems pretty damfool
to me, so I'm inclined to change it to store binary OIDs instead;
those'd be a third the size (or probably a quarter the size after
alignment padding) and likely faster to hash or compare. But I
didn't do that here, since it's still more than 8. (I did whack
it upside the head to the extent of not storing its temporary
hash table in CacheMemoryContext.)

So it seems to me that insisting on keysize > 8 is fine.
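
For illustration, here is a minimal sketch of a correctly declared
string-keyed table under the proposed API (MyNamedEntry and
make_name_table are hypothetical, not part of the attached patch):

typedef struct MyNamedEntry
{
	char		name[NAMEDATALEN];	/* hash key: a C string */
	Oid			oid;
} MyNamedEntry;

static HTAB *
make_name_table(void)
{
	HASHCTL		ctl;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = NAMEDATALEN;	/* 64 by default, comfortably > 8 */
	ctl.entrysize = sizeof(MyNamedEntry);

	/* string keys must now say HASH_STRINGS explicitly */
	return hash_create("name table", 32, &ctl,
					   HASH_ELEM | HASH_STRINGS);
}

With keysize = sizeof(Oid) instead, the new Assert(info->keysize > 8)
would fire, on the theory that a 4- or 8-byte key is almost certainly
not a string.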

There are a couple of other API oddities that maybe we should think
about while we're here:

* Should we just have a blanket insistence that all callers supply
HASH_ELEM? The default sizes that dynahash.c uses without that are
undocumented and basically useless. We're already asserting that
in the HASH_BLOBS path, which is the majority use-case, and this
patch now asserts it for HASH_STRINGS too.

* The coding convention that the HASHCTL argument struct should be
pre-zeroed seems to have been ignored at a lot of call sites.
I added a memset call to a couple of callers that I was touching
in this patch, but I'm having second thoughts about that. Maybe
we should just rip out all those memsets as pointless, since there's
basically no case where you'd use the memset to fill a field that
you meant to pass as zero. The fact that hash_create() doesn't
read fields it's not told to by a flag means we should not need
the memsets to avoid uninitialized-memory reads.
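
As a minimal sketch of what that rule would permit (MyCacheEntry and
make_oid_cache are hypothetical):

typedef struct MyCacheEntry
{
	Oid			key;
	int			value;
} MyCacheEntry;

static HTAB *
make_oid_cache(void)
{
	HASHCTL		ctl;			/* deliberately not zeroed */

	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(MyCacheEntry);
	ctl.hcxt = CacheMemoryContext;

	/*
	 * hash_create() reads only the fields selected by the flags:
	 * keysize/entrysize for HASH_ELEM, hcxt for HASH_CONTEXT.  The
	 * remaining uninitialized fields are never examined, so no memset
	 * is needed.
	 */
	return hash_create("oid cache", 64, &ctl,
					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
}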

regards, tom lane

Attachments:

invent-HASH_STRINGS-flag-1.patch (text/x-diff; charset=us-ascii)
diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 2dc9e44ae6..8b17fb06eb 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2604,10 +2604,12 @@ createConnHash(void)
 {
 	HASHCTL		ctl;
 
+	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = NAMEDATALEN;
 	ctl.entrysize = sizeof(remoteConnHashEnt);
 
-	return hash_create("Remote Con hash", NUMCONN, &ctl, HASH_ELEM);
+	return hash_create("Remote Con hash", NUMCONN, &ctl,
+					   HASH_ELEM | HASH_STRINGS);
 }
 
 static void
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 85986ec24a..ec7819ca77 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -726,7 +726,7 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
 	crosstab_hash = hash_create("crosstab hash",
 								INIT_CATS,
 								&ctl,
-								HASH_ELEM | HASH_CONTEXT);
+								HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 
 	/* Connect to SPI manager */
 	if ((ret = SPI_connect()) < 0)
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 4b18be5b27..5ba7c2eb3c 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -414,7 +414,7 @@ InitQueryHashTable(void)
 	prepared_queries = hash_create("Prepared Queries",
 								   32,
 								   &hash_ctl,
-								   HASH_ELEM);
+								   HASH_ELEM | HASH_STRINGS);
 }
 
 /*
diff --git a/src/backend/nodes/extensible.c b/src/backend/nodes/extensible.c
index ab04459c55..2fe89fd361 100644
--- a/src/backend/nodes/extensible.c
+++ b/src/backend/nodes/extensible.c
@@ -51,7 +51,8 @@ RegisterExtensibleNodeEntry(HTAB **p_htable, const char *htable_label,
 		ctl.keysize = EXTNODENAME_MAX_LEN;
 		ctl.entrysize = sizeof(ExtensibleNodeEntry);
 
-		*p_htable = hash_create(htable_label, 100, &ctl, HASH_ELEM);
+		*p_htable = hash_create(htable_label, 100, &ctl,
+								HASH_ELEM | HASH_STRINGS);
 	}
 
 	if (strlen(extnodename) >= EXTNODENAME_MAX_LEN)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0c2094f766..f21ab67ae4 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -175,7 +175,9 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
 		memset(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(unlogged_relation_entry);
 		ctl.entrysize = sizeof(unlogged_relation_entry);
-		hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
+		ctl.hcxt = CurrentMemoryContext;
+		hash = hash_create("unlogged hash", 32, &ctl,
+						   HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 
 		/* Scan the directory. */
 		dbspace_dir = AllocateDir(dbspacedirname);
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 97716f6aef..0afd87e075 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -292,7 +292,6 @@ void
 InitShmemIndex(void)
 {
 	HASHCTL		info;
-	int			hash_flags;
 
 	/*
 	 * Create the shared memory shmem index.
@@ -302,13 +301,14 @@ InitShmemIndex(void)
 	 * initializing the ShmemIndex itself.  The special "ShmemIndex" hash
 	 * table name will tell ShmemInitStruct to fake it.
 	 */
+	memset(&info, 0, sizeof(info));
 	info.keysize = SHMEM_INDEX_KEYSIZE;
 	info.entrysize = sizeof(ShmemIndexEnt);
-	hash_flags = HASH_ELEM;
 
 	ShmemIndex = ShmemInitHash("ShmemIndex",
 							   SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
-							   &info, hash_flags);
+							   &info,
+							   HASH_ELEM | HASH_STRINGS);
 }
 
 /*
@@ -329,6 +329,10 @@ InitShmemIndex(void)
  * whose maximum size is certain, this should be equal to max_size; that
  * ensures that no run-time out-of-shared-memory failures can occur.
  *
+ * *infoP and hash_flags should specify at least the entry sizes and key
+ * comparison semantics (see hash_create()).  Flag bits specific to
+ * shared-memory hash tables are added here.
+ *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index 12557ce3af..be0a45b55e 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -3446,7 +3446,7 @@ get_json_object_as_hash(char *json, int len, const char *funcname)
 	tab = hash_create("json object hashtable",
 					  100,
 					  &ctl,
-					  HASH_ELEM | HASH_CONTEXT);
+					  HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 
 	state = palloc0(sizeof(JHashState));
 	sem = palloc0(sizeof(JsonSemAction));
@@ -3838,7 +3838,7 @@ populate_recordset_object_start(void *state)
 	_state->json_hash = hash_create("json object hashtable",
 									100,
 									&ctl,
-									HASH_ELEM | HASH_CONTEXT);
+									HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 }
 
 static void
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index ad582f99a5..87a3154c1a 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -3471,7 +3471,7 @@ set_rtable_names(deparse_namespace *dpns, List *parent_namespaces,
 	names_hash = hash_create("set_rtable_names names",
 							 list_length(dpns->rtable),
 							 &hash_ctl,
-							 HASH_ELEM | HASH_CONTEXT);
+							 HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 	/* Preload the hash table with names appearing in parent_namespaces */
 	foreach(lc, parent_namespaces)
 	{
diff --git a/src/backend/utils/fmgr/dfmgr.c b/src/backend/utils/fmgr/dfmgr.c
index bd779fdaf7..e83e30defe 100644
--- a/src/backend/utils/fmgr/dfmgr.c
+++ b/src/backend/utils/fmgr/dfmgr.c
@@ -686,7 +686,7 @@ find_rendezvous_variable(const char *varName)
 		rendezvousHash = hash_create("Rendezvous variable hash",
 									 16,
 									 &ctl,
-									 HASH_ELEM);
+									 HASH_ELEM | HASH_STRINGS);
 	}
 
 	/* Find or create the hashtable entry for this varName */
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index d14d875c93..07cae638df 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -30,11 +30,12 @@
  * dynahash.c provides support for these types of lookup keys:
  *
  * 1. Null-terminated C strings (truncated if necessary to fit in keysize),
- * compared as though by strcmp().  This is the default behavior.
+ * compared as though by strcmp().  This is selected by specifying the
+ * HASH_STRINGS flag to hash_create.
  *
  * 2. Arbitrary binary data of size keysize, compared as though by memcmp().
  * (Caller must ensure there are no undefined padding bits in the keys!)
- * This is selected by specifying HASH_BLOBS flag to hash_create.
+ * This is selected by specifying the HASH_BLOBS flag to hash_create.
  *
  * 3. More complex key behavior can be selected by specifying user-supplied
  * hashing, comparison, and/or key-copying functions.  At least a hashing
@@ -47,8 +48,8 @@
  *   locks.
  * - Shared memory hashes are allocated in a fixed size area at startup and
  *   are discoverable by name from other processes.
- * - Because entries don't need to be moved in the case of hash conflicts, has
- *   better performance for large entries
+ * - Because entries don't need to be moved in the case of hash conflicts,
+ *   dynahash has better performance for large entries.
  * - Guarantees stable pointers to entries.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
@@ -316,6 +317,12 @@ string_compare(const char *key1, const char *key2, Size keysize)
  *	*info: additional table parameters, as indicated by flags
  *	flags: bitmask indicating which parameters to take from *info
  *
+ * The flags value must include exactly one of HASH_STRINGS, HASH_BLOBS,
+ * or HASH_FUNCTION, to define the key hashing semantics (C strings,
+ * binary blobs, or custom, respectively).  Callers specifying a custom
+ * hash function will likely also want to use HASH_COMPARE, and perhaps
+ * also HASH_KEYCOPY, to control key comparison and copying.
+ *
  * Note: for a shared-memory hashtable, nelem needs to be a pretty good
  * estimate, since we can't expand the table on the fly.  But an unshared
  * hashtable can be expanded on-the-fly, so it's better for nelem to be
@@ -370,9 +377,13 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 	 * Select the appropriate hash function (see comments at head of file).
 	 */
 	if (flags & HASH_FUNCTION)
+	{
+		Assert(!(flags & (HASH_BLOBS | HASH_STRINGS)));
 		hashp->hash = info->hash;
+	}
 	else if (flags & HASH_BLOBS)
 	{
+		Assert(!(flags & HASH_STRINGS));
 		/* We can optimize hashing for common key sizes */
 		Assert(flags & HASH_ELEM);
 		if (info->keysize == sizeof(uint32))
@@ -381,17 +392,30 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 			hashp->hash = tag_hash;
 	}
 	else
-		hashp->hash = string_hash;	/* default hash function */
+	{
+		/*
+		 * string_hash used to be considered the default hash method, and in a
+		 * non-assert build it effectively still is.  But we now consider it
+		 * an assertion error to not say HASH_STRINGS explicitly.  To help
+		 * catch mistaken usage of HASH_STRINGS, we also insist on a
+		 * reasonably long string length: if the keysize is only 4 or 8 bytes,
+		 * it's almost certainly an integer or pointer not a string.
+		 */
+		Assert(flags & HASH_STRINGS);
+		Assert(flags & HASH_ELEM);
+		Assert(info->keysize > 8);
+
+		hashp->hash = string_hash;
+	}
 
 	/*
 	 * If you don't specify a match function, it defaults to string_compare if
-	 * you used string_hash (either explicitly or by default) and to memcmp
-	 * otherwise.
+	 * you used string_hash, and to memcmp otherwise.
 	 *
 	 * Note: explicitly specifying string_hash is deprecated, because this
 	 * might not work for callers in loadable modules on some platforms due to
 	 * referencing a trampoline instead of the string_hash function proper.
-	 * Just let it default, eh?
+	 * Specify HASH_STRINGS instead.
 	 */
 	if (flags & HASH_COMPARE)
 		hashp->match = info->match;
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index ec6f80ee99..a382c4219b 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -111,6 +111,7 @@ EnablePortalManager(void)
 											 "TopPortalContext",
 											 ALLOCSET_DEFAULT_SIZES);
 
+	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = MAX_PORTALNAME_LEN;
 	ctl.entrysize = sizeof(PortalHashEnt);
 
@@ -119,7 +120,7 @@ EnablePortalManager(void)
 	 * create, initially
 	 */
 	PortalHashTable = hash_create("Portal hash", PORTALS_PER_USER,
-								  &ctl, HASH_ELEM);
+								  &ctl, HASH_ELEM | HASH_STRINGS);
 }
 
 /*
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index bebf89b3c4..666ad33567 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -82,7 +82,8 @@ typedef struct HASHCTL
 #define HASH_PARTITION	0x0001	/* Hashtable is used w/partitioned locking */
 #define HASH_SEGMENT	0x0002	/* Set segment size */
 #define HASH_DIRSIZE	0x0004	/* Set directory size (initial and max) */
-#define HASH_ELEM		0x0010	/* Set keysize and entrysize */
+#define HASH_ELEM		0x0008	/* Set keysize and entrysize */
+#define HASH_STRINGS	0x0010	/* Select support functions for string keys */
 #define HASH_BLOBS		0x0020	/* Select support functions for binary keys */
 #define HASH_FUNCTION	0x0040	/* Set user defined hash function */
 #define HASH_COMPARE	0x0080	/* Set user defined comparison function */
@@ -119,7 +120,8 @@ typedef struct
  *
  * Note: It is deprecated for callers of hash_create to explicitly specify
  * string_hash, tag_hash, uint32_hash, or oid_hash.  Just set HASH_BLOBS or
- * not.  Use HASH_FUNCTION only when you want something other than those.
+ * HASH_STRINGS.  Use HASH_FUNCTION only when you want something other than
+ * one of these.
  */
 extern HTAB *hash_create(const char *tabname, long nelem,
 						 HASHCTL *info, int flags);
diff --git a/src/pl/plperl/plperl.c b/src/pl/plperl/plperl.c
index 4de756455d..60f5d66264 100644
--- a/src/pl/plperl/plperl.c
+++ b/src/pl/plperl/plperl.c
@@ -586,7 +586,7 @@ select_perl_context(bool trusted)
 		interp_desc->query_hash = hash_create("PL/Perl queries",
 											  32,
 											  &hash_ctl,
-											  HASH_ELEM);
+											  HASH_ELEM | HASH_STRINGS);
 	}
 
 	/*
diff --git a/src/timezone/pgtz.c b/src/timezone/pgtz.c
index 3f0fb51e91..5240cab022 100644
--- a/src/timezone/pgtz.c
+++ b/src/timezone/pgtz.c
@@ -211,7 +211,7 @@ init_timezone_hashtable(void)
 	timezone_cache = hash_create("Timezones",
 								 4,
 								 &hash_ctl,
-								 HASH_ELEM);
+								 HASH_ELEM | HASH_STRINGS);
 	if (!timezone_cache)
 		return false;
 
#554Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#553)
1 attachment(s)
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

I wrote:

There are a couple of other API oddities that maybe we should think
about while we're here:

* Should we just have a blanket insistence that all callers supply
HASH_ELEM? The default sizes that dynahash.c uses without that are
undocumented and basically useless. We're already asserting that
in the HASH_BLOBS path, which is the majority use-case, and this
patch now asserts it for HASH_STRINGS too.

Here's a follow-up patch for that part, which also tries to respond
a bit to Heikki's complaint about skimpy documentation. While at it,
I const-ified the HASHCTL argument, since there's no need for
hash_create to modify that.
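
A side benefit of the const-ification, as a minimal sketch (MyEntry
and make_table are hypothetical): the parameter struct can now even be
declared static const, leaving nothing to forget to zero:

static HTAB *
make_table(void)
{
	static const HASHCTL ctl = {
		.keysize = sizeof(Oid),
		.entrysize = sizeof(MyEntry),
	};

	return hash_create("my table", 64, &ctl, HASH_ELEM | HASH_BLOBS);
}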

regards, tom lane

Attachments:

require-HASH_ELEM.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 07cae638df..49f21b77bb 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -317,11 +317,20 @@ string_compare(const char *key1, const char *key2, Size keysize)
  *	*info: additional table parameters, as indicated by flags
  *	flags: bitmask indicating which parameters to take from *info
  *
- * The flags value must include exactly one of HASH_STRINGS, HASH_BLOBS,
+ * The flags value *must* include HASH_ELEM.  (Formerly, this was nominally
+ * optional, but the default keysize and entrysize values were useless.)
+ * The flags value must also include exactly one of HASH_STRINGS, HASH_BLOBS,
  * or HASH_FUNCTION, to define the key hashing semantics (C strings,
  * binary blobs, or custom, respectively).  Callers specifying a custom
  * hash function will likely also want to use HASH_COMPARE, and perhaps
  * also HASH_KEYCOPY, to control key comparison and copying.
+ * Another often-used flag is HASH_CONTEXT, to allocate the hash table
+ * under info->hcxt rather than under TopMemoryContext; the default
+ * behavior is only suitable for session-lifespan hash tables.
+ * Other flags bits are special-purpose and seldom used.
+ *
+ * Fields in *info are read only when the associated flags bit is set.
+ * It is not necessary to initialize other fields of *info.
  *
  * Note: for a shared-memory hashtable, nelem needs to be a pretty good
  * estimate, since we can't expand the table on the fly.  But an unshared
@@ -330,11 +339,19 @@ string_compare(const char *key1, const char *key2, Size keysize)
  * large nelem will penalize hash_seq_search speed without buying much.
  */
 HTAB *
-hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
+hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
 
+	/*
+	 * Hash tables now allocate space for key and data, but you have to say
+	 * how much space to allocate.
+	 */
+	Assert(flags & HASH_ELEM);
+	Assert(info->keysize > 0);
+	Assert(info->entrysize >= info->keysize);
+
 	/*
 	 * For shared hash tables, we have a local hash header (HTAB struct) that
 	 * we allocate in TopMemoryContext; all else is in shared memory.
@@ -385,7 +402,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 	{
 		Assert(!(flags & HASH_STRINGS));
 		/* We can optimize hashing for common key sizes */
-		Assert(flags & HASH_ELEM);
 		if (info->keysize == sizeof(uint32))
 			hashp->hash = uint32_hash;
 		else
@@ -402,7 +418,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 		 * it's almost certainly an integer or pointer not a string.
 		 */
 		Assert(flags & HASH_STRINGS);
-		Assert(flags & HASH_ELEM);
 		Assert(info->keysize > 8);
 
 		hashp->hash = string_hash;
@@ -529,16 +544,9 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 		hctl->dsize = info->dsize;
 	}
 
-	/*
-	 * hash table now allocates space for key and data but you have to say how
-	 * much space to allocate
-	 */
-	if (flags & HASH_ELEM)
-	{
-		Assert(info->entrysize >= info->keysize);
-		hctl->keysize = info->keysize;
-		hctl->entrysize = info->entrysize;
-	}
+	/* remember the entry sizes, too */
+	hctl->keysize = info->keysize;
+	hctl->entrysize = info->entrysize;
 
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
@@ -617,10 +625,6 @@ hdefault(HTAB *hashp)
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
 
-	/* rather pointless defaults for key & entry size */
-	hctl->keysize = sizeof(char *);
-	hctl->entrysize = 2 * sizeof(char *);
-
 	hctl->num_partitions = 0;	/* not partitioned */
 
 	/* table has no fixed maximum size */
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 666ad33567..c3daaae92b 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -124,7 +124,7 @@ typedef struct
  * one of these.
  */
 extern HTAB *hash_create(const char *tabname, long nelem,
-						 HASHCTL *info, int flags);
+						 const HASHCTL *info, int flags);
 extern void hash_destroy(HTAB *hashp);
 extern void hash_stats(const char *where, HTAB *hashp);
 extern void *hash_search(HTAB *hashp, const void *keyPtr, HASHACTION action,
#555Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#554)
1 attachment(s)
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

Here's a rolled-up patch that does some further documentation work
and gets rid of the unnecessary memsets as well.

regards, tom lane

Attachments:

hash_create-API-cleanups-3.patch (text/x-diff; charset=us-ascii)
diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 2dc9e44ae6..651227f510 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2607,7 +2607,8 @@ createConnHash(void)
 	ctl.keysize = NAMEDATALEN;
 	ctl.entrysize = sizeof(remoteConnHashEnt);
 
-	return hash_create("Remote Con hash", NUMCONN, &ctl, HASH_ELEM);
+	return hash_create("Remote Con hash", NUMCONN, &ctl,
+					   HASH_ELEM | HASH_STRINGS);
 }
 
 static void
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 70cfdb2c9d..2f00344b7f 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -567,7 +567,6 @@ pgss_shmem_startup(void)
 		pgss->stats.dealloc = 0;
 	}
 
-	memset(&info, 0, sizeof(info));
 	info.keysize = sizeof(pgssHashKey);
 	info.entrysize = sizeof(pgssEntry);
 	pgss_hash = ShmemInitHash("pg_stat_statements hash",
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ab3226287d..66581e5414 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -119,14 +119,11 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	{
 		HASHCTL		ctl;
 
-		MemSet(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(ConnCacheKey);
 		ctl.entrysize = sizeof(ConnCacheEntry);
-		/* allocate ConnectionHash in the cache context */
-		ctl.hcxt = CacheMemoryContext;
 		ConnectionHash = hash_create("postgres_fdw connections", 8,
 									 &ctl,
-									 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+									 HASH_ELEM | HASH_BLOBS);
 
 		/*
 		 * Register some callback functions that manage connection cleanup.
diff --git a/contrib/postgres_fdw/shippable.c b/contrib/postgres_fdw/shippable.c
index 3433c19712..b4766dc5ff 100644
--- a/contrib/postgres_fdw/shippable.c
+++ b/contrib/postgres_fdw/shippable.c
@@ -93,7 +93,6 @@ InitializeShippableCache(void)
 	HASHCTL		ctl;
 
 	/* Create the hash table. */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(ShippableCacheKey);
 	ctl.entrysize = sizeof(ShippableCacheEntry);
 	ShippableCacheHash =
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 85986ec24a..e9a9741154 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -714,7 +714,6 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
 	MemoryContext SPIcontext;
 
 	/* initialize the category hash table */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = MAX_CATNAME_LEN;
 	ctl.entrysize = sizeof(crosstab_HashEnt);
 	ctl.hcxt = per_query_ctx;
@@ -726,7 +725,7 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
 	crosstab_hash = hash_create("crosstab hash",
 								INIT_CATS,
 								&ctl,
-								HASH_ELEM | HASH_CONTEXT);
+								HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 
 	/* Connect to SPI manager */
 	if ((ret = SPI_connect()) < 0)
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 4ad67c88b4..217c199a14 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -76,7 +76,6 @@ gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel)
 	 * nodeBuffersTab hash is association between index blocks and it's
 	 * buffers.
 	 */
-	memset(&hashCtl, 0, sizeof(hashCtl));
 	hashCtl.keysize = sizeof(BlockNumber);
 	hashCtl.entrysize = sizeof(GISTNodeBuffer);
 	hashCtl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494..c77a189907 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -1363,7 +1363,6 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 	bool		found;
 
 	/* Initialize hash tables used to track TIDs */
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(ItemPointerData);
 	hash_ctl.entrysize = sizeof(ItemPointerData);
 	hash_ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 39e33763df..65942cc428 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -266,7 +266,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_cxt = rw_cxt;
 
 	/* Initialize hash tables used to track update chains */
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(TidHashKey);
 	hash_ctl.entrysize = sizeof(UnresolvedTupData);
 	hash_ctl.hcxt = state->rs_cxt;
@@ -824,7 +823,6 @@ logical_begin_heap_rewrite(RewriteState state)
 	state->rs_begin_lsn = GetXLogInsertRecPtr();
 	state->rs_num_rewrite_mappings = 0;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(TransactionId);
 	hash_ctl.entrysize = sizeof(RewriteMappingFile);
 	hash_ctl.hcxt = state->rs_cxt;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 32a3099c1f..e0ca3859a9 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -113,7 +113,6 @@ log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
 		/* create hash table when first needed */
 		HASHCTL		ctl;
 
-		memset(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(xl_invalid_page_key);
 		ctl.entrysize = sizeof(xl_invalid_page);
 
diff --git a/src/backend/catalog/pg_enum.c b/src/backend/catalog/pg_enum.c
index 6a2c6685a0..f2e7bab62a 100644
--- a/src/backend/catalog/pg_enum.c
+++ b/src/backend/catalog/pg_enum.c
@@ -188,7 +188,6 @@ init_enum_blacklist(void)
 {
 	HASHCTL		hash_ctl;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(Oid);
 	hash_ctl.entrysize = sizeof(Oid);
 	hash_ctl.hcxt = TopTransactionContext;
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 17f37eb39f..5c3c78a0e6 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -171,7 +171,6 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
 			   *rel_numparents;
 	ListCell   *l;
 
-	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(Oid);
 	ctl.entrysize = sizeof(SeenRelsEntry);
 	ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index c0763c63e2..e04afd9963 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2375,7 +2375,6 @@ AddEventToPendingNotifies(Notification *n)
 		ListCell   *l;
 
 		/* Create the hash table */
-		MemSet(&hash_ctl, 0, sizeof(hash_ctl));
 		hash_ctl.keysize = sizeof(Notification *);
 		hash_ctl.entrysize = sizeof(NotificationHash);
 		hash_ctl.hash = notification_hash;
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 4b18be5b27..89087a7be3 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -406,15 +406,13 @@ InitQueryHashTable(void)
 {
 	HASHCTL		hash_ctl;
 
-	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
-
 	hash_ctl.keysize = NAMEDATALEN;
 	hash_ctl.entrysize = sizeof(PreparedStatement);
 
 	prepared_queries = hash_create("Prepared Queries",
 								   32,
 								   &hash_ctl,
-								   HASH_ELEM);
+								   HASH_ELEM | HASH_STRINGS);
 }
 
 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 632b34af61..fa2eea8af2 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -1087,7 +1087,6 @@ create_seq_hashtable(void)
 {
 	HASHCTL		ctl;
 
-	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(Oid);
 	ctl.entrysize = sizeof(SeqTableData);
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 86594bd056..97bfc8bd71 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -521,7 +521,6 @@ ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
 	HTAB	   *htab;
 	int			i;
 
-	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(Oid);
 	ctl.entrysize = sizeof(SubplanResultRelHashElem);
 	ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/nodes/extensible.c b/src/backend/nodes/extensible.c
index ab04459c55..3a6cfc44d3 100644
--- a/src/backend/nodes/extensible.c
+++ b/src/backend/nodes/extensible.c
@@ -47,11 +47,11 @@ RegisterExtensibleNodeEntry(HTAB **p_htable, const char *htable_label,
 	{
 		HASHCTL		ctl;
 
-		memset(&ctl, 0, sizeof(HASHCTL));
 		ctl.keysize = EXTNODENAME_MAX_LEN;
 		ctl.entrysize = sizeof(ExtensibleNodeEntry);
 
-		*p_htable = hash_create(htable_label, 100, &ctl, HASH_ELEM);
+		*p_htable = hash_create(htable_label, 100, &ctl,
+								HASH_ELEM | HASH_STRINGS);
 	}
 
 	if (strlen(extnodename) >= EXTNODENAME_MAX_LEN)
diff --git a/src/backend/optimizer/util/predtest.c b/src/backend/optimizer/util/predtest.c
index 0edd873dca..d6e83e5f8e 100644
--- a/src/backend/optimizer/util/predtest.c
+++ b/src/backend/optimizer/util/predtest.c
@@ -1982,7 +1982,6 @@ lookup_proof_cache(Oid pred_op, Oid clause_op, bool refute_it)
 		/* First time through: initialize the hash table */
 		HASHCTL		ctl;
 
-		MemSet(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(OprProofCacheKey);
 		ctl.entrysize = sizeof(OprProofCacheEntry);
 		OprProofCacheHash = hash_create("Btree proof lookup cache", 256,
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 76245c1ff3..9c9a738c80 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -400,7 +400,6 @@ build_join_rel_hash(PlannerInfo *root)
 	ListCell   *l;
 
 	/* Create the hash table */
-	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(Relids);
 	hash_ctl.entrysize = sizeof(JoinHashEntry);
 	hash_ctl.hash = bitmap_hash;
diff --git a/src/backend/parser/parse_oper.c b/src/backend/parser/parse_oper.c
index 6613a3a8f8..e72d3676f1 100644
--- a/src/backend/parser/parse_oper.c
+++ b/src/backend/parser/parse_oper.c
@@ -999,7 +999,6 @@ find_oper_cache_entry(OprCacheKey *key)
 		/* First time through: initialize the hash table */
 		HASHCTL		ctl;
 
-		MemSet(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(OprCacheKey);
 		ctl.entrysize = sizeof(OprCacheEntry);
 		OprCacheHash = hash_create("Operator lookup cache", 256,
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 9a292290ed..5b0a15ac0b 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -286,13 +286,13 @@ CreatePartitionDirectory(MemoryContext mcxt)
 	PartitionDirectory pdir;
 	HASHCTL		ctl;
 
-	MemSet(&ctl, 0, sizeof(HASHCTL));
+	pdir = palloc(sizeof(PartitionDirectoryData));
+	pdir->pdir_mcxt = mcxt;
+
 	ctl.keysize = sizeof(Oid);
 	ctl.entrysize = sizeof(PartitionDirectoryEntry);
 	ctl.hcxt = mcxt;
 
-	pdir = palloc(sizeof(PartitionDirectoryData));
-	pdir->pdir_mcxt = mcxt;
 	pdir->pdir_hash = hash_create("partition directory", 256, &ctl,
 								  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e28944d2f..ed127a1032 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2043,7 +2043,6 @@ do_autovacuum(void)
 	pg_class_desc = CreateTupleDescCopy(RelationGetDescr(classRel));
 
 	/* create hash table for toast <-> main relid mapping */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(Oid);
 	ctl.entrysize = sizeof(av_relation);
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 429c8010ef..a62c6d4d0a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1161,7 +1161,6 @@ CompactCheckpointerRequestQueue(void)
 	skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
 
 	/* Initialize temporary hash table */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(CheckpointerRequest);
 	ctl.entrysize = sizeof(struct CheckpointerSlotMapping);
 	ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7c75a25d21..6b60f293e9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1265,7 +1265,6 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 	HeapTuple	tup;
 	Snapshot	snapshot;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(Oid);
 	hash_ctl.entrysize = sizeof(Oid);
 	hash_ctl.hcxt = CurrentMemoryContext;
@@ -1815,7 +1814,6 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
 		/* First time through - initialize function stat table */
 		HASHCTL		hash_ctl;
 
-		memset(&hash_ctl, 0, sizeof(hash_ctl));
 		hash_ctl.keysize = sizeof(Oid);
 		hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
 		pgStatFunctions = hash_create("Function stat entries",
@@ -1975,7 +1973,6 @@ get_tabstat_entry(Oid rel_id, bool isshared)
 	{
 		HASHCTL		ctl;
 
-		memset(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(Oid);
 		ctl.entrysize = sizeof(TabStatHashEntry);
 
@@ -4994,7 +4991,6 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 	dbentry->stat_reset_timestamp = GetCurrentTimestamp();
 	dbentry->stats_timestamp = 0;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(Oid);
 	hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
 	dbentry->tables = hash_create("Per-database table",
@@ -5423,7 +5419,6 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	/*
 	 * Create the DB hashtable
 	 */
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(Oid);
 	hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
 	hash_ctl.hcxt = pgStatLocalContext;
@@ -5608,7 +5603,6 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 						break;
 				}
 
-				memset(&hash_ctl, 0, sizeof(hash_ctl));
 				hash_ctl.keysize = sizeof(Oid);
 				hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
 				hash_ctl.hcxt = pgStatLocalContext;
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 07aa52977f..f4dbbbe2dd 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -111,7 +111,6 @@ logicalrep_relmap_init(void)
 								  ALLOCSET_DEFAULT_SIZES);
 
 	/* Initialize the relation hash table. */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(LogicalRepRelId);
 	ctl.entrysize = sizeof(LogicalRepRelMapEntry);
 	ctl.hcxt = LogicalRepRelMapContext;
@@ -120,7 +119,6 @@ logicalrep_relmap_init(void)
 								   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
 	/* Initialize the type hash table. */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(Oid);
 	ctl.entrysize = sizeof(LogicalRepTyp);
 	ctl.hcxt = LogicalRepRelMapContext;
@@ -606,7 +604,6 @@ logicalrep_partmap_init(void)
 								  ALLOCSET_DEFAULT_SIZES);
 
 	/* Initialize the relation hash table. */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(Oid);	/* partition OID */
 	ctl.entrysize = sizeof(LogicalRepPartMapEntry);
 	ctl.hcxt = LogicalRepPartMapContext;
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 15dc51a94d..7359fa9df2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1619,8 +1619,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
 		return;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
-
 	hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
 	hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
 	hash_ctl.hcxt = rb->context;
@@ -4116,7 +4114,6 @@ ReorderBufferToastInitHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	Assert(txn->toast_hash == NULL);
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(Oid);
 	hash_ctl.entrysize = sizeof(ReorderBufferToastEnt);
 	hash_ctl.hcxt = rb->context;
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 1904f3471c..6259606537 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -372,7 +372,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	{
 		HASHCTL		ctl;
 
-		memset(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(Oid);
 		ctl.entrysize = sizeof(struct tablesync_start_time_mapping);
 		last_start_times = hash_create("Logical replication table sync worker start times",
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997aed83..49d25b02d7 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -867,22 +867,18 @@ static void
 init_rel_sync_cache(MemoryContext cachectx)
 {
 	HASHCTL		ctl;
-	MemoryContext old_ctxt;
 
 	if (RelationSyncCache != NULL)
 		return;
 
 	/* Make a new hash table for the cache */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(Oid);
 	ctl.entrysize = sizeof(RelationSyncEntry);
 	ctl.hcxt = cachectx;
 
-	old_ctxt = MemoryContextSwitchTo(cachectx);
 	RelationSyncCache = hash_create("logical replication output relation cache",
 									128, &ctl,
 									HASH_ELEM | HASH_CONTEXT | HASH_BLOBS);
-	(void) MemoryContextSwitchTo(old_ctxt);
 
 	Assert(RelationSyncCache != NULL);
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9abc..c5e8707151 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2505,7 +2505,6 @@ InitBufferPoolAccess(void)
 
 	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
 
-	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(int32);
 	hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 6ffd7b3306..cd3475e9e1 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -465,7 +465,6 @@ InitLocalBuffers(void)
 	}
 
 	/* Create the lookup hash table */
-	MemSet(&info, 0, sizeof(info));
 	info.keysize = sizeof(BufferTag);
 	info.entrysize = sizeof(LocalBufferLookupEnt);
 
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0c2094f766..8700f7f19a 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -30,7 +30,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
 
 typedef struct
 {
-	char		oid[OIDCHARS + 1];
+	Oid			reloid;			/* hash key */
 } unlogged_relation_entry;
 
 /*
@@ -172,10 +172,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
 		 * need to be reset.  Otherwise, this cleanup operation would be
 		 * O(n^2).
 		 */
-		memset(&ctl, 0, sizeof(ctl));
-		ctl.keysize = sizeof(unlogged_relation_entry);
+		ctl.keysize = sizeof(Oid);
 		ctl.entrysize = sizeof(unlogged_relation_entry);
-		hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
+		ctl.hcxt = CurrentMemoryContext;
+		hash = hash_create("unlogged relation OIDs", 32, &ctl,
+						   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
 		/* Scan the directory. */
 		dbspace_dir = AllocateDir(dbspacedirname);
@@ -198,9 +199,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
 			 * Put the OID portion of the name into the hash table, if it
 			 * isn't already.
 			 */
-			memset(ent.oid, 0, sizeof(ent.oid));
-			memcpy(ent.oid, de->d_name, oidchars);
-			hash_search(hash, &ent, HASH_ENTER, NULL);
+			ent.reloid = atooid(de->d_name);
+			(void) hash_search(hash, &ent, HASH_ENTER, NULL);
 		}
 
 		/* Done with the first pass. */
@@ -224,7 +224,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
 		{
 			ForkNumber	forkNum;
 			int			oidchars;
-			bool		found;
 			unlogged_relation_entry ent;
 
 			/* Skip anything that doesn't look like a relation data file. */
@@ -238,14 +237,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
 
 			/*
 			 * See whether the OID portion of the name shows up in the hash
-			 * table.
+			 * table.  If so, nuke it!
 			 */
-			memset(ent.oid, 0, sizeof(ent.oid));
-			memcpy(ent.oid, de->d_name, oidchars);
-			hash_search(hash, &ent, HASH_FIND, &found);
-
-			/* If so, nuke it! */
-			if (found)
+			ent.reloid = atooid(de->d_name);
+			if (hash_search(hash, &ent, HASH_FIND, NULL))
 			{
 				snprintf(rm_path, sizeof(rm_path), "%s/%s",
 						 dbspacedirname, de->d_name);
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 97716f6aef..b0fc9f160d 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -292,7 +292,6 @@ void
 InitShmemIndex(void)
 {
 	HASHCTL		info;
-	int			hash_flags;
 
 	/*
 	 * Create the shared memory shmem index.
@@ -304,11 +303,11 @@ InitShmemIndex(void)
 	 */
 	info.keysize = SHMEM_INDEX_KEYSIZE;
 	info.entrysize = sizeof(ShmemIndexEnt);
-	hash_flags = HASH_ELEM;
 
 	ShmemIndex = ShmemInitHash("ShmemIndex",
 							   SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
-							   &info, hash_flags);
+							   &info,
+							   HASH_ELEM | HASH_STRINGS);
 }
 
 /*
@@ -329,6 +328,11 @@ InitShmemIndex(void)
  * whose maximum size is certain, this should be equal to max_size; that
  * ensures that no run-time out-of-shared-memory failures can occur.
  *
+ * *infoP and hash_flags should specify at least the entry sizes and key
+ * comparison semantics (see hash_create()).  Flag bits and values specific
+ * to shared-memory hash tables are added here, except that callers may
+ * choose to specify HASH_PARTITION and/or HASH_FIXED_SIZE.
+ *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 52b2809dac..4ea3cf1f5c 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -81,7 +81,6 @@ InitRecoveryTransactionEnvironment(void)
 	 * Initialize the hash table for tracking the list of locks held by each
 	 * transaction.
 	 */
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(TransactionId);
 	hash_ctl.entrysize = sizeof(RecoveryLockListsEntry);
 	RecoveryLockLists = hash_create("RecoveryLockLists",
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index d86566f455..53472dd21e 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -419,7 +419,6 @@ InitLocks(void)
 	 * Allocate hash table for LOCK structs.  This stores per-locked-object
 	 * information.
 	 */
-	MemSet(&info, 0, sizeof(info));
 	info.keysize = sizeof(LOCKTAG);
 	info.entrysize = sizeof(LOCK);
 	info.num_partitions = NUM_LOCK_PARTITIONS;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 108e652179..26bcce9735 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -342,7 +342,6 @@ init_lwlock_stats(void)
 											 ALLOCSET_DEFAULT_SIZES);
 	MemoryContextAllowInCriticalSection(lwlock_stats_cxt, true);
 
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(lwlock_stats_key);
 	ctl.entrysize = sizeof(lwlock_stats);
 	ctl.hcxt = lwlock_stats_cxt;
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 8a365b400c..e42e131543 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1096,7 +1096,6 @@ InitPredicateLocks(void)
 	 * Allocate hash table for PREDICATELOCKTARGET structs.  This stores
 	 * per-predicate-lock-target information.
 	 */
-	MemSet(&info, 0, sizeof(info));
 	info.keysize = sizeof(PREDICATELOCKTARGETTAG);
 	info.entrysize = sizeof(PREDICATELOCKTARGET);
 	info.num_partitions = NUM_PREDICATELOCK_PARTITIONS;
@@ -1129,7 +1128,6 @@ InitPredicateLocks(void)
 	 * Allocate hash table for PREDICATELOCK structs.  This stores per
 	 * xact-lock-of-a-target information.
 	 */
-	MemSet(&info, 0, sizeof(info));
 	info.keysize = sizeof(PREDICATELOCKTAG);
 	info.entrysize = sizeof(PREDICATELOCK);
 	info.hash = predicatelock_hash;
@@ -1212,7 +1210,6 @@ InitPredicateLocks(void)
 	 * Allocate hash table for SERIALIZABLEXID structs.  This stores per-xid
 	 * information for serializable transactions which have accessed data.
 	 */
-	MemSet(&info, 0, sizeof(info));
 	info.keysize = sizeof(SERIALIZABLEXIDTAG);
 	info.entrysize = sizeof(SERIALIZABLEXID);
 
@@ -1853,7 +1850,6 @@ CreateLocalPredicateLockHash(void)
 
 	/* Initialize the backend-local hash table of parent locks */
 	Assert(LocalPredicateLockHash == NULL);
-	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(PREDICATELOCKTARGETTAG);
 	hash_ctl.entrysize = sizeof(LOCALPREDICATELOCK);
 	LocalPredicateLockHash = hash_create("Local predicate lock",
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..072bdd118f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -154,7 +154,6 @@ smgropen(RelFileNode rnode, BackendId backend)
 		/* First time through: initialize the hash table */
 		HASHCTL		ctl;
 
-		MemSet(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(RelFileNodeBackend);
 		ctl.entrysize = sizeof(SMgrRelationData);
 		SMgrRelationHash = hash_create("smgr relation table", 400,
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 1d635d596c..a49588f6b9 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -150,7 +150,6 @@ InitSync(void)
 											  ALLOCSET_DEFAULT_SIZES);
 		MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
 
-		MemSet(&hash_ctl, 0, sizeof(hash_ctl));
 		hash_ctl.keysize = sizeof(FileTag);
 		hash_ctl.entrysize = sizeof(PendingFsyncEntry);
 		hash_ctl.hcxt = pendingOpsCxt;
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index 2eed0cd137..19e9611a3a 100644
--- a/src/backend/tsearch/ts_typanalyze.c
+++ b/src/backend/tsearch/ts_typanalyze.c
@@ -180,7 +180,6 @@ compute_tsvector_stats(VacAttrStats *stats,
 	 * worry about overflowing the initial size. Also we don't need to pay any
 	 * attention to locking and memory management.
 	 */
-	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(LexemeHashKey);
 	hash_ctl.entrysize = sizeof(TrackItem);
 	hash_ctl.hash = lexeme_hash;
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 4912cabc61..cb2a834193 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -277,7 +277,6 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 	 * worry about overflowing the initial size. Also we don't need to pay any
 	 * attention to locking and memory management.
 	 */
-	MemSet(&elem_hash_ctl, 0, sizeof(elem_hash_ctl));
 	elem_hash_ctl.keysize = sizeof(Datum);
 	elem_hash_ctl.entrysize = sizeof(TrackItem);
 	elem_hash_ctl.hash = element_hash;
@@ -289,7 +288,6 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 							   HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
 
 	/* hashtable for array distinct elements counts */
-	MemSet(&count_hash_ctl, 0, sizeof(count_hash_ctl));
 	count_hash_ctl.keysize = sizeof(int);
 	count_hash_ctl.entrysize = sizeof(DECountItem);
 	count_hash_ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index 12557ce3af..7a25415078 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -3439,14 +3439,13 @@ get_json_object_as_hash(char *json, int len, const char *funcname)
 	JsonLexContext *lex = makeJsonLexContextCstringLen(json, len, GetDatabaseEncoding(), true);
 	JsonSemAction *sem;
 
-	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = NAMEDATALEN;
 	ctl.entrysize = sizeof(JsonHashEntry);
 	ctl.hcxt = CurrentMemoryContext;
 	tab = hash_create("json object hashtable",
 					  100,
 					  &ctl,
-					  HASH_ELEM | HASH_CONTEXT);
+					  HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 
 	state = palloc0(sizeof(JHashState));
 	sem = palloc0(sizeof(JsonSemAction));
@@ -3831,14 +3830,13 @@ populate_recordset_object_start(void *state)
 		return;
 
 	/* Object at level 1: set up a new hash table for this object */
-	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = NAMEDATALEN;
 	ctl.entrysize = sizeof(JsonHashEntry);
 	ctl.hcxt = CurrentMemoryContext;
 	_state->json_hash = hash_create("json object hashtable",
 									100,
 									&ctl,
-									HASH_ELEM | HASH_CONTEXT);
+									HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 }
 
 static void
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index b6d05ac98d..c39d67645c 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1297,7 +1297,6 @@ lookup_collation_cache(Oid collation, bool set_flags)
 		/* First time through, initialize the hash table */
 		HASHCTL		ctl;
 
-		memset(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(Oid);
 		ctl.entrysize = sizeof(collation_cache_entry);
 		collation_cache = hash_create("Collation cache", 100, &ctl,
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 02b1a3868f..5ab134a853 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -2540,7 +2540,6 @@ ri_InitHashTables(void)
 {
 	HASHCTL		ctl;
 
-	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(Oid);
 	ctl.entrysize = sizeof(RI_ConstraintInfo);
 	ri_constraint_cache = hash_create("RI constraint cache",
@@ -2552,14 +2551,12 @@ ri_InitHashTables(void)
 								  InvalidateConstraintCacheCallBack,
 								  (Datum) 0);
 
-	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(RI_QueryKey);
 	ctl.entrysize = sizeof(RI_QueryHashEntry);
 	ri_query_cache = hash_create("RI query cache",
 								 RI_INIT_QUERYHASHSIZE,
 								 &ctl, HASH_ELEM | HASH_BLOBS);
 
-	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(RI_CompareKey);
 	ctl.entrysize = sizeof(RI_CompareHashEntry);
 	ri_compare_cache = hash_create("RI compare cache",
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index ad582f99a5..7d4443e807 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -3464,14 +3464,14 @@ set_rtable_names(deparse_namespace *dpns, List *parent_namespaces,
 	 * We use a hash table to hold known names, so that this process is O(N)
 	 * not O(N^2) for N names.
 	 */
-	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = NAMEDATALEN;
 	hash_ctl.entrysize = sizeof(NameHashEntry);
 	hash_ctl.hcxt = CurrentMemoryContext;
 	names_hash = hash_create("set_rtable_names names",
 							 list_length(dpns->rtable),
 							 &hash_ctl,
-							 HASH_ELEM | HASH_CONTEXT);
+							 HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
 	/* Preload the hash table with names appearing in parent_namespaces */
 	foreach(lc, parent_namespaces)
 	{
diff --git a/src/backend/utils/cache/attoptcache.c b/src/backend/utils/cache/attoptcache.c
index 05ac366b40..934a84e03f 100644
--- a/src/backend/utils/cache/attoptcache.c
+++ b/src/backend/utils/cache/attoptcache.c
@@ -79,7 +79,6 @@ InitializeAttoptCache(void)
 	HASHCTL		ctl;
 
 	/* Initialize the hash table. */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(AttoptCacheKey);
 	ctl.entrysize = sizeof(AttoptCacheEntry);
 	AttoptCacheHash =
diff --git a/src/backend/utils/cache/evtcache.c b/src/backend/utils/cache/evtcache.c
index 0427795395..0877bc7e0e 100644
--- a/src/backend/utils/cache/evtcache.c
+++ b/src/backend/utils/cache/evtcache.c
@@ -118,7 +118,6 @@ BuildEventTriggerCache(void)
 	EventTriggerCacheState = ETCS_REBUILD_STARTED;
 
 	/* Create new hash table. */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(EventTriggerEvent);
 	ctl.entrysize = sizeof(EventTriggerCacheEntry);
 	ctl.hcxt = EventTriggerCacheContext;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 66393becfb..3bd5e18042 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1607,7 +1607,6 @@ LookupOpclassInfo(Oid operatorClassOid,
 		/* First time through: initialize the opclass cache */
 		HASHCTL		ctl;
 
-		MemSet(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(Oid);
 		ctl.entrysize = sizeof(OpClassCacheEnt);
 		OpClassCache = hash_create("Operator class cache", 64,
@@ -3775,7 +3774,6 @@ RelationCacheInitialize(void)
 	/*
 	 * create hashtable that indexes the relcache
 	 */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(Oid);
 	ctl.entrysize = sizeof(RelIdCacheEnt);
 	RelationIdCache = hash_create("Relcache by OID", INITRELCACHESIZE,
diff --git a/src/backend/utils/cache/relfilenodemap.c b/src/backend/utils/cache/relfilenodemap.c
index 0dbdbff603..38e6379974 100644
--- a/src/backend/utils/cache/relfilenodemap.c
+++ b/src/backend/utils/cache/relfilenodemap.c
@@ -110,17 +110,15 @@ InitializeRelfilenodeMap(void)
 	relfilenode_skey[0].sk_attno = Anum_pg_class_reltablespace;
 	relfilenode_skey[1].sk_attno = Anum_pg_class_relfilenode;
 
-	/* Initialize the hash table. */
-	MemSet(&ctl, 0, sizeof(ctl));
-	ctl.keysize = sizeof(RelfilenodeMapKey);
-	ctl.entrysize = sizeof(RelfilenodeMapEntry);
-	ctl.hcxt = CacheMemoryContext;
-
 	/*
 	 * Only create the RelfilenodeMapHash now, so we don't end up partially
 	 * initialized when fmgr_info_cxt() above ERRORs out with an out of memory
 	 * error.
 	 */
+	ctl.keysize = sizeof(RelfilenodeMapKey);
+	ctl.entrysize = sizeof(RelfilenodeMapEntry);
+	ctl.hcxt = CacheMemoryContext;
+
 	RelfilenodeMapHash =
 		hash_create("RelfilenodeMap cache", 64, &ctl,
 					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
diff --git a/src/backend/utils/cache/spccache.c b/src/backend/utils/cache/spccache.c
index e0c3c1b1c1..c8387e2541 100644
--- a/src/backend/utils/cache/spccache.c
+++ b/src/backend/utils/cache/spccache.c
@@ -79,7 +79,6 @@ InitializeTableSpaceCache(void)
 	HASHCTL		ctl;
 
 	/* Initialize the hash table. */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(Oid);
 	ctl.entrysize = sizeof(TableSpaceCacheEntry);
 	TableSpaceCacheHash =
diff --git a/src/backend/utils/cache/ts_cache.c b/src/backend/utils/cache/ts_cache.c
index f9f7912cb8..a2867fac7d 100644
--- a/src/backend/utils/cache/ts_cache.c
+++ b/src/backend/utils/cache/ts_cache.c
@@ -117,7 +117,6 @@ lookup_ts_parser_cache(Oid prsId)
 		/* First time through: initialize the hash table */
 		HASHCTL		ctl;
 
-		MemSet(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(Oid);
 		ctl.entrysize = sizeof(TSParserCacheEntry);
 		TSParserCacheHash = hash_create("Tsearch parser cache", 4,
@@ -215,7 +214,6 @@ lookup_ts_dictionary_cache(Oid dictId)
 		/* First time through: initialize the hash table */
 		HASHCTL		ctl;
 
-		MemSet(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(Oid);
 		ctl.entrysize = sizeof(TSDictionaryCacheEntry);
 		TSDictionaryCacheHash = hash_create("Tsearch dictionary cache", 8,
@@ -365,7 +363,6 @@ init_ts_config_cache(void)
 {
 	HASHCTL		ctl;
 
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(Oid);
 	ctl.entrysize = sizeof(TSConfigCacheEntry);
 	TSConfigCacheHash = hash_create("Tsearch configuration cache", 16,
diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c
index 5883fde367..1e331098c0 100644
--- a/src/backend/utils/cache/typcache.c
+++ b/src/backend/utils/cache/typcache.c
@@ -341,7 +341,6 @@ lookup_type_cache(Oid type_id, int flags)
 		/* First time through: initialize the hash table */
 		HASHCTL		ctl;
 
-		MemSet(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(Oid);
 		ctl.entrysize = sizeof(TypeCacheEntry);
 		TypeCacheHash = hash_create("Type information cache", 64,
@@ -1874,7 +1873,6 @@ assign_record_type_typmod(TupleDesc tupDesc)
 		/* First time through: initialize the hash table */
 		HASHCTL		ctl;
 
-		MemSet(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(TupleDesc);	/* just the pointer */
 		ctl.entrysize = sizeof(RecordCacheEntry);
 		ctl.hash = record_type_typmod_hash;
diff --git a/src/backend/utils/fmgr/dfmgr.c b/src/backend/utils/fmgr/dfmgr.c
index bd779fdaf7..adb31e109f 100644
--- a/src/backend/utils/fmgr/dfmgr.c
+++ b/src/backend/utils/fmgr/dfmgr.c
@@ -680,13 +680,12 @@ find_rendezvous_variable(const char *varName)
 	{
 		HASHCTL		ctl;
 
-		MemSet(&ctl, 0, sizeof(ctl));
 		ctl.keysize = NAMEDATALEN;
 		ctl.entrysize = sizeof(rendezvousHashEntry);
 		rendezvousHash = hash_create("Rendezvous variable hash",
 									 16,
 									 &ctl,
-									 HASH_ELEM);
+									 HASH_ELEM | HASH_STRINGS);
 	}
 
 	/* Find or create the hashtable entry for this varName */
diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c
index 2681b7fbc6..fa5f7ac615 100644
--- a/src/backend/utils/fmgr/fmgr.c
+++ b/src/backend/utils/fmgr/fmgr.c
@@ -565,7 +565,6 @@ record_C_func(HeapTuple procedureTuple,
 	{
 		HASHCTL		hash_ctl;
 
-		MemSet(&hash_ctl, 0, sizeof(hash_ctl));
 		hash_ctl.keysize = sizeof(Oid);
 		hash_ctl.entrysize = sizeof(CFuncHashTabEntry);
 		CFuncHash = hash_create("CFuncHash",
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index d14d875c93..fbd849b8f7 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -30,11 +30,12 @@
  * dynahash.c provides support for these types of lookup keys:
  *
  * 1. Null-terminated C strings (truncated if necessary to fit in keysize),
- * compared as though by strcmp().  This is the default behavior.
+ * compared as though by strcmp().  This is selected by specifying the
+ * HASH_STRINGS flag to hash_create.
  *
  * 2. Arbitrary binary data of size keysize, compared as though by memcmp().
  * (Caller must ensure there are no undefined padding bits in the keys!)
- * This is selected by specifying HASH_BLOBS flag to hash_create.
+ * This is selected by specifying the HASH_BLOBS flag to hash_create.
  *
  * 3. More complex key behavior can be selected by specifying user-supplied
  * hashing, comparison, and/or key-copying functions.  At least a hashing
@@ -47,8 +48,8 @@
  *   locks.
  * - Shared memory hashes are allocated in a fixed size area at startup and
  *   are discoverable by name from other processes.
- * - Because entries don't need to be moved in the case of hash conflicts, has
- *   better performance for large entries
+ * - Because entries don't need to be moved in the case of hash conflicts,
+ *   dynahash has better performance for large entries.
  * - Guarantees stable pointers to entries.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
@@ -316,6 +317,28 @@ string_compare(const char *key1, const char *key2, Size keysize)
  *	*info: additional table parameters, as indicated by flags
  *	flags: bitmask indicating which parameters to take from *info
  *
+ * The flags value *must* include HASH_ELEM.  (Formerly, this was nominally
+ * optional, but the default keysize and entrysize values were useless.)
+ * The flags value must also include exactly one of HASH_STRINGS, HASH_BLOBS,
+ * or HASH_FUNCTION, to define the key hashing semantics (C strings,
+ * binary blobs, or custom, respectively).  Callers specifying a custom
+ * hash function will likely also want to use HASH_COMPARE, and perhaps
+ * also HASH_KEYCOPY, to control key comparison and copying.
+ * Another often-used flag is HASH_CONTEXT, to allocate the hash table
+ * under info->hcxt rather than under TopMemoryContext; the default
+ * behavior is only suitable for session-lifespan hash tables.
+ * Other flags bits are special-purpose and seldom used, except for those
+ * associated with shared-memory hash tables, for which see ShmemInitHash().
+ *
+ * Fields in *info are read only when the associated flags bit is set.
+ * It is not necessary to initialize other fields of *info.
+ * Neither tabname nor *info need persist after the hash_create() call.
+ *
+ * Note: It is deprecated for callers of hash_create() to explicitly specify
+ * string_hash, tag_hash, uint32_hash, or oid_hash.  Just set HASH_BLOBS or
+ * HASH_STRINGS.  Use HASH_FUNCTION only when you want something other than
+ * one of these.
+ *
  * Note: for a shared-memory hashtable, nelem needs to be a pretty good
  * estimate, since we can't expand the table on the fly.  But an unshared
  * hashtable can be expanded on-the-fly, so it's better for nelem to be
@@ -323,11 +346,19 @@ string_compare(const char *key1, const char *key2, Size keysize)
  * large nelem will penalize hash_seq_search speed without buying much.
  */
 HTAB *
-hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
+hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
 
+	/*
+	 * Hash tables now allocate space for key and data, but you have to say
+	 * how much space to allocate.
+	 */
+	Assert(flags & HASH_ELEM);
+	Assert(info->keysize > 0);
+	Assert(info->entrysize >= info->keysize);
+
 	/*
 	 * For shared hash tables, we have a local hash header (HTAB struct) that
 	 * we allocate in TopMemoryContext; all else is in shared memory.
@@ -370,28 +401,43 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 	 * Select the appropriate hash function (see comments at head of file).
 	 */
 	if (flags & HASH_FUNCTION)
+	{
+		Assert(!(flags & (HASH_BLOBS | HASH_STRINGS)));
 		hashp->hash = info->hash;
+	}
 	else if (flags & HASH_BLOBS)
 	{
+		Assert(!(flags & HASH_STRINGS));
 		/* We can optimize hashing for common key sizes */
-		Assert(flags & HASH_ELEM);
 		if (info->keysize == sizeof(uint32))
 			hashp->hash = uint32_hash;
 		else
 			hashp->hash = tag_hash;
 	}
 	else
-		hashp->hash = string_hash;	/* default hash function */
+	{
+		/*
+		 * string_hash used to be considered the default hash method, and in a
+		 * non-assert build it effectively still is.  But we now consider it
+		 * an assertion error to not say HASH_STRINGS explicitly.  To help
+		 * catch mistaken usage of HASH_STRINGS, we also insist on a
+		 * reasonably long string length: if the keysize is only 4 or 8 bytes,
+		 * it's almost certainly an integer or pointer not a string.
+		 */
+		Assert(flags & HASH_STRINGS);
+		Assert(info->keysize > 8);
+
+		hashp->hash = string_hash;
+	}
 
 	/*
 	 * If you don't specify a match function, it defaults to string_compare if
-	 * you used string_hash (either explicitly or by default) and to memcmp
-	 * otherwise.
+	 * you used string_hash, and to memcmp otherwise.
 	 *
 	 * Note: explicitly specifying string_hash is deprecated, because this
 	 * might not work for callers in loadable modules on some platforms due to
 	 * referencing a trampoline instead of the string_hash function proper.
-	 * Just let it default, eh?
+	 * Specify HASH_STRINGS instead.
 	 */
 	if (flags & HASH_COMPARE)
 		hashp->match = info->match;
@@ -505,16 +551,9 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 		hctl->dsize = info->dsize;
 	}
 
-	/*
-	 * hash table now allocates space for key and data but you have to say how
-	 * much space to allocate
-	 */
-	if (flags & HASH_ELEM)
-	{
-		Assert(info->entrysize >= info->keysize);
-		hctl->keysize = info->keysize;
-		hctl->entrysize = info->entrysize;
-	}
+	/* remember the entry sizes, too */
+	hctl->keysize = info->keysize;
+	hctl->entrysize = info->entrysize;
 
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
@@ -593,10 +632,6 @@ hdefault(HTAB *hashp)
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
 
-	/* rather pointless defaults for key & entry size */
-	hctl->keysize = sizeof(char *);
-	hctl->entrysize = 2 * sizeof(char *);
-
 	hctl->num_partitions = 0;	/* not partitioned */
 
 	/* table has no fixed maximum size */
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index ec6f80ee99..283dfe2d9e 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -119,7 +119,7 @@ EnablePortalManager(void)
 	 * create, initially
 	 */
 	PortalHashTable = hash_create("Portal hash", PORTALS_PER_USER,
-								  &ctl, HASH_ELEM);
+								  &ctl, HASH_ELEM | HASH_STRINGS);
 }
 
 /*
diff --git a/src/backend/utils/time/combocid.c b/src/backend/utils/time/combocid.c
index 4ee9ef0ffe..9626f98100 100644
--- a/src/backend/utils/time/combocid.c
+++ b/src/backend/utils/time/combocid.c
@@ -223,7 +223,6 @@ GetComboCommandId(CommandId cmin, CommandId cmax)
 		sizeComboCids = CCID_ARRAY_SIZE;
 		usedComboCids = 0;
 
-		memset(&hash_ctl, 0, sizeof(hash_ctl));
 		hash_ctl.keysize = sizeof(ComboCidKeyData);
 		hash_ctl.entrysize = sizeof(ComboCidEntryData);
 		hash_ctl.hcxt = TopTransactionContext;
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index bebf89b3c4..13c6602217 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -64,25 +64,36 @@ typedef struct HTAB HTAB;
 /* Only those fields indicated by hash_flags need be set */
 typedef struct HASHCTL
 {
+	/* Used if HASH_PARTITION flag is set: */
 	long		num_partitions; /* # partitions (must be power of 2) */
+	/* Used if HASH_SEGMENT flag is set: */
 	long		ssize;			/* segment size */
+	/* Used if HASH_DIRSIZE flag is set: */
 	long		dsize;			/* (initial) directory size */
 	long		max_dsize;		/* limit to dsize if dir size is limited */
+	/* Used if HASH_ELEM flag is set (which is now required): */
 	Size		keysize;		/* hash key length in bytes */
 	Size		entrysize;		/* total user element size in bytes */
+	/* Used if HASH_FUNCTION flag is set: */
 	HashValueFunc hash;			/* hash function */
+	/* Used if HASH_COMPARE flag is set: */
 	HashCompareFunc match;		/* key comparison function */
+	/* Used if HASH_KEYCOPY flag is set: */
 	HashCopyFunc keycopy;		/* key copying function */
+	/* Used if HASH_ALLOC flag is set: */
 	HashAllocFunc alloc;		/* memory allocator */
+	/* Used if HASH_CONTEXT flag is set: */
 	MemoryContext hcxt;			/* memory context to use for allocations */
+	/* Used if HASH_SHARED_MEM flag is set: */
 	HASHHDR    *hctl;			/* location of header in shared mem */
 } HASHCTL;
 
-/* Flags to indicate which parameters are supplied */
+/* Flag bits for hash_create; most indicate which parameters are supplied */
 #define HASH_PARTITION	0x0001	/* Hashtable is used w/partitioned locking */
 #define HASH_SEGMENT	0x0002	/* Set segment size */
 #define HASH_DIRSIZE	0x0004	/* Set directory size (initial and max) */
-#define HASH_ELEM		0x0010	/* Set keysize and entrysize */
+#define HASH_ELEM		0x0008	/* Set keysize and entrysize (now required!) */
+#define HASH_STRINGS	0x0010	/* Select support functions for string keys */
 #define HASH_BLOBS		0x0020	/* Select support functions for binary keys */
 #define HASH_FUNCTION	0x0040	/* Set user defined hash function */
 #define HASH_COMPARE	0x0080	/* Set user defined comparison function */
@@ -93,7 +104,6 @@ typedef struct HASHCTL
 #define HASH_ATTACH		0x1000	/* Do not initialize hctl */
 #define HASH_FIXED_SIZE 0x2000	/* Initial size is a hard limit */
 
-
 /* max_dsize value to indicate expansible directory */
 #define NO_MAX_DSIZE			(-1)
 
@@ -116,13 +126,9 @@ typedef struct
 
 /*
  * prototypes for functions in dynahash.c
- *
- * Note: It is deprecated for callers of hash_create to explicitly specify
- * string_hash, tag_hash, uint32_hash, or oid_hash.  Just set HASH_BLOBS or
- * not.  Use HASH_FUNCTION only when you want something other than those.
  */
 extern HTAB *hash_create(const char *tabname, long nelem,
-						 HASHCTL *info, int flags);
+						 const HASHCTL *info, int flags);
 extern void hash_destroy(HTAB *hashp);
 extern void hash_stats(const char *where, HTAB *hashp);
 extern void *hash_search(HTAB *hashp, const void *keyPtr, HASHACTION action,
diff --git a/src/pl/plperl/plperl.c b/src/pl/plperl/plperl.c
index 4de756455d..6299adf71a 100644
--- a/src/pl/plperl/plperl.c
+++ b/src/pl/plperl/plperl.c
@@ -458,7 +458,6 @@ _PG_init(void)
 	/*
 	 * Create hash tables.
 	 */
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(Oid);
 	hash_ctl.entrysize = sizeof(plperl_interp_desc);
 	plperl_interp_hash = hash_create("PL/Perl interpreters",
@@ -466,7 +465,6 @@ _PG_init(void)
 									 &hash_ctl,
 									 HASH_ELEM | HASH_BLOBS);
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(plperl_proc_key);
 	hash_ctl.entrysize = sizeof(plperl_proc_ptr);
 	plperl_proc_hash = hash_create("PL/Perl procedures",
@@ -580,13 +578,12 @@ select_perl_context(bool trusted)
 	{
 		HASHCTL		hash_ctl;
 
-		memset(&hash_ctl, 0, sizeof(hash_ctl));
 		hash_ctl.keysize = NAMEDATALEN;
 		hash_ctl.entrysize = sizeof(plperl_query_entry);
 		interp_desc->query_hash = hash_create("PL/Perl queries",
 											  32,
 											  &hash_ctl,
-											  HASH_ELEM);
+											  HASH_ELEM | HASH_STRINGS);
 	}
 
 	/*
diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index b610b28d70..555da952e1 100644
--- a/src/pl/plpgsql/src/pl_comp.c
+++ b/src/pl/plpgsql/src/pl_comp.c
@@ -2567,7 +2567,6 @@ plpgsql_HashTableInit(void)
 	/* don't allow double-initialization */
 	Assert(plpgsql_HashTable == NULL);
 
-	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(PLpgSQL_func_hashkey);
 	ctl.entrysize = sizeof(plpgsql_HashEnt);
 	plpgsql_HashTable = hash_create("PLpgSQL function hash",
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index ccbc50fc45..112f6ab0ae 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -4058,7 +4058,6 @@ plpgsql_estate_setup(PLpgSQL_execstate *estate,
 	{
 		estate->simple_eval_estate = simple_eval_estate;
 		/* Private cast hash just lives in function's main context */
-		memset(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(plpgsql_CastHashKey);
 		ctl.entrysize = sizeof(plpgsql_CastHashEntry);
 		ctl.hcxt = CurrentMemoryContext;
@@ -4077,7 +4076,6 @@ plpgsql_estate_setup(PLpgSQL_execstate *estate,
 			shared_cast_context = AllocSetContextCreate(TopMemoryContext,
 														"PLpgSQL cast info",
 														ALLOCSET_DEFAULT_SIZES);
-			memset(&ctl, 0, sizeof(ctl));
 			ctl.keysize = sizeof(plpgsql_CastHashKey);
 			ctl.entrysize = sizeof(plpgsql_CastHashEntry);
 			ctl.hcxt = shared_cast_context;
diff --git a/src/pl/plpython/plpy_plpymodule.c b/src/pl/plpython/plpy_plpymodule.c
index 7f54d093ac..0365acc95b 100644
--- a/src/pl/plpython/plpy_plpymodule.c
+++ b/src/pl/plpython/plpy_plpymodule.c
@@ -214,7 +214,6 @@ PLy_add_exceptions(PyObject *plpy)
 	PLy_exc_spi_error = PLy_create_exception("plpy.SPIError", NULL, NULL,
 											 "SPIError", plpy);
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(int);
 	hash_ctl.entrysize = sizeof(PLyExceptionEntry);
 	PLy_spi_exceptions = hash_create("PL/Python SPI exceptions", 256,
diff --git a/src/pl/plpython/plpy_procedure.c b/src/pl/plpython/plpy_procedure.c
index 1f05c633ef..b7c0b5cebe 100644
--- a/src/pl/plpython/plpy_procedure.c
+++ b/src/pl/plpython/plpy_procedure.c
@@ -34,7 +34,6 @@ init_procedure_caches(void)
 {
 	HASHCTL		hash_ctl;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(PLyProcedureKey);
 	hash_ctl.entrysize = sizeof(PLyProcedureEntry);
 	PLy_procedure_cache = hash_create("PL/Python procedures", 32, &hash_ctl,
diff --git a/src/pl/tcl/pltcl.c b/src/pl/tcl/pltcl.c
index a3a2dc8e89..e11837559d 100644
--- a/src/pl/tcl/pltcl.c
+++ b/src/pl/tcl/pltcl.c
@@ -439,7 +439,6 @@ _PG_init(void)
 	/************************************************************
 	 * Create the hash table for working interpreters
 	 ************************************************************/
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(Oid);
 	hash_ctl.entrysize = sizeof(pltcl_interp_desc);
 	pltcl_interp_htab = hash_create("PL/Tcl interpreters",
@@ -450,7 +449,6 @@ _PG_init(void)
 	/************************************************************
 	 * Create the hash table for function lookup
 	 ************************************************************/
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(pltcl_proc_key);
 	hash_ctl.entrysize = sizeof(pltcl_proc_ptr);
 	pltcl_proc_htab = hash_create("PL/Tcl functions",
diff --git a/src/timezone/pgtz.c b/src/timezone/pgtz.c
index 3f0fb51e91..4a360f5077 100644
--- a/src/timezone/pgtz.c
+++ b/src/timezone/pgtz.c
@@ -203,15 +203,13 @@ init_timezone_hashtable(void)
 {
 	HASHCTL		hash_ctl;
 
-	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
-
 	hash_ctl.keysize = TZ_STRLEN_MAX + 1;
 	hash_ctl.entrysize = sizeof(pg_tz_cache);
 
 	timezone_cache = hash_create("Timezones",
 								 4,
 								 &hash_ctl,
-								 HASH_ELEM);
+								 HASH_ELEM | HASH_STRINGS);
 	if (!timezone_cache)
 		return false;
 
#556Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#555)
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

On Mon, Dec 14, 2020 at 01:59:03PM -0500, Tom Lane wrote:

* Should we just have a blanket insistence that all callers supply
HASH_ELEM? The default sizes that dynahash.c uses without that are
undocumented and basically useless.

+1

we should just rip out all those memsets as pointless, since there's
basically no case where you'd use the memset to fill a field that
you meant to pass as zero. The fact that hash_create() doesn't
read fields it's not told to by a flag means we should not need
the memsets to avoid uninitialized-memory reads.

On Mon, Dec 14, 2020 at 06:55:20PM -0500, Tom Lane wrote:

Here's a rolled-up patch that does some further documentation work
and gets rid of the unnecessary memsets as well.

+1 on removing the memset() calls. That said, it's not a big deal if more
creep in over time; it doesn't qualify as a project policy violation.
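
Concretely, a call site under the new rules reduces to something like
this (a minimal sketch only; "MyCacheEntry" and "myhash" are invented
names for illustration):

	HASHCTL		ctl;

	/* No MemSet(&ctl, 0, sizeof(ctl)) needed: hash_create() reads only
	 * the fields selected by the flags we pass. */
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(MyCacheEntry);	/* hypothetical entry struct */

	/* HASH_ELEM is now mandatory, plus exactly one of HASH_STRINGS,
	 * HASH_BLOBS, or HASH_FUNCTION. */
	myhash = hash_create("My OID cache", 64, &ctl,
						 HASH_ELEM | HASH_BLOBS);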

@@ -329,6 +328,11 @@ InitShmemIndex(void)
* whose maximum size is certain, this should be equal to max_size; that
* ensures that no run-time out-of-shared-memory failures can occur.
*
+ * *infoP and hash_flags should specify at least the entry sizes and key

s/should/must/

#557Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#556)
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

Noah Misch <noah@leadboat.com> writes:

On Mon, Dec 14, 2020 at 01:59:03PM -0500, Tom Lane wrote:

Here's a rolled-up patch that does some further documentation work
and gets rid of the unnecessary memsets as well.

+1 on removing the memset() calls. That said, it's not a big deal if more
creep in over time; it doesn't qualify as a project policy violation.

Right, that part is just neatnik-ism. Neither the calls with memset
nor the ones without are buggy.

+ * *infoP and hash_flags should specify at least the entry sizes and key

s/should/must/

OK; thanks for reviewing!

regards, tom lane

#558Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#548)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tom Lane has raised a complaint on pgsql-committers [1] about one of
the commits related to this work [2]. The new buildfarm member wrasse is
showing this warning:

"/export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/reorderbuffer.c",
line 2510: Warning: Likely null pointer dereference (*(curtxn+272)):
ReorderBufferProcessTXN

The warning is for this line:
	curtxn->concurrent_abort = true;

Now, we can simply fix this warning by adding an if check like:
if (curtxn)
	curtxn->concurrent_abort = true;

However, on further discussion, it seems that is not sufficient here,
because a callback can throw that same error code
(ERRCODE_TRANSACTION_ROLLBACK) for a completely different scenario, in
which case we would set the concurrent_abort flag incorrectly. I think we
need a stronger check here, so that we set the concurrent_abort flag (and
do the other work in that branch) only when we are decoding non-committed
xacts. The idea I have is to additionally check that we are decoding a
streaming or prepared transaction (the same check as we have for setting
curtxn), or alternatively to check whether CheckXidAlive is a valid
transaction id. What do you think?
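
In code terms, the two candidate checks would look roughly like this (a
sketch against the error-handling block in ReorderBufferProcessTXN, not a
tested patch):

	/* Option 1: the same condition we use when setting curtxn */
	if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK &&
		(stream_started || rbtxn_prepared(txn)))
		curtxn->concurrent_abort = true;

	/* Option 2: insist that CheckXidAlive still identifies the xact
	 * being decoded */
	if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK &&
		TransactionIdIsValid(CheckXidAlive))
		curtxn->concurrent_abort = true;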

[1]: /messages/by-id/2752962.1619568098@sss.pgh.pa.us
[2]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7259736a6e5b7c7588fff9578370736a6648acbb

--
With Regards,
Amit Kapila.

#559Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#558)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Tom Lane has raised a complaint on pgsql-committers [1] about one of
the commits related to this work [2]. The new buildfarm member wrasse is
showing this warning:

"/export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/reorderbuffer.c",
line 2510: Warning: Likely null pointer dereference (*(curtxn+272)):
ReorderBufferProcessTXN

The warning is for this line:
	curtxn->concurrent_abort = true;

Now, we can simply fix this warning by adding an if check like:
if (curtxn)
	curtxn->concurrent_abort = true;

However, on further discussion, it seems that is not sufficient here,
because a callback can throw that same error code
(ERRCODE_TRANSACTION_ROLLBACK) for a completely different scenario, in
which case we would set the concurrent_abort flag incorrectly. I think we
need a stronger check here, so that we set the concurrent_abort flag (and
do the other work in that branch) only when we are decoding non-committed
xacts.

That makes sense.

The idea I have is to additionally check that we are decoding a
streaming or prepared transaction (the same check as we have for setting
curtxn), or alternatively to check whether CheckXidAlive is a valid
transaction id. What do you think?

I think a check based on CheckXidAlive looks good to me. This will
protect against the case where a similar error is raised from some other
path, as you mentioned above.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#560Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#559)
1 attachment(s)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The idea I have is to additionally check that we are decoding a
streaming or prepared transaction (the same check as we have for setting
curtxn), or alternatively to check whether CheckXidAlive is a valid
transaction id. What do you think?

I think a check based on CheckXidAlive looks good to me. This will
protect against the case where a similar error is raised from some other
path, as you mentioned above.

We can't use CheckXidAlive because it is reset by that time. So, I
used the other approach which led to the attached.

--
With Regards,
Amit Kapila.

Attachments:

v1-0001-Tighten-the-concurrent-abort-check-during-decodin.patch (application/octet-stream)
From 127bacc3c20dc35fcb874ae76d11ec006784075b Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 30 Apr 2021 14:56:43 +0530
Subject: [PATCH v1] Tighten the concurrent abort check during decoding.

During decoding of an in-progress or prepared transaction, we detect
concurrent abort with an error code ERRCODE_TRANSACTION_ROLLBACK. That is
not sufficient because a callback can decide to throw that error code
at other times as well.
---
 .../replication/logical/reorderbuffer.c       | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e1e17962e7d..7ebfb3007f4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2491,17 +2491,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * abort of the (sub)transaction we are streaming or preparing. We
 		 * need to do the cleanup and return gracefully on this error, see
 		 * SetupCheckXidLive.
+		 *
+		 * This error code can be thrown by one of the callbacks we call during
+		 * decoding so we need to ensure that we return gracefully only when we are
+		 * sending the data in streaming mode and the streaming is not finished yet
+		 * or when we are sending the data out on a PREPARE during a two-phase
+		 * commit.
 		 */
-		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK &&
+			(stream_started || rbtxn_prepared(txn)))
 		{
-			/*
-			 * This error can occur either when we are sending the data in
-			 * streaming mode and the streaming is not finished yet or when we
-			 * are sending the data out on a PREPARE during a two-phase
-			 * commit.
-			 */
-			Assert(streaming || rbtxn_prepared(txn));
-			Assert(stream_started || rbtxn_prepared(txn));
+			/* curtxn must be set for streaming or prepared transactions */
+			Assert(curtxn);
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
-- 
2.28.0.windows.1

#561Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#560)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Apr 30, 2021 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The idea I have is to additionally check that we are decoding a
streaming or prepared transaction (the same check as we have for setting
curtxn), or alternatively to check whether CheckXidAlive is a valid
transaction id. What do you think?

I think a check based on CheckXidAlive looks good to me. This will
protect against the case where a similar error is raised from some other
path, as you mentioned above.

We can't use CheckXidAlive because it is reset by that time.

Right.

So, I used the other approach which led to the attached.

The patch looks fine to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#562Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#561)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Fri, Apr 30, 2021 at 7:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

So, I used the other approach which led to the attached.

The patch looks fine to me.

Thanks, pushed!

--
With Regards,
Amit Kapila.

#563Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#562)
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Thu, May 6, 2021 at 9:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Apr 30, 2021 at 7:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

So, I

used the other approach which led to the attached.

The patch looks fine to me.

Thanks, pushed!

Thanks!

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com